1
00:00:00,120 --> 00:00:02,500
The following content is
provided under a Creative

2
00:00:02,500 --> 00:00:03,910
Commons license.

3
00:00:03,910 --> 00:00:06,950
Your support will help MIT
OpenCourseWare continue to

4
00:00:06,950 --> 00:00:10,600
offer high quality educational
resources for free.

5
00:00:10,600 --> 00:00:13,500
To make a donation or view
additional materials from

6
00:00:13,500 --> 00:00:17,430
hundreds of MIT courses, visit
MIT OpenCourseWare at

7
00:00:17,430 --> 00:00:18,680
ocw.mit.edu.

8
00:00:23,660 --> 00:00:25,450
PROFESSOR: Let's get started.

9
00:00:25,450 --> 00:00:29,750
So today, I'm going to talk
a little bit more about

10
00:00:29,750 --> 00:00:33,200
performance issues in
parallelization.

11
00:00:33,200 --> 00:00:36,860
A little bit more out
of the [INAUDIBLE]

12
00:00:36,860 --> 00:00:39,170
to what people are
doing otherwise.

13
00:00:39,170 --> 00:00:44,030
So normally, what we have done
so far is look at Cilk.

14
00:00:44,030 --> 00:00:46,100
It provides a very robust
environment for

15
00:00:46,100 --> 00:00:46,930
parallelization.

16
00:00:46,930 --> 00:00:50,550
It hides many issues and
eliminates many of the

17
00:00:50,550 --> 00:00:56,120
problems out there if you find
other areas of parallelization

18
00:00:56,120 --> 00:00:57,710
that you deal with.

19
00:00:57,710 --> 00:01:00,290
And in the last lectures, we looked
at things like cache

20
00:01:00,290 --> 00:01:01,140
[UNINTELLIGIBLE]

21
00:01:01,140 --> 00:01:04,269
algorithms, algorithmic
issues, like

22
00:01:04,269 --> 00:01:06,710
looking at work and span.

23
00:01:06,710 --> 00:01:08,790
And in fact, in your projects,
you're going to use all these

24
00:01:08,790 --> 00:01:13,660
nice concepts to get you nice
parallel running code.

25
00:01:13,660 --> 00:01:20,640
But if you look at a lot of
this code [UNINTELLIGIBLE]

26
00:01:20,640 --> 00:01:24,150
the way people normally
parallelize code outside

27
00:01:24,150 --> 00:01:25,800
probably Cilk.

28
00:01:25,800 --> 00:01:30,270
And there are a lot of other
issues that arise, things like

29
00:01:30,270 --> 00:01:33,160
synchronization issues
and memory issues.

30
00:01:33,160 --> 00:01:36,380
Today, I think we are going to
focus mostly on memory issues.

31
00:01:36,380 --> 00:01:39,450
And we are going to use
OpenMP instead of [INAUDIBLE].

32
00:01:39,450 --> 00:01:42,400
And most of these issues
will show up in

33
00:01:42,400 --> 00:01:43,710
Cilk sometimes, too.

34
00:01:43,710 --> 00:01:45,990
But Cilk tries to hide
them from you.

35
00:01:45,990 --> 00:01:46,920
There's a layer of abstraction.

36
00:01:46,920 --> 00:01:49,940
And it's hard to kind of get
to those issues in there.

37
00:01:49,940 --> 00:01:53,200
So we are going to look at
this thing called OpenMP.

38
00:01:53,200 --> 00:01:55,430
So today, we are going to
address things like

39
00:01:55,430 --> 00:01:58,820
granularity of parallelism.

40
00:01:58,820 --> 00:02:00,140
There are so many things
that just went out

41
00:02:00,140 --> 00:02:01,820
on the page, I guess.

42
00:02:01,820 --> 00:02:05,060
True sharing, false sharing,
load balancing issues, the

43
00:02:05,060 --> 00:02:06,170
[UNINTELLIGIBLE].

44
00:02:06,170 --> 00:02:10,570
So from the license and keep
talking about that we want to

45
00:02:10,570 --> 00:02:13,550
be out of all this, not dealing
with Voodoo parameters.

46
00:02:13,550 --> 00:02:16,350
Today, we actually are dealing
mainly with Voodoo.

47
00:02:16,350 --> 00:02:20,460
So I guess this should
be basically

48
00:02:20,460 --> 00:02:21,670
the Halloween lecture.

49
00:02:21,670 --> 00:02:24,703
So we are all about Voodoo today
and see how we can deal

50
00:02:24,703 --> 00:02:27,760
with Voodoo issues.

51
00:02:27,760 --> 00:02:32,690
So if you look at a Cilk
program, here is a nice simple

52
00:02:32,690 --> 00:02:34,740
matrix multiply, seem
to be [INAUDIBLE]

53
00:02:34,740 --> 00:02:36,320
example these days.

54
00:02:36,320 --> 00:02:39,060
What you can do is you can put
a cilk_for on these two

55
00:02:39,060 --> 00:02:42,070
loops and get a nice parallel
performance.

56
00:02:42,070 --> 00:02:44,870
However, [UNINTELLIGIBLE]

57
00:02:44,870 --> 00:02:47,270
from where how the memory
is arranged is

58
00:02:47,270 --> 00:02:48,820
up to the Cilk scheduler.

59
00:02:48,820 --> 00:02:50,980
Cilk scheduler is doing
some work stealing.

60
00:02:50,980 --> 00:02:54,420
Depending on how the work gets
distributed, the processors will

61
00:02:54,420 --> 00:02:56,090
get work, and it will happen.

62
00:02:56,090 --> 00:02:58,370
Hopefully, everything
will go nicely.

63
00:02:58,370 --> 00:03:02,270
And so what that means is it
hides the distribution and load

64
00:03:02,270 --> 00:03:04,060
balancing issues.

65
00:03:04,060 --> 00:03:07,640
So it's nice if you have access
to Cilk, but many other

66
00:03:07,640 --> 00:03:09,050
[UNINTELLIGIBLE] you
might not be.

67
00:03:09,050 --> 00:03:12,320
And even within Cilk, some of
these issues might show up.

68
00:03:12,320 --> 00:03:15,390
So what we are going
to do is step one level

69
00:03:15,390 --> 00:03:17,820
below the Cilk scheduler.

70
00:03:17,820 --> 00:03:22,470
So there's this system
called OpenMP.

71
00:03:22,470 --> 00:03:25,640
It's a more simplified
model of parallelism.

72
00:03:25,640 --> 00:03:29,665
So what it tries to do is
instead of giving this very

73
00:03:29,665 --> 00:03:30,460
[UNINTELLIGIBLE]

74
00:03:30,460 --> 00:03:33,930
system, it basically gives you
direct access to the

75
00:03:33,930 --> 00:03:35,110
processors.

76
00:03:35,110 --> 00:03:38,280
So what that means is there's
normally what we call a

77
00:03:38,280 --> 00:03:39,210
fork-join model.

78
00:03:39,210 --> 00:03:39,520
[UNINTELLIGIBLE]

79
00:03:39,520 --> 00:03:43,580
we have with Cilk, basically.

80
00:03:43,580 --> 00:03:46,490
You can fork into different
workers and join.

81
00:03:46,490 --> 00:03:49,750
And more or less, you can
actually bind these workers to

82
00:03:49,750 --> 00:03:50,420
[UNINTELLIGIBLE]

83
00:03:50,420 --> 00:03:53,510
sometimes or make sure
that the number--

84
00:03:53,510 --> 00:03:56,800
I'll give you some techniques
how to do that as I go on.

85
00:03:56,800 --> 00:04:01,540
So for parallel loops, you can
do data parallelism, different

86
00:04:01,540 --> 00:04:02,340
[UNINTELLIGIBLE]

87
00:04:02,340 --> 00:04:04,580
parallelism you can do something
like fork-join.

88
00:04:04,580 --> 00:04:07,620
And you can use a bunch
of static or

89
00:04:07,620 --> 00:04:11,220
dynamic scheduling policies.

90
00:04:11,220 --> 00:04:16,790
So for example in OpenMP, you
can say for this loop, add

91
00:04:16,790 --> 00:04:17,959
a pragma to [UNINTELLIGIBLE]

92
00:04:17,959 --> 00:04:21,029
in front of this loop and
say this is OpenMP.

93
00:04:21,029 --> 00:04:22,450
Parallel loop in here.

94
00:04:22,450 --> 00:04:24,310
Parallel for loop
on this one.

95
00:04:24,310 --> 00:04:26,990
And schedule it using
static chunk.

96
00:04:26,990 --> 00:04:29,610
I will tell you what
exactly that means.

97
00:04:29,610 --> 00:04:34,680
And that gives you direct access
to how each of these

98
00:04:34,680 --> 00:04:36,730
parts will be run.

99
00:04:36,730 --> 00:04:39,320
So let me get a little
bit in detail.

100
00:04:39,320 --> 00:04:41,730
So assume you have
[UNINTELLIGIBLE] cores in

101
00:04:41,730 --> 00:04:42,950
there, [UNINTELLIGIBLE]
processors.

102
00:04:42,950 --> 00:04:46,270
So now in OpenMP, you are
basically opening the entire

103
00:04:46,270 --> 00:04:46,990
world underneath.

104
00:04:46,990 --> 00:04:49,840
And you have to kind of
see what's going on.

105
00:04:49,840 --> 00:04:55,330
And if you say, schedule a
static chunk of four, assume

106
00:04:55,330 --> 00:04:58,050
you have 16 iterations.

107
00:04:58,050 --> 00:05:00,200
Here are my 16 iterations.

108
00:05:00,200 --> 00:05:02,990
So each of these dots represents
a value for i.

109
00:05:02,990 --> 00:05:07,890
So what it says is you take
chunks of four and basically

110
00:05:07,890 --> 00:05:08,950
send it across it.

111
00:05:08,950 --> 00:05:11,190
So what happens is the first
four iterations will go to

112
00:05:11,190 --> 00:05:13,240
[UNINTELLIGIBLE] or core zero.

113
00:05:13,240 --> 00:05:15,650
Next four will go to core one,
core two and core three.

114
00:05:15,650 --> 00:05:18,870
So you know exactly which
iterations run where.

115
00:05:18,870 --> 00:05:20,660
It's a very static thing.

116
00:05:20,660 --> 00:05:23,100
You have full control
of what's going on.

117
00:05:23,100 --> 00:05:26,980
Whereas in Cilk, it's
up to the scheduler.

118
00:05:26,980 --> 00:05:29,090
So the nice thing here is you
can have full control.

119
00:05:29,090 --> 00:05:30,960
But you get enough room
to harm yourself if

120
00:05:30,960 --> 00:05:32,260
you do things wrong.

121
00:05:32,260 --> 00:05:36,510
So this is a double-edged
sword in that sense.

122
00:05:36,510 --> 00:05:39,990
So instead of doing static
four you do static two.

123
00:05:39,990 --> 00:05:43,010
You're assigning chunks
of size two.

124
00:05:43,010 --> 00:05:46,125
What it will do is it will
assign chunks of size two to

125
00:05:46,125 --> 00:05:48,450
the four cores.

126
00:05:48,450 --> 00:05:49,770
And then you're not done yet.

127
00:05:49,770 --> 00:05:52,620
And then you start with again
core zero and assign

128
00:05:52,620 --> 00:05:53,980
chunks of size two.

129
00:05:53,980 --> 00:05:57,880
This is called a block
cyclic schedule.

130
00:05:57,880 --> 00:05:59,880
And if you do a chunk
of size one, it's

131
00:05:59,880 --> 00:06:00,890
called a cyclic schedule.

132
00:06:00,890 --> 00:06:08,190
[UNINTELLIGIBLE] cycles just
assigning iterations to cores.
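
The mapping described above can be sketched as a small Python model. This is not OpenMP code: it only reproduces the round-robin assignment of chunks to cores that the OpenMP clause `schedule(static, chunk)` specifies. The function names `static_owner` and `static_schedule` are made up for this illustration, and the 16 iterations on 4 cores are the numbers used in the lecture example.

```python
# Model of "#pragma omp parallel for schedule(static, chunk)":
# the iteration space is cut into chunks of `chunk` iterations, and the
# chunks are handed to the cores round-robin in core-number order.

def static_owner(i, chunk, num_cores):
    """Core that runs iteration i under a static schedule with this chunk size."""
    return (i // chunk) % num_cores


def static_schedule(num_iters, chunk, num_cores):
    """Return, for each core, the list of iterations it runs."""
    per_core = [[] for _ in range(num_cores)]
    for i in range(num_iters):
        per_core[static_owner(i, chunk, num_cores)].append(i)
    return per_core


if __name__ == "__main__":
    # chunk 4: one block per core; chunk 2: block cyclic; chunk 1: cyclic
    for chunk in (4, 2, 1):
        print(f"chunk={chunk}")
        for core, iters in enumerate(static_schedule(16, chunk, 4)):
            print(f"  core {core}: {iters}")
```

With a chunk of four, each core gets one contiguous block. With a chunk of two, core zero gets iterations 0, 1, 8 and 9, which is the block cyclic case. With a chunk of one, iterations are dealt out one at a time.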

133
00:06:08,190 --> 00:06:08,830
OK.

134
00:06:08,830 --> 00:06:11,700
So far so good?

135
00:06:11,700 --> 00:06:13,780
OK.

136
00:06:13,780 --> 00:06:15,780
So I want to do something.

137
00:06:15,780 --> 00:06:19,740
So I have this program.

138
00:06:19,740 --> 00:06:21,790
This is again your

139
00:06:21,790 --> 00:06:24,350
run-of-the-mill matrix multiply.

140
00:06:24,350 --> 00:06:30,070
And I ran it sequentially on a single
machine, and I got this

141
00:06:30,070 --> 00:06:31,860
performance.

142
00:06:31,860 --> 00:06:34,750
Then, I said look, I want to
parallelize the outer loop.

143
00:06:34,750 --> 00:06:37,640
So I parallelize this loop.

144
00:06:37,640 --> 00:06:40,710
What should I get?

145
00:06:40,710 --> 00:06:41,790
[UNINTELLIGIBLE] fast or slow.

146
00:06:41,790 --> 00:06:46,970
I want to just check whether
you are awake or asleep.

147
00:06:46,970 --> 00:06:47,890
How do [UNINTELLIGIBLE PHRASE]

148
00:06:47,890 --> 00:06:49,140
to run slower.

149
00:06:53,030 --> 00:06:54,370
It's not a trick question.

150
00:06:54,370 --> 00:06:57,460
This is just to make sure that
you actually participate.

151
00:06:57,460 --> 00:06:59,230
How many people think
it runs faster?

152
00:06:59,230 --> 00:07:00,950
AUDIENCE: [INAUDIBLE]

153
00:07:00,950 --> 00:07:01,100
PROFESSOR: OK.

154
00:07:01,100 --> 00:07:02,390
Good.

155
00:07:02,390 --> 00:07:05,234
What do others think?

156
00:07:05,234 --> 00:07:06,070
OK.

157
00:07:06,070 --> 00:07:08,980
They're probably checking their
email or something.

158
00:07:08,980 --> 00:07:10,550
OK.

159
00:07:10,550 --> 00:07:14,340
So actually it ran faster.

160
00:07:14,340 --> 00:07:18,910
This was not run on the
current cloud machines.

161
00:07:18,910 --> 00:07:20,990
This was a previous generation
that I ran.

162
00:07:20,990 --> 00:07:23,390
So [UNINTELLIGIBLE] was
seven times faster.

163
00:07:23,390 --> 00:07:24,290
So this was great.

164
00:07:24,290 --> 00:07:25,670
I parallelized the outer loop.

165
00:07:25,670 --> 00:07:28,390
What happens if I parallelize
the inner loop?

166
00:07:28,390 --> 00:07:31,570
So this test, this i loop,
runs parallel.

167
00:07:31,570 --> 00:07:34,060
Here, I launch the
[UNINTELLIGIBLE] parallel.

168
00:07:34,060 --> 00:07:38,138
How many people think
this runs faster?

169
00:07:38,138 --> 00:07:39,970
AUDIENCE: [INAUDIBLE]

170
00:07:39,970 --> 00:07:42,800
PROFESSOR: Compared
to this one.

171
00:07:42,800 --> 00:07:45,290
How many people think
this runs slower?

172
00:07:45,290 --> 00:07:46,530
OK.

173
00:07:46,530 --> 00:07:50,530
There's some consistent
answers here.

174
00:07:50,530 --> 00:07:52,956
Why do you think it
would run slower?

175
00:07:56,130 --> 00:07:57,720
So OK.

176
00:07:57,720 --> 00:08:00,180
It ran slower, so it
can improve that.

177
00:08:00,180 --> 00:08:02,270
And that's a little bit slow.

178
00:08:02,270 --> 00:08:05,774
Why is it slow?

179
00:08:05,774 --> 00:08:08,250
AUDIENCE: [INAUDIBLE]

180
00:08:08,250 --> 00:08:09,050
PROFESSOR: Exactly.

181
00:08:09,050 --> 00:08:12,520
So what it's doing here, it's
basically spawning many, many

182
00:08:12,520 --> 00:08:13,200
times in here.

183
00:08:13,200 --> 00:08:17,530
Because every time you have
parallelism, you chunkify into

184
00:08:17,530 --> 00:08:18,500
the processor.

185
00:08:18,500 --> 00:08:21,450
Here you are getting a lot more
smaller chunks inside.

186
00:08:21,450 --> 00:08:24,700
So let's look at how this
is basically run.

187
00:08:24,700 --> 00:08:31,075
So normally, you can think about
an OpenMP program as you

188
00:08:31,075 --> 00:08:32,549
have one sequential thread.

189
00:08:32,549 --> 00:08:34,000
You run the main program.

190
00:08:34,000 --> 00:08:37,760
And then assume you have n
cores; you might have n minus

191
00:08:37,760 --> 00:08:42,100
1 other threads just
waiting for work.

192
00:08:42,100 --> 00:08:45,270
And then, when you finally
come to the parallel

193
00:08:45,270 --> 00:08:48,180
loop, it says, OK.

194
00:08:48,180 --> 00:08:48,730
Set up.

195
00:08:48,730 --> 00:08:52,080
What you want to run
on the other cores.

196
00:08:52,080 --> 00:08:53,510
And release it.

197
00:08:53,510 --> 00:08:55,360
Release these waiting people.

198
00:08:55,360 --> 00:08:57,020
And let them start working
on the parallel work.

199
00:08:57,020 --> 00:08:59,650
And also, I will start working
on my own chunk.

200
00:08:59,650 --> 00:09:03,720
So suddenly, when you say
parallel for, it releases all

201
00:09:03,720 --> 00:09:07,310
other cores to go run that
part of the code.

202
00:09:07,310 --> 00:09:11,740
And once it's done, it
will say, OK, I'm done.

203
00:09:11,740 --> 00:09:13,140
I have to wait until
everybody is done.

204
00:09:13,140 --> 00:09:16,570
So even if the main guy is done,
it has to wait until

205
00:09:16,570 --> 00:09:18,200
everybody is finished.

206
00:09:18,200 --> 00:09:22,610
And then, start executing the
sequence [UNINTELLIGIBLE].

207
00:09:22,610 --> 00:09:27,670
So this is the gist of how
an OpenMP program is run.

208
00:09:27,670 --> 00:09:30,840
And you realize there are
overheads here because you have to

209
00:09:30,840 --> 00:09:33,150
basically make sure all these
guys are woken up.

210
00:09:33,150 --> 00:09:35,740
So there are some things that
have to be issued.

211
00:09:35,740 --> 00:09:38,590
And there's a delay before
these guys can start, even if

212
00:09:38,590 --> 00:09:39,990
everybody has equal work.

213
00:09:39,990 --> 00:09:42,200
They might not finish at the same
time because it may take some time

214
00:09:42,200 --> 00:09:43,430
for this to start.

215
00:09:43,430 --> 00:09:46,490
And then, it has to also tell
this back OK, I am done.

216
00:09:46,490 --> 00:09:48,320
So there's a lot of
synchronization going on.

217
00:09:48,320 --> 00:09:49,390
Locks and unlocks.

218
00:09:49,390 --> 00:09:51,020
Here it's called barrier
synchronization.

219
00:09:54,085 --> 00:09:58,400
And so if this work is small,
this synchronization starts

220
00:09:58,400 --> 00:10:00,070
dominating.

221
00:10:00,070 --> 00:10:03,530
So what happens is
[UNINTELLIGIBLE]

222
00:10:03,530 --> 00:10:05,490
fine grain parallelism.

223
00:10:05,490 --> 00:10:08,090
Do a little work in the
parallel region, and

224
00:10:08,090 --> 00:10:10,880
synchronization will basically
start dominating your time.
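
A rough way to see why synchronization starts dominating is a toy cost model, sketched here in Python. It assumes each parallel region costs a fixed fork and join overhead (`sync_cost`) plus its work divided evenly over the cores. The function names and all numbers are hypothetical and chosen only for illustration; they are not measurements from the lecture.

```python
# Toy fork-join cost model (an assumption for illustration, not a measurement):
# every parallel region pays a fixed synchronization cost, then its work is
# split evenly across the cores.

def serial_time(num_regions, work_per_region):
    """Time to run all regions on one core, with no synchronization."""
    return num_regions * work_per_region


def parallel_time(num_regions, work_per_region, num_cores, sync_cost):
    """Time when every region is forked across num_cores and then joined."""
    return num_regions * (sync_cost + work_per_region / num_cores)


def speedup(num_regions, work_per_region, num_cores, sync_cost):
    """Serial time divided by parallel time; below 1.0 means a slowdown."""
    return serial_time(num_regions, work_per_region) / parallel_time(
        num_regions, work_per_region, num_cores, sync_cost
    )


if __name__ == "__main__":
    # Same total work (1,000,000 units) on 8 cores, split two ways.
    print("one big region:    ", speedup(1, 1_000_000, 8, 1000))
    print("1000 small regions:", speedup(1000, 1000, 8, 1000))
```

In this model the single large region gets close to the ideal speedup of eight, while the same work cut into a thousand small regions runs slower than the serial version, because each region pays the synchronization cost again.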

225
00:10:10,880 --> 00:10:13,310
So how do you detect this?

226
00:10:13,310 --> 00:10:15,880
And sometimes when you run
something parallel, it might

227
00:10:15,880 --> 00:10:19,150
even run slow because the
amount of stuff in the

228
00:10:19,150 --> 00:10:22,420
parallel region is so small,
[UNINTELLIGIBLE] will start

229
00:10:22,420 --> 00:10:22,760
dominating.

230
00:10:22,760 --> 00:10:24,010
And that's not a good way.

231
00:10:24,010 --> 00:10:26,700
And also, sometimes
you assume.

232
00:10:26,700 --> 00:10:29,240
And you keep increasing
the number of cores.

233
00:10:29,240 --> 00:10:32,750
Hopefully, you want to see a
nice parallelism increase, but

234
00:10:32,750 --> 00:10:35,090
it doesn't, even though you
have enough information.

235
00:10:35,090 --> 00:10:37,880
But that means you're running
a lot of small chunks, even

236
00:10:37,880 --> 00:10:39,390
though you seem to have a lot
of parallelism available.

237
00:10:42,730 --> 00:10:46,160
And also, you can measure the
synchronization time against the

238
00:10:46,160 --> 00:10:47,210
time in the parallel region.

239
00:10:47,210 --> 00:10:48,720
If the parallel regions
run for a very

240
00:10:48,720 --> 00:10:50,670
short time, this happens.

241
00:10:50,670 --> 00:10:54,780
We saw this effect when
we were doing Cilk.

242
00:10:54,780 --> 00:10:56,030
Remember?

243
00:10:56,030 --> 00:10:59,450
When did we see this granularity
affecting Cilk?

244
00:10:59,450 --> 00:11:00,750
And what did we do?

245
00:11:03,900 --> 00:11:04,910
When you write Cilk programs.

246
00:11:04,910 --> 00:11:05,660
You write [UNINTELLIGIBLE]

247
00:11:05,660 --> 00:11:06,950
programs.

248
00:11:06,950 --> 00:11:12,500
Where did the granularity
start showing up on us?

249
00:11:12,500 --> 00:11:15,160
It may not be exactly this
because the scheduling is

250
00:11:15,160 --> 00:11:15,430
complicated.

251
00:11:15,430 --> 00:11:15,890
OK.

252
00:11:15,890 --> 00:11:16,380
Yes?

253
00:11:16,380 --> 00:11:19,280
AUDIENCE: The two by two
matrix [INAUDIBLE]

254
00:11:19,280 --> 00:11:19,720
PROFESSOR: Yeah.

255
00:11:19,720 --> 00:11:20,420
Something like two by--

256
00:11:20,420 --> 00:11:22,800
for example, that's the reason
we wanted to have

257
00:11:22,800 --> 00:11:24,580
a large base case.

258
00:11:24,580 --> 00:11:26,900
Because if you didn't put a
large base case, it keeps

259
00:11:26,900 --> 00:11:30,110
dividing into smaller and
smaller problems.

260
00:11:30,110 --> 00:11:32,580
And if the scheduler is
smart, it won't be

261
00:11:32,580 --> 00:11:33,700
doing exactly this.

262
00:11:33,700 --> 00:11:37,350
But it's always good to have
these large granulated chunks

263
00:11:37,350 --> 00:11:38,620
at the bottom.
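
The effect of the base case can be illustrated by counting leaf tasks in a divide-and-conquer recursion. The sketch below assumes a recursive matrix multiply that splits an n by n problem into eight half-size subproblems until the size reaches the base case; that splitting scheme, the function name `leaf_tasks`, and the sizes used are assumptions made for illustration.

```python
# Count the base-case calls of a divide-and-conquer matrix multiply that
# splits an n x n problem into 8 subproblems of size n/2 (assumed scheme).

def leaf_tasks(n, base):
    """Number of base-case calls when recursion stops at size <= base."""
    if n <= base:
        return 1
    return 8 * leaf_tasks(n // 2, base)


if __name__ == "__main__":
    n = 1024
    for base in (2, 64):
        leaves = leaf_tasks(n, base)
        # Each leaf does base**3 multiply-adds; the total work is always n**3.
        print(f"base={base}: {leaves} leaf tasks, {base ** 3} multiply-adds each")
```

With a two by two base case there are 8**9 leaves that each do only eight multiply-adds, so any per-task overhead is paid about 134 million times. With a base case of 64 the same total work is carried by 4096 leaves.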

264
00:11:41,870 --> 00:11:43,890
So how to get [UNINTELLIGIBLE]

265
00:11:43,890 --> 00:11:45,560
granulated parallelism.

266
00:11:45,560 --> 00:11:48,080
What we need to do is reduce the
number of [UNINTELLIGIBLE]

267
00:11:48,080 --> 00:11:48,770
equations.

268
00:11:48,770 --> 00:11:51,760
So you want to always try to
look for the outermost loop

269
00:11:51,760 --> 00:11:55,080
you can get at all the really
large independent regions.

270
00:11:55,080 --> 00:11:57,040
So you go look, and not
[UNINTELLIGIBLE]

271
00:11:57,040 --> 00:11:57,930
thing you want to parallelize.

272
00:11:57,930 --> 00:12:01,165
You go up, up, up, up until the
point you can parallelize.

273
00:12:01,165 --> 00:12:05,490
And that's the best way to
get good performance.

274
00:12:05,490 --> 00:12:06,740
OK?

275
00:12:09,580 --> 00:12:13,850
So if you really compare these
three programs here, again,

276
00:12:13,850 --> 00:12:15,210
what you see--

277
00:12:15,210 --> 00:12:16,720
of course, this has no
synchronization.

278
00:12:16,720 --> 00:12:19,400
This has n amount of
synchronizations.

279
00:12:19,400 --> 00:12:20,870
Here in [UNINTELLIGIBLE]
synchronization, that's

280
00:12:20,870 --> 00:12:23,210
obviously a lot more
synchronization going on.

281
00:12:23,210 --> 00:12:27,190
And that is where this
[UNINTELLIGIBLE] comes from.
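
The exact counts on the slide are partly lost in the transcript. As a hedged sketch, the Python model below counts how many times a fork and join would be paid in a triply nested n by n by n loop nest, assuming one fork and join per entry into the parallel loop. The function name `count_fork_joins` and the `parallel_level` encoding are invented for this illustration, and the model ignores correctness concerns such as the reduction a parallel innermost loop would need.

```python
# Count entries into the parallel loop of a triple loop nest (i, j, k),
# assuming each entry costs one fork and one join (an assumption).

def count_fork_joins(n, parallel_level):
    """parallel_level: None = serial, 0 = i loop, 1 = j loop, 2 = k loop."""
    count = 0
    if parallel_level == 0:
        count += 1              # the i loop is entered once
    for _i in range(n):
        if parallel_level == 1:
            count += 1          # the j loop is entered once per i
        for _j in range(n):
            if parallel_level == 2:
                count += 1      # the k loop is entered once per (i, j)
    return count


if __name__ == "__main__":
    for level, name in ((None, "serial"), (0, "outer"), (1, "middle"), (2, "inner")):
        print(f"{name}: {count_fork_joins(8, level)} fork-joins")
```

Under this assumption the count grows from one, to n, to n squared as the parallel loop moves inward, which is why the outermost parallelizable loop is the usual target.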

282
00:12:27,190 --> 00:12:27,620
OK.

283
00:12:27,620 --> 00:12:30,260
So now, I am switching
a little bit in here.

284
00:12:30,260 --> 00:12:33,450
I want you guys to look at this
program a little bit.

285
00:12:33,450 --> 00:12:35,000
So what am I doing here?

286
00:12:35,000 --> 00:12:37,580
I have two [UNINTELLIGIBLE].

287
00:12:37,580 --> 00:12:45,110
And I am just basically adding
matrix B to matrix A. OK?

288
00:12:45,110 --> 00:12:48,720
And then I have another loop
nest here, adding matrix C to

289
00:12:48,720 --> 00:12:52,120
matrix A. And what am
I doing in here?

290
00:12:52,120 --> 00:12:55,740
I am basically going through
matrix A in another

291
00:12:55,740 --> 00:12:59,020
direction in here.

292
00:12:59,020 --> 00:13:00,270
AUDIENCE: [INAUDIBLE]

293
00:13:02,310 --> 00:13:03,700
PROFESSOR: It's not really
a transpose.

294
00:13:03,700 --> 00:13:04,590
I'm not transposing.

295
00:13:04,590 --> 00:13:08,830
What I'm doing is I'm actually
doing a mirror because the C

296
00:13:08,830 --> 00:13:10,210
gets mirrored on ix.

297
00:13:10,210 --> 00:13:11,030
It's because [UNINTELLIGIBLE]

298
00:13:11,030 --> 00:13:11,590
ix [UNINTELLIGIBLE]

299
00:13:11,590 --> 00:13:12,425
the other direction.

300
00:13:12,425 --> 00:13:12,740
OK?

301
00:13:12,740 --> 00:13:14,400
So it's not really
a transpose.

302
00:13:14,400 --> 00:13:16,960
So I do a mirror addition.

303
00:13:16,960 --> 00:13:19,010
And then I'm asking
for the two outer

304
00:13:19,010 --> 00:13:22,130
most loops to be parallel.

305
00:13:22,130 --> 00:13:25,440
So if you run this
sequential--

306
00:13:25,440 --> 00:13:29,870
OK, you get about 30
milliseconds, I

307
00:13:29,870 --> 00:13:32,120
guess, to run in here.

308
00:13:32,120 --> 00:13:33,700
So that is in [UNINTELLIGIBLE].

309
00:13:33,700 --> 00:13:35,580
But if you're running parallel,
what do you get?

310
00:13:35,580 --> 00:13:36,840
Should you get faster
or slower?

311
00:13:40,960 --> 00:13:42,310
OK.

312
00:13:42,310 --> 00:13:45,200
Anyone want to take a guess
[UNINTELLIGIBLE]

313
00:13:45,200 --> 00:13:46,995
Sometimes some of these
questions, you might not have

314
00:13:46,995 --> 00:13:48,040
enough information to answer.

315
00:13:48,040 --> 00:13:50,750
But it's still good to just take
a stand on one direction

316
00:13:50,750 --> 00:13:51,450
or another.

317
00:13:51,450 --> 00:13:54,120
How many people think
it runs faster?

318
00:13:54,120 --> 00:13:56,350
How many people think
it runs slower?

319
00:13:56,350 --> 00:13:57,750
OK.

320
00:13:57,750 --> 00:13:59,550
Some.

321
00:13:59,550 --> 00:14:01,940
Oops.

322
00:14:01,940 --> 00:14:03,190
What happened?

323
00:14:07,310 --> 00:14:09,140
What happened in here?

324
00:14:19,750 --> 00:14:25,150
Can anybody point out why it
might be running slower

325
00:14:25,150 --> 00:14:27,880
in parallel than running
sequentially?

326
00:14:31,275 --> 00:14:33,220
AUDIENCE: [INAUDIBLE]

327
00:14:33,220 --> 00:14:33,550
PROFESSOR: Yeah.

328
00:14:33,550 --> 00:14:34,760
There's a cache issue.

329
00:14:34,760 --> 00:14:38,611
What's the possible cache
issue in here?

330
00:14:38,611 --> 00:14:41,600
AUDIENCE: [INAUDIBLE]

331
00:14:41,600 --> 00:14:42,850
PROFESSOR: Yeah.

332
00:14:44,610 --> 00:14:50,610
If you think about it, the first
iterations of, I guess, the

333
00:14:50,610 --> 00:14:51,080
first core--

334
00:14:51,080 --> 00:14:52,230
I have some diagram.

335
00:14:52,230 --> 00:14:53,740
I'll show it to you in there.

336
00:14:53,740 --> 00:14:56,850
And here, only the last data
elements will get touched by the

337
00:14:56,850 --> 00:14:58,760
first iterations because
we are going

338
00:14:58,760 --> 00:14:59,970
in the other direction.
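
The loss of reuse between the two loop nests can be sketched with a small Python model. It assumes the outer i loop is split into one contiguous block of rows per core (a typical static default, but an assumption here), that the second nest indexes A as `A[n - 1 - i][j]`, and that the number of cores divides n. The slide's actual indexing is not recoverable from the transcript, and the function names are hypothetical.

```python
# Which rows of A does each core touch in the plain loop nest and in the
# mirrored one? Assumes contiguous blocks of i per core and A[n-1-i][j]
# indexing in the second nest (both are assumptions for illustration).

def rows_touched(n, num_cores, mirrored):
    """Set of rows of A touched by each core; num_cores must divide n."""
    block = n // num_cores
    result = []
    for core in range(num_cores):
        iters = range(core * block, (core + 1) * block)
        result.append({(n - 1 - i) if mirrored else i for i in iters})
    return result


def rows_reused(n, num_cores):
    """Per core: rows touched in the first nest and again in the second."""
    first = rows_touched(n, num_cores, mirrored=False)
    second = rows_touched(n, num_cores, mirrored=True)
    return [len(a & b) for a, b in zip(first, second)]


if __name__ == "__main__":
    print("4 cores:", rows_reused(16, 4))   # rows each core sees in both nests
    print("1 core: ", rows_reused(16, 1))   # the sequential run
```

On four cores no core revisits a row it touched in the first nest, so every row of A has to move to a different core's cache. On one core all sixteen rows are revisited, so data that is still in the cache from the first nest can be reused.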

339
00:14:59,970 --> 00:15:03,500
So if you look at it a little
more deeply into

340
00:15:03,500 --> 00:15:04,980
what's going on.

341
00:15:04,980 --> 00:15:08,710
The number of instructions seems
to be a little higher.

342
00:15:08,710 --> 00:15:11,310
This one I couldn't actually
explain why this might be the

343
00:15:11,310 --> 00:15:13,270
case in here.

344
00:15:13,270 --> 00:15:15,530
If anybody has an idea,
you can say that.

345
00:15:15,530 --> 00:15:18,620
But this was kind of
[UNINTELLIGIBLE].

346
00:15:18,620 --> 00:15:21,500
This might be [UNINTELLIGIBLE]
the cycles.

347
00:15:21,500 --> 00:15:22,750
Huh.

348
00:15:27,830 --> 00:15:28,600
OK.

349
00:15:28,600 --> 00:15:30,610
I can explain this.

350
00:15:30,610 --> 00:15:34,730
Because this is a sequential
run, this is a sum total of a

351
00:15:34,730 --> 00:15:36,380
parallel run.

352
00:15:36,380 --> 00:15:40,560
So because of all the overhead
that happens because this was

353
00:15:40,560 --> 00:15:43,200
running on, I think, an
eight core machine.

354
00:15:43,200 --> 00:15:45,630
So you're running eight
of these small computations.

355
00:15:45,630 --> 00:15:48,420
There's a lot of overhead that
goes around, synchronization,

356
00:15:48,420 --> 00:15:49,320
and stuff like that.

357
00:15:49,320 --> 00:15:52,050
So the number of instructions
just blows up.

358
00:15:52,050 --> 00:15:55,998
But for each core, you don't
have this blow up.

359
00:15:55,998 --> 00:15:57,248
AUDIENCE: [INAUDIBLE]

360
00:15:59,484 --> 00:15:59,982
Cilk?

361
00:15:59,982 --> 00:16:03,468
Because does Cilk have different
processor affinity,

362
00:16:03,468 --> 00:16:06,456
things that open
[UNINTELLIGIBLE]?

363
00:16:06,456 --> 00:16:11,960
Because it seems like if the
program, the language--

364
00:16:11,960 --> 00:16:13,786
PROFESSOR: [INAUDIBLE].

365
00:16:13,786 --> 00:16:16,610
Let's see if we can process
the affinity

366
00:16:16,610 --> 00:16:18,670
information or if not.

367
00:16:18,670 --> 00:16:21,330
It's just pure [UNINTELLIGIBLE].

368
00:16:21,330 --> 00:16:23,780
AUDIENCE: [INAUDIBLE]

369
00:16:23,780 --> 00:16:24,600
PROFESSOR: Yeah.

370
00:16:24,600 --> 00:16:28,430
I mean if you like executed
locally if you have good cache

371
00:16:28,430 --> 00:16:28,900
[UNINTELLIGIBLE]

372
00:16:28,900 --> 00:16:29,210
with them.

373
00:16:29,210 --> 00:16:31,060
But if there's no cache
[UNINTELLIGIBLE]

374
00:16:31,060 --> 00:16:32,670
you might steal something
where data might

375
00:16:32,670 --> 00:16:33,887
be somewhere else.

376
00:16:33,887 --> 00:16:37,296
AUDIENCE: But you'll still
mimic the cache behavior,

377
00:16:37,296 --> 00:16:42,166
considerably, except
for when you steal.

378
00:16:42,166 --> 00:16:44,601
[INAUDIBLE]

379
00:16:44,601 --> 00:16:45,088
PROFESSOR: Yeah.

380
00:16:45,088 --> 00:16:46,070
So OK.

381
00:16:46,070 --> 00:16:49,450
We don't have a mic in here.

382
00:16:49,450 --> 00:16:49,640
OK.

383
00:16:49,640 --> 00:16:50,570
There's a mic.

384
00:16:50,570 --> 00:16:51,820
There we go.

385
00:16:54,380 --> 00:16:57,110
But if you have two different of
these regions, the way the

386
00:16:57,110 --> 00:17:01,985
parallelization happens
can be different.

387
00:17:01,985 --> 00:17:06,440
AUDIENCE: In Cilk, what
happens is the code is

388
00:17:06,440 --> 00:17:10,569
mimicking, for the most part,
exactly what the C or C++ code

389
00:17:10,569 --> 00:17:11,930
would be doing.

390
00:17:11,930 --> 00:17:15,339
And so you get exactly the same
cache hits and misses.

391
00:17:15,339 --> 00:17:19,690
Except when you steal, it's
like starting over with an

392
00:17:19,690 --> 00:17:21,550
empty cache.

393
00:17:21,550 --> 00:17:21,770
OK?

394
00:17:21,770 --> 00:17:24,589
But as long as you have
sufficient parallelism, the

395
00:17:24,589 --> 00:17:29,080
steals don't occur very often.

396
00:17:29,080 --> 00:17:31,520
And so therefore, you end up
getting the same kind of

397
00:17:31,520 --> 00:17:33,740
behavior that you would get
out of the serial code.

398
00:17:33,740 --> 00:17:34,120
PROFESSOR: Yeah.

399
00:17:34,120 --> 00:17:37,250
But Charles, in this one,
because you had to steal

400
00:17:37,250 --> 00:17:41,240
everything here, the between
here and here, the parallelism

401
00:17:41,240 --> 00:17:41,830
would be different.

402
00:17:41,830 --> 00:17:43,220
AUDIENCE: There would be no
affinity between those two.

403
00:17:43,220 --> 00:17:44,380
PROFESSOR: No [? affinity ?]
will be there.

404
00:17:44,380 --> 00:17:45,300
Yeah, exactly.

405
00:17:45,300 --> 00:17:48,180
So in the sequential one,
everything that fits in the

406
00:17:48,180 --> 00:17:50,720
cache, so that would be affinity
because we are not

407
00:17:50,720 --> 00:17:51,790
doing parallelism.

408
00:17:51,790 --> 00:17:53,510
And that's what I think
happened here.

409
00:17:53,510 --> 00:17:54,990
Because you had no
[? affinity-- ?]

410
00:17:54,990 --> 00:17:57,960
AUDIENCE: No, in a serial code,
there's no [? affinity. ?]

411
00:17:57,960 --> 00:17:58,070
PROFESSOR: No.

412
00:17:58,070 --> 00:18:00,920
Serial code-- if this fits in
the cache, it's then running

413
00:18:00,920 --> 00:18:03,090
on one core.

414
00:18:03,090 --> 00:18:07,900
So if it fits in the one core's
cache, you're happy.

415
00:18:07,900 --> 00:18:10,770
AUDIENCE: So the issue is--
right-- is if you only access

416
00:18:10,770 --> 00:18:14,860
it once, by the time you
fill up the cache--

417
00:18:14,860 --> 00:18:16,600
It takes some time to fill
up the cache to get them

418
00:18:16,600 --> 00:18:17,280
synchronized.

419
00:18:17,280 --> 00:18:19,050
PROFESSOR: So if it fits in the
one core's cache, it's OK.

420
00:18:19,050 --> 00:18:21,210
Otherwise, it has no
affinity in here.

421
00:18:21,210 --> 00:18:26,290
So the key difference in here
is, of course, CPI is slow.

422
00:18:26,290 --> 00:18:27,530
We don't know exactly why.

423
00:18:27,530 --> 00:18:28,910
But in [UNINTELLIGIBLE].

424
00:18:28,910 --> 00:18:31,480
So what you find is that there's
a huge amount of cache

425
00:18:31,480 --> 00:18:32,150
in [UNINTELLIGIBLE]

426
00:18:32,150 --> 00:18:33,660
going on.

427
00:18:33,660 --> 00:18:35,660
So that should give you a
feeling of what's going on.

428
00:18:35,660 --> 00:18:37,370
So let's look at what
might happen.

429
00:18:37,370 --> 00:18:42,070
So I'm showing this matches
[UNINTELLIGIBLE]

430
00:18:42,070 --> 00:18:46,950
last year on-- what we had were
Cagnode machines that

431
00:18:46,950 --> 00:18:49,130
were basically code
to quad processor.

432
00:18:49,130 --> 00:18:50,620
So we had eight codes in here.

433
00:18:50,620 --> 00:18:52,956
And I put them-- you
don't have to now

434
00:18:52,956 --> 00:18:53,680
look at this table.

435
00:18:53,680 --> 00:18:57,210
I put them in the slides so
you can look at it later.

436
00:18:57,210 --> 00:18:58,920
And so this is the last
year's machine.

437
00:18:58,920 --> 00:19:01,360
And of course, this year's
machine is different.

438
00:19:01,360 --> 00:19:04,610
We have two six core
processors in here.

439
00:19:04,610 --> 00:19:06,330
So this is what we
[UNINTELLIGIBLE]

440
00:19:06,330 --> 00:19:07,650
this year.

441
00:19:07,650 --> 00:19:09,380
[UNINTELLIGIBLE]

442
00:19:09,380 --> 00:19:10,390
OK.

443
00:19:10,390 --> 00:19:13,670
And so right now, I'm showing
numbers for this one.

444
00:19:13,670 --> 00:19:15,830
And later, I will show
what happened in the

445
00:19:15,830 --> 00:19:16,770
[UNINTELLIGIBLE].

446
00:19:16,770 --> 00:19:19,360
So if you look at a cache--

447
00:19:19,360 --> 00:19:25,390
so what happened is each of the
data items in the cache

448
00:19:25,390 --> 00:19:27,340
can be in multiple states.

449
00:19:27,340 --> 00:19:29,600
This is called the MSI
protocol here.

450
00:19:29,600 --> 00:19:33,120
What that means is the item
might be modified.

451
00:19:33,120 --> 00:19:36,490
If it is modified, it can
be only in one cache.

452
00:19:36,490 --> 00:19:40,220
If anybody else wants to touch
it, it has to get it out of

453
00:19:40,220 --> 00:19:42,340
the modified state.

454
00:19:42,340 --> 00:19:43,670
Or it can be sharing.

455
00:19:43,670 --> 00:19:46,190
Sharing means it's reading.

456
00:19:46,190 --> 00:19:49,780
So that means that item is
read by multiple people.

457
00:19:49,780 --> 00:19:51,160
And that can have
multiple copies.

458
00:19:51,160 --> 00:19:53,375
So sharing items can be
in multiple places.

459
00:19:55,890 --> 00:19:59,220
However if you're modifying,
[UNINTELLIGIBLE]

460
00:19:59,220 --> 00:20:00,890
items in everybody else.

461
00:20:00,890 --> 00:20:03,130
So that means if I modify
something, it

462
00:20:03,130 --> 00:20:04,230
can only be in mine.

463
00:20:04,230 --> 00:20:06,540
If other people had that
data, I had to go in

464
00:20:06,540 --> 00:20:07,070
and invalidate this.

465
00:20:07,070 --> 00:20:09,470
So if you modify this, I had
to go in and invalidate it.

466
00:20:09,470 --> 00:20:10,740
So that's a sharing state.

467
00:20:10,740 --> 00:20:11,910
That means I'm [UNINTELLIGIBLE]
everybody

468
00:20:11,910 --> 00:20:13,130
[UNINTELLIGIBLE]

469
00:20:13,130 --> 00:20:13,880
read this.

470
00:20:13,880 --> 00:20:16,290
But if I ever want to change
that, I have to go in and

471
00:20:16,290 --> 00:20:17,780
invalidate this one.

472
00:20:17,780 --> 00:20:20,530
So what that means is when I
start writing, I am invalidating

473
00:20:20,530 --> 00:20:21,750
it from everybody.

474
00:20:21,750 --> 00:20:25,980
So even if everybody kept a copy
and they start modifying,

475
00:20:25,980 --> 00:20:28,540
I had to get my own copy.

476
00:20:28,540 --> 00:20:30,000
And everybody else
will invalidate.

477
00:20:30,000 --> 00:20:32,260
And then, if somebody else
wanted to read--

478
00:20:32,260 --> 00:20:34,410
for example, if this
guy wants to read--

479
00:20:34,410 --> 00:20:35,960
basically, this has
to make this a

480
00:20:35,960 --> 00:20:37,950
sharing and back to sharing.

481
00:20:37,950 --> 00:20:40,080
That means I have to get the
value 13, propagate it, and

482
00:20:40,080 --> 00:20:43,100
this becomes sharing again.

483
00:20:43,100 --> 00:20:43,560
OK.

484
00:20:43,560 --> 00:20:44,600
Did you get that?

485
00:20:44,600 --> 00:20:45,770
What's going on in here?

486
00:20:45,770 --> 00:20:47,520
In the cache?

487
00:20:47,520 --> 00:20:51,050
So reads, everybody can keep
a copy if they want.

488
00:20:51,050 --> 00:20:53,530
Write-- only one guy
can keep a copy.
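The read and write rules just described can be sketched as a toy state machine. This is only a model under stated assumptions: the `Line` class, its method names, and the two-cache setup are made up for illustration, and real hardware tracks this per cache line with bus or directory traffic that is not modeled here.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    SHARED = "S"
    INVALID = "I"

class Line:
    """One memory line tracked across several private caches (toy MSI model)."""

    def __init__(self, num_caches):
        self.states = [State.INVALID] * num_caches
        self.invalidations = 0  # copies invalidated in other caches

    def read(self, cache):
        if self.states[cache] is State.INVALID:
            # A modified copy elsewhere is downgraded to shared first.
            for other, st in enumerate(self.states):
                if st is State.MODIFIED:
                    self.states[other] = State.SHARED
            self.states[cache] = State.SHARED

    def write(self, cache):
        # Every other valid copy must be invalidated before the write.
        for other, st in enumerate(self.states):
            if other != cache and st is not State.INVALID:
                self.states[other] = State.INVALID
                self.invalidations += 1
        self.states[cache] = State.MODIFIED

line = Line(num_caches=2)
line.read(0)
line.read(1)               # both caches now hold a shared copy
line.write(0)              # cache 1's copy is invalidated
line.write(1)              # cache 0's copy is invalidated
print(line.invalidations)  # prints 2
```

Each further alternating write adds one more invalidation, which is the bouncing that shows up in the invalidation counters.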

489
00:20:53,530 --> 00:20:55,730
So what happens then
is true sharing.

490
00:20:55,730 --> 00:20:58,280
So you have these two
different cores.

491
00:20:58,280 --> 00:20:59,550
So I want to read something.

492
00:20:59,550 --> 00:21:02,500
So I get it from outside
probably from main memory, and I

493
00:21:02,500 --> 00:21:04,820
put it in my cache in here.

494
00:21:04,820 --> 00:21:09,790
And then, the next guy wants
to write the same thing.

495
00:21:09,790 --> 00:21:11,200
Assume I'm writing that.

496
00:21:11,200 --> 00:21:14,520
And once I want to write that,
I can't keep this copy. I have to

497
00:21:14,520 --> 00:21:16,920
invalidate it from here
and get a copy here.

498
00:21:16,920 --> 00:21:19,130
And then, if this way, I want
to write it again, I have to

499
00:21:19,130 --> 00:21:21,140
basically invalidate it from
here and get a copy.

500
00:21:21,140 --> 00:21:23,400
If I'm reading, both of us can
keep a copy. But writes just kind of

501
00:21:23,400 --> 00:21:25,570
keep bouncing back and forth,
back and forth.

502
00:21:25,570 --> 00:21:30,350
And so if you bounce too many
times, you get all of these

503
00:21:30,350 --> 00:21:31,520
invalidations.

504
00:21:31,520 --> 00:21:34,770
So the fact that I looked and I
have invalidations basically

505
00:21:34,770 --> 00:21:37,770
tells me something like
this is going on.

506
00:21:37,770 --> 00:21:39,505
So what's happening
in this program?

507
00:21:42,070 --> 00:21:46,420
When I parallelize this for
loop, my four cores--

508
00:21:46,420 --> 00:21:49,080
basically since I am doing
here [UNINTELLIGIBLE]--

509
00:21:49,080 --> 00:21:51,710
are going to get this
nice distribution of

510
00:21:51,710 --> 00:21:53,690
data into the caches.

511
00:21:53,690 --> 00:21:55,900
Assume it fits in cache.

512
00:21:55,900 --> 00:21:56,140
OK.

513
00:21:56,140 --> 00:21:59,050
So all this data nicely fits
into cache, and now I'm pretty

514
00:21:59,050 --> 00:22:00,760
happy when I run this one
because I got this

515
00:22:00,760 --> 00:22:01,490
data into the cache.

516
00:22:01,490 --> 00:22:02,910
And I write it.

517
00:22:02,910 --> 00:22:07,860
But the minute I go in here,
basically this data has to

518
00:22:07,860 --> 00:22:09,110
[UNINTELLIGIBLE].

519
00:22:10,710 --> 00:22:14,690
OK, because I am going-- this is n
minus i in here-- so this data

520
00:22:14,690 --> 00:22:15,520
has to flip around.

521
00:22:15,520 --> 00:22:19,380
And by doing this, basically,
I incur this huge amount of

522
00:22:19,380 --> 00:22:22,090
[UNINTELLIGIBLE], and
it slows down.

523
00:22:22,090 --> 00:22:22,670
OK?

524
00:22:22,670 --> 00:22:25,250
So that's why it didn't
work well.

525
00:22:25,250 --> 00:22:28,990
So what can you do?

526
00:22:33,580 --> 00:22:38,230
When you have these read-write
and write-write

527
00:22:38,230 --> 00:22:40,050
conflicts in here.

528
00:22:40,050 --> 00:22:44,420
And you have to actually move
the data across in here.

529
00:22:44,420 --> 00:22:48,450
And what you can do is look
for this true sharing.

530
00:22:48,450 --> 00:22:49,550
You can look at the
[UNINTELLIGIBLE]

531
00:22:49,550 --> 00:22:51,030
and see if we have excessive

532
00:22:51,030 --> 00:22:52,910
[UNINTELLIGIBLE], we have a problem.

533
00:22:52,910 --> 00:22:54,990
And how do we eliminate that?

534
00:22:54,990 --> 00:22:56,240
You want to make the
sharing minimal.

535
00:22:59,580 --> 00:23:01,984
If you want to get some data
into a cache, you want to try

536
00:23:01,984 --> 00:23:04,320
to keep it there as
much as possible.

537
00:23:04,320 --> 00:23:06,050
And if you're using, you'd
want to try to align

538
00:23:06,050 --> 00:23:08,080
everything across.

539
00:23:08,080 --> 00:23:11,140
So even across different
regions, it'll use the same

540
00:23:11,140 --> 00:23:12,950
kind of things.

541
00:23:12,950 --> 00:23:15,120
And/or enforce some kind of
[UNINTELLIGIBLE] technique to

542
00:23:15,120 --> 00:23:15,900
keep the data alive.

543
00:23:15,900 --> 00:23:18,045
So there are a lot of techniques
in there, but true

544
00:23:18,045 --> 00:23:19,900
sharing can be an interesting
problem.

545
00:23:19,900 --> 00:23:23,090
So in here, simple
change, yes.

546
00:23:23,090 --> 00:23:28,380
You're, basically, instead of
changing A, you change C. So

547
00:23:28,380 --> 00:23:30,130
you write A the same way.

548
00:23:30,130 --> 00:23:33,580
But now what I have done is
I am doing the mirror by

549
00:23:33,580 --> 00:23:36,760
changing the access to C, is to
[UNINTELLIGIBLE] is the same

550
00:23:36,760 --> 00:23:38,590
as this access.

551
00:23:38,590 --> 00:23:40,070
So these two are
the same thing.

552
00:23:40,070 --> 00:23:42,620
And the minute I
do that, voila!

553
00:23:42,620 --> 00:23:44,210
I get good speed up.

554
00:23:44,210 --> 00:23:47,290
Because if you look at that, my
invalidations have gone down.

555
00:23:47,290 --> 00:23:48,640
My L1 cache [UNINTELLIGIBLE]
has now

556
00:23:48,640 --> 00:23:49,610
really, really gone down.

557
00:23:49,610 --> 00:23:51,030
I'm not doing anything.

558
00:23:51,030 --> 00:23:54,820
And of course, I am doing more
instructions here than this

559
00:23:54,820 --> 00:23:57,220
one because--

560
00:23:57,220 --> 00:24:00,070
I think, the difference between
instruction here and

561
00:24:00,070 --> 00:24:03,220
here is because a lot of times
synchronization operations are

562
00:24:03,220 --> 00:24:05,530
dynamic because in the
[UNINTELLIGIBLE]

563
00:24:05,530 --> 00:24:07,760
miscounted as the instructions,
you are busy

564
00:24:07,760 --> 00:24:09,770
waiting in there.

565
00:24:09,770 --> 00:24:13,350
So this number is not really
a constant number.

566
00:24:13,350 --> 00:24:14,850
OK, question.

567
00:24:14,850 --> 00:24:15,440
AUDIENCE: Not a question.

568
00:24:15,440 --> 00:24:19,020
So another thing one could do
here is do loop fusion.

569
00:24:19,020 --> 00:24:19,530
PROFESSOR: Yes.

570
00:24:19,530 --> 00:24:20,410
Yes.

571
00:24:20,410 --> 00:24:24,100
Here is a nice way of putting
both of the loops into one and

572
00:24:24,100 --> 00:24:24,990
do loop fusion.

573
00:24:24,990 --> 00:24:26,300
And that works.

574
00:24:26,300 --> 00:24:28,230
In this case, you can do that.

575
00:24:28,230 --> 00:24:31,360
AUDIENCE: So loop fusion is
where you take two loops, and

576
00:24:31,360 --> 00:24:33,620
you convert them into one loop.

577
00:24:33,620 --> 00:24:37,050
So in this case, you could have
just written one nest,

578
00:24:37,050 --> 00:24:38,840
which has two things
going on inside.

579
00:24:41,490 --> 00:24:44,580
And then you would save all
the loop overhead and the

580
00:24:44,580 --> 00:24:46,090
scheduling overhead.

581
00:24:46,090 --> 00:24:49,090
So rather than doing it twice,
you actually have reduced the

582
00:24:49,090 --> 00:24:52,890
overhead to just the parallelism
of the one loop.

583
00:24:52,890 --> 00:24:55,790
So if you look at that, you'll
realize that you could somehow

584
00:24:55,790 --> 00:24:59,670
make it just be a single nest
with two statements in there,

585
00:24:59,670 --> 00:25:01,130
rather than one.

586
00:25:01,130 --> 00:25:04,100
PROFESSOR: So basically, instead
of [UNINTELLIGIBLE]

587
00:25:04,100 --> 00:25:08,190
entire thing and move plus
C into here, basically.

588
00:25:08,190 --> 00:25:10,040
And I could have just done
it in one loop nest.

589
00:25:10,040 --> 00:25:11,950
That's what loop fusion
would do here.
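A minimal sketch of the loop fusion being discussed. The arrays `a`, `b`, `c` and the two statements are hypothetical stand-ins, since the slide code is not in the transcript; the sketch assumes the second statement reads `a` at the same index the first one just wrote, which is what makes fusing safe.

```python
def two_loops(a, b, c):
    """Original shape: two separate loop nests over the same range."""
    n = len(a)
    for i in range(n):          # first nest: update a
        a[i] = a[i] + b[i]
    for i in range(n):          # second nest: update c from a
        c[i] = c[i] + a[i]
    return a, c

def fused(a, b, c):
    """Fused shape: one nest with two statements in the body."""
    n = len(a)
    for i in range(n):
        a[i] = a[i] + b[i]
        c[i] = c[i] + a[i]      # a[i] is already final at this point
    return a, c

print(fused([1, 2, 3], [10, 20, 30], [0, 0, 0]))
# prints ([11, 22, 33], [11, 22, 33])
```

If the second statement instead read a mirrored element such as `a[n - 1 - i]`, that element might not have been updated yet, so the loops could not be fused this directly.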

590
00:25:11,950 --> 00:25:13,750
So we can actually
[UNINTELLIGIBLE]

591
00:25:13,750 --> 00:25:15,280
much nicer in here.

592
00:25:15,280 --> 00:25:19,190
But just for example purposes,
so now I really reduced this

593
00:25:19,190 --> 00:25:20,540
one and got that.

594
00:25:20,540 --> 00:25:21,740
So this is great.

595
00:25:21,740 --> 00:25:23,830
Cagnodes really showed
this classic

596
00:25:23,830 --> 00:25:25,820
problem in the computer.

597
00:25:25,820 --> 00:25:27,070
And so I'm like, OK.

598
00:25:27,070 --> 00:25:28,720
Now we have new machines.

599
00:25:28,720 --> 00:25:30,075
Let's try it and see
what happens.

600
00:25:34,090 --> 00:25:35,670
What does this show?

601
00:25:35,670 --> 00:25:41,240
This is your nice cloud machines
we've got now.

602
00:25:41,240 --> 00:25:42,480
I have no slow down.

603
00:25:42,480 --> 00:25:47,570
I was really disappointed
because beforehand, I had this

604
00:25:47,570 --> 00:25:50,070
true sharing going on
in here and had a

605
00:25:50,070 --> 00:25:51,100
really big slow down.

606
00:25:51,100 --> 00:25:54,070
But this one, in fact, the
difference is very small.

607
00:25:54,070 --> 00:25:56,540
And when you look at any kind of
performance counters, they

608
00:25:56,540 --> 00:25:58,580
are pretty comparable.

609
00:25:58,580 --> 00:26:00,220
There's nothing much
going on here.

610
00:26:00,220 --> 00:26:03,580
So what do you think is
going on in this new

611
00:26:03,580 --> 00:26:06,130
architecture now?

612
00:26:06,130 --> 00:26:07,400
Why this might be?

613
00:26:15,730 --> 00:26:16,980
AUDIENCE: [INAUDIBLE]

614
00:26:26,030 --> 00:26:28,800
PROFESSOR: That's an interesting
observation, but

615
00:26:28,800 --> 00:26:32,050
also we have--

616
00:26:32,050 --> 00:26:36,250
yes, core seven basically is
two by two in the die.

617
00:26:36,250 --> 00:26:38,650
But we also have two different
processors.

618
00:26:38,650 --> 00:26:40,110
So that's there, too.

619
00:26:40,110 --> 00:26:43,640
So in some sense, when you
get our two-way process

620
00:26:43,640 --> 00:26:44,670
[UNINTELLIGIBLE]

621
00:26:44,670 --> 00:26:45,620
So that's there.

622
00:26:45,620 --> 00:26:46,370
That might help.

623
00:26:46,370 --> 00:26:48,650
That's an interesting
observation.

624
00:26:48,650 --> 00:26:50,340
What else might be
going on here?

625
00:26:50,340 --> 00:26:52,880
Why do you think they manage
to get this one?

626
00:26:52,880 --> 00:26:54,580
What might be another answer?

627
00:26:58,790 --> 00:27:03,350
What can hide these kind of
delays that can happen?

628
00:27:03,350 --> 00:27:06,990
Load delays, and cache misses,
and stuff like that.

629
00:27:06,990 --> 00:27:09,110
What techniques in hardware
can hide those?

630
00:27:15,300 --> 00:27:17,050
Just [UNINTELLIGIBLE]
a speculation.

631
00:27:17,050 --> 00:27:18,610
Prefetching.

632
00:27:18,610 --> 00:27:22,290
So most hardware has an
internal mechanism.

633
00:27:22,290 --> 00:27:24,890
When you start fetching
data, you say, aha!

634
00:27:24,890 --> 00:27:26,420
I see a pattern.

635
00:27:26,420 --> 00:27:27,650
I know you want to
get this thing.

636
00:27:27,650 --> 00:27:31,960
Let me go forward and bring more
data, thinking you are

637
00:27:31,960 --> 00:27:34,240
going to follow that pattern.

638
00:27:34,240 --> 00:27:34,590
OK.

639
00:27:34,590 --> 00:27:36,240
All or most of the
[UNINTELLIGIBLE]

640
00:27:36,240 --> 00:27:39,010
for, I think, have
[UNINTELLIGIBLE] even a

641
00:27:39,010 --> 00:27:42,280
Pentium had something for
prefetching going on.

642
00:27:42,280 --> 00:27:44,470
But most of the time, what
happens is the prefetching

643
00:27:44,470 --> 00:27:45,970
engine can't keep up.

644
00:27:45,970 --> 00:27:47,930
If you are getting there,
it's [UNINTELLIGIBLE]

645
00:27:47,930 --> 00:27:48,880
a little bit further.

646
00:27:48,880 --> 00:27:50,650
You are going to catch up, and
you're going to start because

647
00:27:50,650 --> 00:27:53,026
you have more [UNINTELLIGIBLE]
here.

648
00:27:53,026 --> 00:27:54,980
I think [UNINTELLIGIBLE]

649
00:27:54,980 --> 00:27:56,880
has a really, really
good prefetcher.

650
00:27:56,880 --> 00:28:02,170
And then, we saw it in our
architecture slides, too.

651
00:28:02,170 --> 00:28:04,940
That a lot of things that used
to happen before are gone.

652
00:28:04,940 --> 00:28:05,990
So this is really good.

653
00:28:05,990 --> 00:28:11,330
What that means is a lot of
weird stuff that's going on

654
00:28:11,330 --> 00:28:12,360
[UNINTELLIGIBLE]

655
00:28:12,360 --> 00:28:14,010
making them disappear.

656
00:28:14,010 --> 00:28:15,990
So these kind of problems
don't show up.

657
00:28:15,990 --> 00:28:17,260
So that's the nice story.

658
00:28:17,260 --> 00:28:18,920
The other part is, OK.

659
00:28:18,920 --> 00:28:22,630
Now if you start really tweaking
your programs to one

660
00:28:22,630 --> 00:28:25,120
architecture, you wait
a generation.

661
00:28:25,120 --> 00:28:28,300
And then now, we have done
either the tweaking--

662
00:28:28,300 --> 00:28:33,910
the best case, tweaking has
no impact, and it's

663
00:28:33,910 --> 00:28:35,170
not affecting anything.

664
00:28:35,170 --> 00:28:37,235
Most of the time, in the worst case,
tweaking actually slows

665
00:28:37,235 --> 00:28:39,160
down the program because you
are trying to do something

666
00:28:39,160 --> 00:28:40,030
complicated.

667
00:28:40,030 --> 00:28:42,790
That's just not needed
anymore.

668
00:28:42,790 --> 00:28:46,990
So even though these kind
of things showed up in

669
00:28:46,990 --> 00:28:47,500
[UNINTELLIGIBLE]

670
00:28:47,500 --> 00:28:48,750
architecture, it's
not an issue.

671
00:28:48,750 --> 00:28:52,250
But if you go to many of the
smaller architectures that

672
00:28:52,250 --> 00:28:55,830
don't have that
much of the very popular

673
00:28:55,830 --> 00:28:57,620
prefetchers, this kind of
issue you would see.

674
00:28:57,620 --> 00:29:00,610
So for example, if you go to a
cell phone [UNINTELLIGIBLE],

675
00:29:00,610 --> 00:29:03,770
you would probably see these
kind of issues happening.

676
00:29:03,770 --> 00:29:05,560
Any questions here so far?

677
00:29:05,560 --> 00:29:06,300
So that's the good news.

678
00:29:06,300 --> 00:29:08,280
You guys don't have to worry
about it too much.

679
00:29:08,280 --> 00:29:11,440
But at least it's good to know
the technique because you'll

680
00:29:11,440 --> 00:29:13,820
see it in other architectures.

681
00:29:13,820 --> 00:29:18,110
So now, I want to switch a
little bit into looking at

682
00:29:18,110 --> 00:29:21,630
programs that don't have what
we call data parallelism.

683
00:29:21,630 --> 00:29:24,320
That means you can start and
say, [UNINTELLIGIBLE]

684
00:29:24,320 --> 00:29:24,740
parallels.

685
00:29:24,740 --> 00:29:26,840
Everybody gets a different
chunk and run.

686
00:29:26,840 --> 00:29:30,110
And we are going a little bit
more deeply into looking at

687
00:29:30,110 --> 00:29:32,230
programs that are a little
bit different.

688
00:29:32,230 --> 00:29:36,540
So I wanted to come up with this
little representation to

689
00:29:36,540 --> 00:29:37,580
represent the program.

690
00:29:37,580 --> 00:29:42,090
And so if you think about
iteration space--

691
00:29:42,090 --> 00:29:44,090
actually, before I
go down to dependence,

692
00:29:44,090 --> 00:29:46,200
I'll also do a little
bit of load balance.

693
00:29:46,200 --> 00:29:49,890
So here's a loop that
in my iterations--

694
00:29:49,890 --> 00:29:53,690
the first one I transformed
zero to eight.

695
00:29:53,690 --> 00:29:55,670
But J only runs from
one to eight.

696
00:29:55,670 --> 00:29:58,630
So for each I, I have fewer
and fewer

697
00:29:58,630 --> 00:30:02,160
J iterations, basically.

698
00:30:02,160 --> 00:30:02,600
OK?

699
00:30:02,600 --> 00:30:05,100
It's a triangular loop.

700
00:30:05,100 --> 00:30:06,350
OK?

701
00:30:10,050 --> 00:30:10,250
OK.

702
00:30:10,250 --> 00:30:12,420
So this is the way to represent
iteration space, so

703
00:30:12,420 --> 00:30:15,600
I will represent data and
get back to this again.

704
00:30:15,600 --> 00:30:18,830
So if you look at a data space,
you can assume data

705
00:30:18,830 --> 00:30:23,740
iteration space could be this
funky, triangular, hyperplane

706
00:30:23,740 --> 00:30:24,760
type of thing.

707
00:30:24,760 --> 00:30:28,990
Whereas data is mostly
rectilinear, a

708
00:30:28,990 --> 00:30:30,740
multi-dimensional rectangle.

709
00:30:30,740 --> 00:30:32,810
So for example, if I have
[UNINTELLIGIBLE]

710
00:30:32,810 --> 00:30:35,140
and it's a one-dimensional
one, this is basically a

711
00:30:35,140 --> 00:30:36,050
two-dimensional data.

712
00:30:36,050 --> 00:30:37,480
And you can have
three-dimensional cubes and

713
00:30:37,480 --> 00:30:38,020
stuff like that.

714
00:30:38,020 --> 00:30:39,500
You can represent
data like that.

715
00:30:39,500 --> 00:30:41,170
So this is a way to
nicely represent.

716
00:30:41,170 --> 00:30:45,000
And when you start thinking
about it, we can look at

717
00:30:45,000 --> 00:30:45,850
what's going on.

718
00:30:45,850 --> 00:30:46,780
OK?

719
00:30:46,780 --> 00:30:49,870
So now you have this
loop again.

720
00:30:49,870 --> 00:30:52,820
So here's the basic
[UNINTELLIGIBLE] iterations.

721
00:30:52,820 --> 00:30:54,140
And here's the data.

722
00:30:54,140 --> 00:30:56,850
Assume this is both A and B.
There will be another one for

723
00:30:56,850 --> 00:30:57,480
matrix [UNINTELLIGIBLE]

724
00:30:57,480 --> 00:30:59,870
B. One data into each iteration
is going to touch.

725
00:30:59,870 --> 00:31:01,700
So these are the data that
need to get touched, and

726
00:31:01,700 --> 00:31:04,270
here's the iterations you
are going to run.

727
00:31:04,270 --> 00:31:09,720
So we can say OpenMP
parallel for.

728
00:31:09,720 --> 00:31:12,200
So what happens when you
do parallel for?

729
00:31:12,200 --> 00:31:14,630
So I am going to get parallel.

730
00:31:14,630 --> 00:31:17,930
And so core one gets this one,
another core, another core,

731
00:31:17,930 --> 00:31:20,610
another core get these
iterations running.

732
00:31:20,610 --> 00:31:22,398
So what happens if
you do this one?

733
00:31:26,520 --> 00:31:29,590
Do you get really good
performance?

734
00:31:29,590 --> 00:31:31,852
Why?

735
00:31:31,852 --> 00:31:34,570
AUDIENCE: [INAUDIBLE]

736
00:31:34,570 --> 00:31:35,080
PROFESSOR: It's not balanced.

737
00:31:35,080 --> 00:31:36,900
The load is not balanced
in here.

738
00:31:36,900 --> 00:31:40,150
So basically if you run
sequential and if you run

739
00:31:40,150 --> 00:31:45,990
block distribution, I get about
3x performance in here.

740
00:31:45,990 --> 00:31:49,470
So if I look closely, here
is the number of iterations

741
00:31:49,470 --> 00:31:51,340
given to each core.

742
00:31:51,340 --> 00:31:53,790
The first core gets almost
nothing, and the last guy gets

743
00:31:53,790 --> 00:31:55,380
a lot of work.

744
00:31:55,380 --> 00:31:58,510
Here's where something like the
Cilk runtime can come into

745
00:31:58,510 --> 00:32:01,955
play because with Cilk runtime,
basically, this guy

746
00:32:01,955 --> 00:32:03,040
will finish the [UNINTELLIGIBLE]

747
00:32:03,040 --> 00:32:04,620
start stealing from
somebody else.

748
00:32:04,620 --> 00:32:07,620
And so it would be
done nicely.

749
00:32:07,620 --> 00:32:09,860
But whereas if you do a
static schedule, you

750
00:32:09,860 --> 00:32:11,190
are in this big bind.

751
00:32:11,190 --> 00:32:13,270
You don't have too many
things going on.

752
00:32:17,990 --> 00:32:20,650
And basically, this is what
we call load imbalance.
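To make the imbalance concrete, here is a small model of a block (static) distribution over a triangular loop. It is a sketch under assumptions: the inner loop of row `i` is taken to run `i + 1` times, and the function names are made up; the actual loop bounds on the slide are not in the transcript.

```python
def inner_iterations(i):
    # Assumed triangular bound: row i runs the inner loop i + 1 times.
    return i + 1

def block_work(n, cores):
    """Inner iterations per core when rows are split into contiguous blocks."""
    chunk = (n + cores - 1) // cores   # rows per core, rounded up
    work = []
    for c in range(cores):
        rows = range(c * chunk, min((c + 1) * chunk, n))
        work.append(sum(inner_iterations(i) for i in rows))
    return work

work = block_work(8, 4)
print(work)                   # prints [3, 7, 11, 15]
print(sum(work) / max(work))  # best possible speedup on 4 cores: prints 2.4
```

The slowest core bounds the parallel time, so the last core's 15 iterations out of 36 cap the speedup well below 4 in this model.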

753
00:32:20,650 --> 00:32:24,760
So what you can do is figure out
a complicated partitioning

754
00:32:24,760 --> 00:32:27,645
so you can statically
partition this out.

755
00:32:27,645 --> 00:32:31,810
Or you can do something like the
dynamic scheduler like the

756
00:32:31,810 --> 00:32:32,530
[UNINTELLIGIBLE]

757
00:32:32,530 --> 00:32:35,720
scheduler for a solution.

758
00:32:35,720 --> 00:32:36,970
So how to detect
load imbalance?

759
00:32:39,290 --> 00:32:44,120
Basically, what you might want
to do is for each of the

760
00:32:44,120 --> 00:32:45,690
different sections you are
running, you want to look at

761
00:32:45,690 --> 00:32:47,520
the time it takes.

762
00:32:47,520 --> 00:32:50,530
And in the [UNINTELLIGIBLE] axis
varying, huge varying,

763
00:32:50,530 --> 00:32:52,260
that means there's a load
imbalance going on.

764
00:32:52,260 --> 00:32:55,345
So you might want to check how
much time each of the parallel

765
00:32:55,345 --> 00:32:58,330
regions is taking.

766
00:32:58,330 --> 00:33:01,370
And that gives you this view.

767
00:33:01,370 --> 00:33:04,620
How to eliminate load imbalance?
You can use a

768
00:33:04,620 --> 00:33:08,500
dynamic scheduler that
will deal with that.

769
00:33:08,500 --> 00:33:12,220
Or you can do a different
distribution statically.

770
00:33:12,220 --> 00:33:14,690
That will not partition
in this large block.

771
00:33:14,690 --> 00:33:16,900
So let me show you a static part
because we have already

772
00:33:16,900 --> 00:33:18,440
learned the dynamic
part before.

773
00:33:18,440 --> 00:33:21,165
So now instead of doing that,
we do a cyclic distribution.

774
00:33:21,165 --> 00:33:23,370
We use a static one.

775
00:33:23,370 --> 00:33:27,690
That means if you have a lot
more than and a little bit

776
00:33:27,690 --> 00:33:32,220
better distribution so what
happens to the processor?

777
00:33:32,220 --> 00:33:33,810
Zero gets this one
and this one.

778
00:33:33,810 --> 00:33:35,740
One gets this one
and this one.

779
00:33:35,740 --> 00:33:36,860
So on and so forth.

780
00:33:36,860 --> 00:33:39,380
So there would be a little bit
of imbalance there.

781
00:33:39,380 --> 00:33:42,830
But if you have enough of
cyclic, the imbalance would be

782
00:33:42,830 --> 00:33:45,550
much lower.
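The same kind of model for a cyclic distribution, where core `c` takes rows `c`, `c + cores`, `c + 2 * cores`, and so on. As a sketch it again assumes row `i` costs `i + 1` inner iterations, and `cyclic_work` is a hypothetical helper name.

```python
def cyclic_work(n, cores):
    """Inner iterations per core when row i goes to core i % cores."""
    # Assumed triangular bound: row i runs the inner loop i + 1 times.
    return [sum(i + 1 for i in range(c, n, cores)) for c in range(cores)]

small = cyclic_work(8, 4)
print(small)                    # prints [6, 8, 10, 12]

large = cyclic_work(1024, 4)
print(max(large) / min(large))  # close to 1: the imbalance shrinks as n grows
```

With only eight rows the cores are still uneven, but with many more rows than cores the per-core totals differ by well under one percent in this model.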

783
00:33:45,550 --> 00:33:47,610
So should we run faster?

784
00:33:51,070 --> 00:33:56,470
So here's the iterations
each guy gets in here.

785
00:33:56,470 --> 00:33:59,360
This looks very balanced because
I had a lot more

786
00:33:59,360 --> 00:34:01,810
iterations than this
eight one.

787
00:34:01,810 --> 00:34:03,420
This is not that balanced here
because this guy gets a lot

788
00:34:03,420 --> 00:34:04,800
more than the first one.

789
00:34:04,800 --> 00:34:06,820
The first one gets six.

790
00:34:06,820 --> 00:34:09,070
And the second and last
one gets a lot more.

791
00:34:13,775 --> 00:34:16,560
Uh oh.

792
00:34:16,560 --> 00:34:18,060
What do you think is
happening here now?

793
00:34:22,040 --> 00:34:23,290
I ran again slower.

794
00:34:25,870 --> 00:34:28,429
See I guess the people in class
last year had things

795
00:34:28,429 --> 00:34:31,199
worse because they had this
old processor that did all

796
00:34:31,199 --> 00:34:33,010
these crazy things on them.

797
00:34:33,010 --> 00:34:35,830
And you guys get the fast one
that doesn't do that.

798
00:34:35,830 --> 00:34:41,310
So why do you think cyclic
distribution is

799
00:34:41,310 --> 00:34:43,639
running a lot slower?

800
00:34:43,639 --> 00:34:44,454
What might be going on?

801
00:34:44,454 --> 00:34:45,704
AUDIENCE: [INAUDIBLE]

802
00:34:47,420 --> 00:34:49,219
PROFESSOR: Spoiling
[UNINTELLIGIBLE] it's not that

803
00:34:49,219 --> 00:34:51,830
much because if you don't run
this and synchronize, what you

804
00:34:51,830 --> 00:34:54,929
do is you run the same amount of
threads and say, now, instead

805
00:34:54,929 --> 00:34:59,260
of running continuously, you
run jumping all iterations.

806
00:34:59,260 --> 00:35:01,810
You should run zero and nine
or whatever jump over

807
00:35:01,810 --> 00:35:03,060
iterations.

808
00:35:12,500 --> 00:35:13,130
Why do you think?

809
00:35:13,130 --> 00:35:14,630
AUDIENCE: [INAUDIBLE]

810
00:35:14,630 --> 00:35:16,240
PROFESSOR: Yeah, there's
a cache issue.

811
00:35:16,240 --> 00:35:19,980
All this time and the question
is not sure, it's probably a

812
00:35:19,980 --> 00:35:20,470
cache issue.

813
00:35:20,470 --> 00:35:22,550
What type of cache issue do
you think is going on?

814
00:35:22,550 --> 00:35:23,800
AUDIENCE: [INAUDIBLE]

815
00:35:28,920 --> 00:35:29,410
PROFESSOR: Yeah.

816
00:35:29,410 --> 00:35:31,010
[UNINTELLIGIBLE].

817
00:35:31,010 --> 00:35:34,280
But let me show you
what happens.

818
00:35:34,280 --> 00:35:35,580
So you get off then.

819
00:35:35,580 --> 00:35:37,780
OK, so if you look at--

820
00:35:37,780 --> 00:35:40,150
the data is here so let's
look at what happens.

821
00:35:40,150 --> 00:35:42,950
So this is running a
[UNINTELLIGIBLE] lower.

822
00:35:42,950 --> 00:35:45,460
It's showing a lot more
instructions, but instruction

823
00:35:45,460 --> 00:35:47,710
doesn't tell you too much
because a lot of them might be

824
00:35:47,710 --> 00:35:49,360
missing synchronization
costs in here.

825
00:35:49,360 --> 00:35:52,690
So instruction is not that
illuminating here.

826
00:35:52,690 --> 00:35:54,850
The big illumination here
is this one again.

827
00:35:54,850 --> 00:35:56,950
Invalidations.

828
00:35:56,950 --> 00:35:59,300
I have a huge amount of
invalidations going on.

829
00:35:59,300 --> 00:36:04,330
So here is a case of
false sharing.

830
00:36:04,330 --> 00:36:08,080
So what happens is now things
next to each other go

831
00:36:08,080 --> 00:36:09,550
to multiple different
processors.

832
00:36:09,550 --> 00:36:10,890
We're not touching
the same data.

833
00:36:10,890 --> 00:36:13,250
Everybody's looking at
somebody else's data.

834
00:36:13,250 --> 00:36:16,290
So what happens is assume I want
to write this data item.

835
00:36:16,290 --> 00:36:17,550
I like that data item.

836
00:36:17,550 --> 00:36:22,430
But I get the entire cache line
because when I ask for

837
00:36:22,430 --> 00:36:25,330
that, I get my synchronization
by the cache line.

838
00:36:25,330 --> 00:36:28,830
I get this entire cache line
coming in here into this one.

839
00:36:28,830 --> 00:36:30,550
And the next guys
[UNINTELLIGIBLE] at me.

840
00:36:30,550 --> 00:36:33,020
This core wants to write this
data because instead of

841
00:36:33,020 --> 00:36:36,000
blocks, I basically
give each strips.

842
00:36:36,000 --> 00:36:37,840
There's a lot of overlap
between strips.

843
00:36:37,840 --> 00:36:41,000
So this guy says, to write
this one, I have to get the

844
00:36:41,000 --> 00:36:43,580
entire cache line
going back here.

845
00:36:43,580 --> 00:36:45,630
And so if you want to write that
again, I had to get the

846
00:36:45,630 --> 00:36:47,680
entire cache line going back
even though we are writing

847
00:36:47,680 --> 00:36:48,790
different data.

848
00:36:48,790 --> 00:36:52,010
Because we are sharing
cache lines in here.

849
00:36:52,010 --> 00:36:53,880
This thing bounces back
and forth, back and

850
00:36:53,880 --> 00:36:54,960
forth, back and forth.

851
00:36:54,960 --> 00:36:56,520
I have a lot of cache
[UNINTELLIGIBLE].

852
00:36:56,520 --> 00:36:59,200
Things are really shot.

853
00:36:59,200 --> 00:37:00,160
OK?

854
00:37:00,160 --> 00:37:03,740
And so what happens here is if
you look at the cache lines--

855
00:37:03,740 --> 00:37:05,140
there's my animation.

856
00:37:05,140 --> 00:37:07,280
So cache lines basically
mess this all up.

857
00:37:07,280 --> 00:37:08,690
You can see that really
carefully.

858
00:37:08,690 --> 00:37:12,500
What happens is between these
lines, there would be some

859
00:37:12,500 --> 00:37:13,750
overlap of cache lines.

860
00:37:15,710 --> 00:37:17,890
And this overlap in cache lines
keeps bouncing back and

861
00:37:17,890 --> 00:37:19,990
forth, back and forth in here.

862
00:37:19,990 --> 00:37:23,630
And so what happens is basically
cache lines are

863
00:37:23,630 --> 00:37:28,850
bigger than the data size, or
there's overlap in here, and

864
00:37:28,850 --> 00:37:31,480
the cache line is shared when
the data is not shared.
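One way to see where the false sharing comes from is to count the cache lines that more than one core writes. This is a sketch under assumptions: a row-major array of 8-byte elements, 64-byte cache lines, and made-up sizes (`ROWS`, `COLS`, `CORES`); the function name is hypothetical and real line sizes vary by machine.

```python
from collections import defaultdict

def falsely_shared_lines(rows, cols, owner, elem_bytes=8, line_bytes=64):
    """Count cache lines written by more than one core.

    The array is row-major and owner(i) names the core that writes row i.
    No element is written by two cores, so any shared line is false sharing.
    """
    writers = defaultdict(set)
    for i in range(rows):
        for j in range(cols):
            line = (i * cols + j) * elem_bytes // line_bytes
            writers[line].add(owner(i))
    return sum(1 for cores in writers.values() if len(cores) > 1)

ROWS, COLS, CORES = 16, 12, 4   # 12 elements per row = 1.5 cache lines
block = falsely_shared_lines(ROWS, COLS, lambda i: i // (ROWS // CORES))
cyclic = falsely_shared_lines(ROWS, COLS, lambda i: i % CORES)
padded = falsely_shared_lines(ROWS, 16, lambda i: i % CORES)  # rows padded to 2 full lines
print(block, cyclic, padded)    # prints 0 8 0
```

In this model the cyclic distribution puts neighboring rows on different cores, so every line that straddles two rows is shared, while padding each row out to a whole number of lines removes the overlap.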

865
00:37:31,480 --> 00:37:37,050
And so how to detect false
sharing? Too many conflicts.

866
00:37:37,050 --> 00:37:41,340
You assume this is a nice
parallelism, but suddenly, you

867
00:37:41,340 --> 00:37:43,840
don't have a speed up, and you
have a lot of conflicts here,

868
00:37:43,840 --> 00:37:47,090
even though there isn't
supposed to be sharing.

869
00:37:47,090 --> 00:37:49,770
And how to eliminate
false sharing.

870
00:37:49,770 --> 00:37:54,250
Make data used by each
core contiguous in memory.

871
00:37:54,250 --> 00:37:55,550
That's a good way
of doing that.

872
00:37:55,550 --> 00:37:57,360
Or pad at the end.

873
00:37:57,360 --> 00:38:00,330
So these kind of at the corners,
there's not going to

874
00:38:00,330 --> 00:38:02,830
be any overlapping.

875
00:38:02,830 --> 00:38:07,770
So in here, one thing you can
do is, you can measure each

876
00:38:07,770 --> 00:38:10,370
thing that each of
the cores get.

877
00:38:10,370 --> 00:38:11,350
We can make [UNINTELLIGIBLE].

878
00:38:11,350 --> 00:38:14,560
But before, what happened was
a core used to get this

879
00:38:14,560 --> 00:38:15,970
line and this line.

880
00:38:15,970 --> 00:38:17,730
They are in different
places in memory.

881
00:38:17,730 --> 00:38:20,220
But you can make these two
contiguous in memory by

882
00:38:20,220 --> 00:38:22,190
basically now, instead of
having a two-dimensional

883
00:38:22,190 --> 00:38:23,940
array, you made that a

884
00:38:23,940 --> 00:38:26,210
three-dimensional or disarrays.

885
00:38:26,210 --> 00:38:27,890
AUDIENCE: Can you
say that again?

886
00:38:27,890 --> 00:38:31,030
PROFESSOR: So before, what
happened was each of them

887
00:38:31,030 --> 00:38:34,210
were going to get this line
and this line, each core.

888
00:38:34,210 --> 00:38:38,070
All these lines that were in
different parts of the memory.

889
00:38:38,070 --> 00:38:40,510
In here, each would get
only two lines.

890
00:38:40,510 --> 00:38:41,570
But they're in a different
place.

891
00:38:41,570 --> 00:38:43,845
So if you have more cyclic,
you'll get a lot more lines

892
00:38:43,845 --> 00:38:44,820
all over memory.

893
00:38:44,820 --> 00:38:47,830
So what we can do is we
can arrange the cache.

894
00:38:47,830 --> 00:38:50,560
So if you think about this, you
can think the cache, now

895
00:38:50,560 --> 00:38:53,220
the data, is instead of
two-dimensions is

896
00:38:53,220 --> 00:38:55,390
three-dimensional data.

897
00:38:55,390 --> 00:38:58,360
One dimension is this
cyclic part in here.

898
00:38:58,360 --> 00:38:59,450
So we can do that.

899
00:38:59,450 --> 00:39:04,030
And then, you can change any way
that the cyclic part, the

900
00:39:04,030 --> 00:39:06,290
one that I got this line and
this line, now become

901
00:39:06,290 --> 00:39:07,660
contiguous.

902
00:39:07,660 --> 00:39:10,390
So instead of thinking about
data as two-dimensional,

903
00:39:10,390 --> 00:39:11,790
you think about it as a cube.

904
00:39:11,790 --> 00:39:14,690
And you kind of change the cube
for the inner dimension

905
00:39:14,690 --> 00:39:16,670
to be the one that's
contiguous.

906
00:39:16,670 --> 00:39:18,980
So you can do data
[UNINTELLIGIBLE]

907
00:39:18,980 --> 00:39:20,230
transformation and get there.
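
The layout change described above can be sketched in a few lines. This is only an illustration under assumed sizes: `N` rows are dealt cyclically to `P` cores, each row has `M` elements, and the names `flat_2d` and `flat_3d` are hypothetical. It compares which memory offsets core 0 touches before and after treating the array as `[core][local_row][col]`.

```python
# Hypothetical sizes: N rows dealt cyclically to P cores, M elements per row.
N, M, P = 8, 4, 2

def flat_2d(row, col):
    """Offset of element (row, col) in the original row-major 2-D array."""
    return row * M + col

def flat_3d(row, col):
    """Offset after the transformation to a [core][local_row][col] layout."""
    core, local_row = row % P, row // P
    return (core * (N // P) + local_row) * M + col

def offsets_for_core(core, flat):
    """All offsets one core touches under the cyclic row distribution."""
    return sorted(flat(r, c) for r in range(core, N, P) for c in range(M))

def is_contiguous(offsets):
    """True if the offsets form one unbroken run of addresses."""
    return offsets == list(range(offsets[0], offsets[0] + len(offsets)))

before = offsets_for_core(0, flat_2d)   # rows 0, 2, 4, 6 scattered in memory
after = offsets_for_core(0, flat_3d)    # the same rows packed together
print(is_contiguous(before), is_contiguous(after))
```

With the transformed layout, two cores can only meet on a cache line at the single boundary between their blocks, and padding can remove that case too.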

908
00:39:23,290 --> 00:39:29,130
So now what happens is the data
of core zero just gets

909
00:39:29,130 --> 00:39:30,590
contiguous in memory.

910
00:39:30,590 --> 00:39:32,370
And core one gets contiguous
in memory.

911
00:39:32,370 --> 00:39:34,600
So if you're trying to make it
contiguous, that's great.

912
00:39:34,600 --> 00:39:37,930
So between padding and making
things contiguous, you can get

913
00:39:37,930 --> 00:39:38,840
good performance.

914
00:39:38,840 --> 00:39:41,520
And if you do data
transformation, voila!

915
00:39:41,520 --> 00:39:44,940
My invalidations just went
down drastically.

916
00:39:44,940 --> 00:39:47,900
I again have a nice load
balancing here.

917
00:39:47,900 --> 00:39:50,100
Invalidations went
down drastically.

918
00:39:50,100 --> 00:39:53,660
That means my [UNINTELLIGIBLE]
increased a little bit and I

919
00:39:53,660 --> 00:39:56,600
get really nice speed up.

920
00:39:56,600 --> 00:40:01,320
So here are the kind of crazy
things you have to do if you

921
00:40:01,320 --> 00:40:05,200
are doing things like
algorithms that

922
00:40:05,200 --> 00:40:07,100
are not cache oblivious.

923
00:40:07,100 --> 00:40:08,820
And if you are directly
parallelizing yourself without

924
00:40:08,820 --> 00:40:11,890
letting a nice [UNINTELLIGIBLE]

925
00:40:11,890 --> 00:40:12,660
time to help you.

926
00:40:12,660 --> 00:40:16,580
Something like a
[UNINTELLIGIBLE] assistant.

927
00:40:16,580 --> 00:40:19,710
So I'm just going to
summarize this

928
00:40:19,710 --> 00:40:20,990
because this is important.

929
00:40:20,990 --> 00:40:22,630
We looked at a bunch
of cache issues.

930
00:40:22,630 --> 00:40:25,420
We looked at cold misses,
capacity misses, and

931
00:40:25,420 --> 00:40:26,970
conflict misses before.

932
00:40:26,970 --> 00:40:29,390
And today, here are
some examples of

933
00:40:29,390 --> 00:40:31,340
true sharing misses.

934
00:40:31,340 --> 00:40:36,160
What happened was I am actually
really using data,

935
00:40:36,160 --> 00:40:41,880
but I set up my parallelism
in such a way that between

936
00:40:41,880 --> 00:40:46,000
different executions, my data
has to move across.

937
00:40:46,000 --> 00:40:47,030
[UNINTELLIGIBLE]

938
00:40:47,030 --> 00:40:51,230
So I am truly sharing data,
but the data has to go to

939
00:40:51,230 --> 00:40:52,250
somebody else's cache.

940
00:40:52,250 --> 00:40:53,110
So I've got a lot of
[UNINTELLIGIBLE]

941
00:40:53,110 --> 00:40:54,830
violations here.

942
00:40:54,830 --> 00:40:58,420
And the other one is more like
false sharing, where you

943
00:40:58,420 --> 00:41:01,240
assume there's no sharing, nice
parallelism, everything,

944
00:41:01,240 --> 00:41:03,120
except the program
runs very slow.

945
00:41:03,120 --> 00:41:05,610
And that can be because
of false sharing.

946
00:41:05,610 --> 00:41:09,550
So we just kind of touch
on these two topics.

947
00:41:09,550 --> 00:41:10,660
OK?

948
00:41:10,660 --> 00:41:14,340
So let me switch gears a little
bit and talk about dependences.

949
00:41:14,340 --> 00:41:16,380
We touched on the dependences
a little bit.

950
00:41:16,380 --> 00:41:19,070
And these are two fine
programs that are not

951
00:41:19,070 --> 00:41:20,410
completely parallel.

952
00:41:20,410 --> 00:41:24,710
So normally, what happens is a
true dependence means that I'm

953
00:41:24,710 --> 00:41:26,570
writing and reading
[UNINTELLIGIBLE]

954
00:41:26,570 --> 00:41:27,660
other way out.

955
00:41:27,660 --> 00:41:30,400
And if two guys are both
writing, then the order has

956
00:41:30,400 --> 00:41:32,780
to be maintained; that would
be an output dependence.

957
00:41:32,780 --> 00:41:38,650
And did our dependence
even loop, because

958
00:41:38,650 --> 00:41:40,880
these are single items.

959
00:41:40,880 --> 00:41:43,760
So if you have an array here,
this is becoming a lot more

960
00:41:43,760 --> 00:41:44,600
complicated.

961
00:41:44,600 --> 00:41:46,200
Because there's no simple
thing in here.

962
00:41:46,200 --> 00:41:48,450
Because it's not just using
the same iteration.

963
00:41:48,450 --> 00:41:51,910
You might be using data from
different iterations.

964
00:41:51,910 --> 00:41:55,920
So what happens is there's
a dynamic instance of

965
00:41:55,920 --> 00:41:56,840
iterations.

966
00:41:56,840 --> 00:41:59,230
So one iteration writes the
data, and somebody else might

967
00:41:59,230 --> 00:42:01,310
be reading the data.

968
00:42:01,310 --> 00:42:03,490
And that is basically
the order we have to

969
00:42:03,490 --> 00:42:03,990
[UNINTELLIGIBLE].

970
00:42:03,990 --> 00:42:05,060
Let me give you an example.

971
00:42:05,060 --> 00:42:07,580
This kind of demonstrates
what's going on.

972
00:42:07,580 --> 00:42:08,170
OK?

973
00:42:08,170 --> 00:42:10,500
And when you edit, you say
look, this is where you

974
00:42:10,500 --> 00:42:10,910
[UNINTELLIGIBLE]

975
00:42:10,910 --> 00:42:12,590
complicated.

976
00:42:12,590 --> 00:42:14,380
So in order to give you
an example, let me

977
00:42:14,380 --> 00:42:15,876
look at this program.

978
00:42:15,876 --> 00:42:17,150
I have a simple program here.

979
00:42:17,150 --> 00:42:19,210
Ai equals Ai plus one.

980
00:42:19,210 --> 00:42:19,990
My iterations--

981
00:42:19,990 --> 00:42:21,570
I'm running five iterations
in here.

982
00:42:21,570 --> 00:42:23,420
So this is my iteration space.

983
00:42:23,420 --> 00:42:25,590
I have a large array, so
this is my data space.

984
00:42:28,090 --> 00:42:29,990
And now, I keep running
this program.

985
00:42:29,990 --> 00:42:32,800
So what happens is this is
time going down in here.

986
00:42:32,800 --> 00:42:35,930
So in the first iteration,
basically, I first read and

987
00:42:35,930 --> 00:42:36,560
then write.

988
00:42:36,560 --> 00:42:39,350
Same in the second iteration,
I read and write.

989
00:42:39,350 --> 00:42:40,700
Third iteration read
and write.

990
00:42:40,700 --> 00:42:42,770
Fourth iteration,
read and write.

991
00:42:42,770 --> 00:42:44,610
Do you see how this is going
in these four iterations?

992
00:42:44,610 --> 00:42:46,050
Second iteration, third
iteration, fourth iteration,

993
00:42:46,050 --> 00:42:48,115
[UNINTELLIGIBLE].

994
00:42:48,115 --> 00:42:48,580
OK.

995
00:42:48,580 --> 00:42:50,710
So what happens is the first
iteration reads

996
00:42:50,710 --> 00:42:53,810
this value, A zero.

997
00:42:53,810 --> 00:42:56,170
And it writes the value A zero
when it is writing.

998
00:42:56,170 --> 00:43:00,930
Second iteration A1,
A1, A2, A2, A3, A3.

999
00:43:00,930 --> 00:43:07,290
So now, when this is writing,
that's a dependence

1000
00:43:07,290 --> 00:43:08,160
between these two.

1001
00:43:08,160 --> 00:43:10,150
Is it a true, anti, or
output dependence

1002
00:43:10,150 --> 00:43:11,400
between these two.

1003
00:43:15,545 --> 00:43:18,270
What type of dependence
do we have?

1004
00:43:18,270 --> 00:43:19,520
[UNINTELLIGIBLE] dependence.

1005
00:43:24,240 --> 00:43:26,340
True dependence is what?

1006
00:43:26,340 --> 00:43:28,780
What to what?

1007
00:43:28,780 --> 00:43:29,940
What's the first thing
that occurs?

1008
00:43:29,940 --> 00:43:33,620
What's the next thing
that occurs?

1009
00:43:33,620 --> 00:43:34,875
Anybody want to answer?

1010
00:43:40,820 --> 00:43:43,140
AUDIENCE: [INAUDIBLE]

1011
00:43:43,140 --> 00:43:43,900
PROFESSOR: Write to read.

1012
00:43:43,900 --> 00:43:45,720
So you have the first thing
has to be write to read.

1013
00:43:45,720 --> 00:43:46,570
Watch this.

1014
00:43:46,570 --> 00:43:48,800
This is a read to write.

1015
00:43:48,800 --> 00:43:51,230
So what type of dependence
is this?

1016
00:43:51,230 --> 00:43:52,240
This is an anti-dependence.

1017
00:43:52,240 --> 00:43:54,680
So here is an anti-dependence
in here.

1018
00:43:54,680 --> 00:43:57,710
But the nice thing about that is
this dependence didn't cross

1019
00:43:57,710 --> 00:43:58,940
the iteration boundary.

1020
00:43:58,940 --> 00:44:01,050
So these black lines are my
iteration boundaries.

1021
00:44:01,050 --> 00:44:03,190
So these are four iterations
that [UNINTELLIGIBLE].

1022
00:44:03,190 --> 00:44:06,780
So there's no iteration
crossing in here.

1023
00:44:06,780 --> 00:44:09,360
You can kind of [UNINTELLIGIBLE]
it using each

1024
00:44:09,360 --> 00:44:13,590
of these iterations and my
dependencies within the very

1025
00:44:13,590 --> 00:44:14,360
same iteration.

1026
00:44:14,360 --> 00:44:18,116
So the same iteration I have
dependency [UNINTELLIGIBLE].

1027
00:44:18,116 --> 00:44:18,590
OK?

1028
00:44:18,590 --> 00:44:19,910
This is a simpler case.
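
As a small check of this simpler case, the sketch below runs the loop body `A[i] = A[i] + 1` in two different iteration orders. The initial data and the name `run_same_element` are made up for illustration. Because the read-then-write pair never leaves its own iteration, the order should not matter.

```python
def run_same_element(order, n=5):
    """Apply A[i] = A[i] + 1 for each i in the given iteration order."""
    a = list(range(0, 10 * n, 10))      # arbitrary initial data
    for i in order:
        a[i] = a[i] + 1                 # read and write stay in iteration i
    return a

forward = run_same_element(range(5))
backward = run_same_element([4, 3, 2, 1, 0])
print(forward == backward)
```

Any permutation of the iterations gives the same array, which is what makes the loop safe to run in parallel.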

1029
00:44:19,910 --> 00:44:24,060
So let's look at something a
little bit more complicated.

1030
00:44:24,060 --> 00:44:28,560
So I have Ai plus 1
equals Ai plus 1.

1031
00:44:28,560 --> 00:44:32,430
So what happens is first
I am reading Ai.

1032
00:44:32,430 --> 00:44:39,270
And then, I am writing Ai plus
1 in the same iteration.

1033
00:44:39,270 --> 00:44:41,100
The next iteration, I
am reading now Ai

1034
00:44:41,100 --> 00:44:42,050
[UNINTELLIGIBLE]

1035
00:44:42,050 --> 00:44:44,250
this is A0 and 1.

1036
00:44:44,250 --> 00:44:45,890
This is A is 1.

1037
00:44:45,890 --> 00:44:46,410
[UNINTELLIGIBLE]

1038
00:44:46,410 --> 00:44:47,880
I am writing Ai plus 1.

1039
00:44:47,880 --> 00:44:49,130
I am writing 2.

1040
00:44:51,610 --> 00:44:54,070
So I have a dependence
like this now.

1041
00:44:54,070 --> 00:44:55,320
What type of dependence
is this?

1042
00:44:58,190 --> 00:45:00,560
This is a true dependence
because I am writing.

1043
00:45:00,560 --> 00:45:04,610
And this is actually reading
what it is writing.

1044
00:45:04,610 --> 00:45:08,720
So does this look parallel?

1045
00:45:08,720 --> 00:45:09,590
No.

1046
00:45:09,590 --> 00:45:13,790
Because what happens is, if you
look at it, each iteration depends

1047
00:45:13,790 --> 00:45:15,680
on the previous iteration.

1048
00:45:15,680 --> 00:45:18,040
So you have to actually have
this dependence going back and

1049
00:45:18,040 --> 00:45:20,890
forth, back and forth in here.
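
To see why this loop cannot simply be run in any order, here is a minimal sketch of `A[i+1] = A[i] + 1`, assuming the array starts as all zeros (the name `run_carried` is hypothetical). Each iteration reads the value the previous iteration wrote, so changing the order changes the answer.

```python
def run_carried(order, n=5):
    """Apply A[i+1] = A[i] + 1 for each i in the given iteration order."""
    a = [0] * (n + 1)
    for i in order:
        a[i + 1] = a[i] + 1             # reads what iteration i-1 wrote
    return a

in_order = run_carried(range(5))
reverse = run_carried(reversed(range(5)))
print(in_order)
print(reverse)
```

The in-order run builds a chain of increasing values, while the reversed run reads every element before its producer has written it.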

1050
00:45:20,890 --> 00:45:23,550
So let's look at a couple
more other things.

1051
00:45:23,550 --> 00:45:26,830
So here is Ai equals
Ai plus 2.

1052
00:45:26,830 --> 00:45:30,660
So I am basically reading
Ai plus 2.

1053
00:45:30,660 --> 00:45:31,670
So I am reading this one.

1054
00:45:31,670 --> 00:45:33,220
I am writing this one.

1055
00:45:33,220 --> 00:45:33,810
Reading this one.

1056
00:45:33,810 --> 00:45:36,060
Writing this one.

1057
00:45:36,060 --> 00:45:38,040
Here is my dependence
that's in here.

1058
00:45:38,040 --> 00:45:39,620
Is it true or
anti in here?

1059
00:45:42,210 --> 00:45:44,150
This is anti-dependence because
I am going from a

1060
00:45:44,150 --> 00:45:47,160
reading to a write in here.

1061
00:45:47,160 --> 00:45:48,510
Can this loop be parallel?

1062
00:45:55,240 --> 00:45:57,744
Can this loop run parallel?

1063
00:45:57,744 --> 00:46:00,440
AUDIENCE: [INAUDIBLE]

1064
00:46:00,440 --> 00:46:04,090
PROFESSOR: So can every
iteration run parallel?

1065
00:46:04,090 --> 00:46:05,030
There could be basically.

1066
00:46:05,030 --> 00:46:07,590
No because what happens is if
you look at that, there's a

1067
00:46:07,590 --> 00:46:09,690
dependence that goes
like this.

1068
00:46:09,690 --> 00:46:11,060
And of course, there
are two chains.

1069
00:46:11,060 --> 00:46:14,170
So if you are interested, you
can run at least two-way

1070
00:46:14,170 --> 00:46:15,210
parallelism.

1071
00:46:15,210 --> 00:46:18,200
You can run one chain parallel
to another chain, and you do

1072
00:46:18,200 --> 00:46:21,130
get that much parallelism.
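
The two chains can be sketched as follows, reading the spoken "Ai equals Ai plus 2" as `A[i] = A[i+2]` (an assumption about the slide). The even iterations form one chain and the odd iterations another; each chain must keep its own order, but the two chains never touch the same elements.

```python
def sequential(a):
    """Reference: run A[i] = A[i+2] in the original loop order."""
    a = list(a)
    for i in range(len(a) - 2):
        a[i] = a[i + 2]
    return a

def two_chains(a):
    """Run the odd chain, then the even chain; they could be on two cores."""
    a = list(a)
    n = len(a) - 2
    for start in (1, 0):
        for i in range(start, n, 2):    # order inside a chain is preserved
            a[i] = a[i + 2]
    return a

data = [0, 1, 2, 3, 4, 5, 6]
print(sequential(data) == two_chains(data))
```

Here the chains run one after the other only to keep the sketch simple; the point is that either chain may go first without changing the result.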

1073
00:46:21,130 --> 00:46:22,380
How about this one?

1074
00:46:24,740 --> 00:46:26,970
2i and 2i plus 1.

1075
00:46:26,970 --> 00:46:28,220
[UNINTELLIGIBLE]

1076
00:46:29,890 --> 00:46:31,140
Is there a dependence in here?

1077
00:46:34,740 --> 00:46:36,120
Nope because one is--

1078
00:46:36,120 --> 00:46:38,300
you are reading the odd elements
and writing the even

1079
00:46:38,300 --> 00:46:39,210
elements [UNINTELLIGIBLE]

1080
00:46:39,210 --> 00:46:39,730
dependence.

1081
00:46:39,730 --> 00:46:42,100
So you can run this in
parallel.

1082
00:46:42,100 --> 00:46:43,230
OK?
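
A quick way to convince yourself of the last case is to compare the set of elements written with the set of elements read. The sketch assumes the loop body is `A[2i] = A[2i+1]`, which matches "reading odd, writing even" but is a guess at the exact slide; `run_pairs` is a made-up name.

```python
def run_pairs(order, n=4):
    """Apply A[2i] = A[2i+1] for each i in the given iteration order."""
    a = list(range(2 * n))
    for i in order:
        a[2 * i] = a[2 * i + 1]         # write an even slot, read an odd slot
    return a

writes = {2 * i for i in range(4)}      # even indices only
reads = {2 * i + 1 for i in range(4)}   # odd indices only
print(writes & reads)                   # empty: no element is both read and written
print(run_pairs(range(4)) == run_pairs([3, 1, 0, 2]))
```

Since no iteration writes anything another iteration reads or writes, every iteration can run in parallel.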

1083
00:46:43,230 --> 00:46:46,210
So this is the kind of
interesting thing

1084
00:46:46,210 --> 00:46:46,890
that is going on.

1085
00:46:46,890 --> 00:46:51,150
So next, I want to look at
something a little bit more

1086
00:46:51,150 --> 00:46:52,130
complicated.

1087
00:46:52,130 --> 00:46:54,400
So let's look at this.

1088
00:46:54,400 --> 00:46:59,850
So here's a classic algorithm
called successive over

1089
00:46:59,850 --> 00:47:01,320
relaxation.

1090
00:47:01,320 --> 00:47:05,050
So it kind of simulates a lot of
times things like heat flow

1091
00:47:05,050 --> 00:47:06,460
through a plane.

1092
00:47:06,460 --> 00:47:08,830
So the idea there is--

1093
00:47:08,830 --> 00:47:10,500
let me illustrate
what it does.

1094
00:47:10,500 --> 00:47:16,750
So assume you have a
big metal sheet.

1095
00:47:16,750 --> 00:47:20,790
And you put some kind of heat
source in one place.

1096
00:47:20,790 --> 00:47:23,640
And after some time, it all
reaches a steady state.

1097
00:47:23,640 --> 00:47:24,920
The other side might be cold.

1098
00:47:24,920 --> 00:47:28,190
And you want to know part of
the sheet's temperature.

1099
00:47:31,070 --> 00:47:34,700
Because temperature
can leak out.

1100
00:47:34,700 --> 00:47:40,515
And there are more things like
you have a heat source and

1101
00:47:40,515 --> 00:47:44,480
others that [UNINTELLIGIBLE]
to work a glass of water or

1102
00:47:44,480 --> 00:47:45,300
some kind of a sink.

1103
00:47:45,300 --> 00:47:47,650
So what is the heat
distribution?

1104
00:47:47,650 --> 00:47:51,120
So one way to compute that, this
is basically the same

1105
00:47:51,120 --> 00:47:51,550
[UNINTELLIGIBLE].

1106
00:47:51,550 --> 00:47:55,230
The heat value here is basically
the average around

1107
00:47:55,230 --> 00:47:59,690
all these other values
right now.

1108
00:47:59,690 --> 00:48:01,700
Because if something is too
hot, the heat is going to

1109
00:48:01,700 --> 00:48:03,040
propagate something
that is too cold.

1110
00:48:03,040 --> 00:48:04,800
The heat is going to propagate
because it kind of has to be

1111
00:48:04,800 --> 00:48:06,820
average around that.

1112
00:48:06,820 --> 00:48:07,840
Then, you take the
average in here.

1113
00:48:07,840 --> 00:48:09,260
So what it's doing
is calculating

1114
00:48:09,260 --> 00:48:11,200
the average in here.

1115
00:48:11,200 --> 00:48:12,810
And then, you have to do
it many, many times.

1116
00:48:12,810 --> 00:48:14,480
So if you have a heat source,
at that point, it

1117
00:48:14,480 --> 00:48:15,300
would be very hot.

1118
00:48:15,300 --> 00:48:17,790
And then, it will start
propagating slowly and kind of

1119
00:48:17,790 --> 00:48:18,810
propagate down.

1120
00:48:18,810 --> 00:48:21,890
And the cold side in this way
or after running many times,

1121
00:48:21,890 --> 00:48:23,180
it basically stabilizes.

1122
00:48:23,180 --> 00:48:25,350
And at that point, you have the
kind of heat distribution

1123
00:48:25,350 --> 00:48:26,180
that we [UNINTELLIGIBLE] have.

1124
00:48:26,180 --> 00:48:28,220
This is the kind of calculation
you do.

1125
00:48:28,220 --> 00:48:30,070
So this is the calculation.

1126
00:48:30,070 --> 00:48:31,410
So what you're doing is
calculating this.

1127
00:48:31,410 --> 00:48:35,220
You are reading this,
this, this, and this

1128
00:48:35,220 --> 00:48:36,370
and updating that.

1129
00:48:36,370 --> 00:48:38,480
And then, you do it
for t time steps.

1130
00:48:38,480 --> 00:48:41,380
So you just go around doing each
of these things first and

1131
00:48:41,380 --> 00:48:44,284
doing it for t time steps.

1132
00:48:44,284 --> 00:48:44,740
OK?
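
The calculation just described can be written as a triple loop. This is a minimal sketch that assumes the update is a plain average of a point and its four neighbours, done in place; the real SOR slide may use a different weighting, and `sor_sweeps` is a hypothetical name. Boundary rows and columns are left fixed and act as the heat source or sink.

```python
def sor_sweeps(a, steps):
    """Repeat an in-place five-point average over the interior of grid a."""
    n = len(a) - 2                      # interior points are 1..n
    for t in range(steps):              # time steps
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                a[i][j] = (a[i][j] + a[i - 1][j] + a[i + 1][j]
                           + a[i][j - 1] + a[i][j + 1]) / 5.0
    return a

# A 4x4 sheet whose top edge is held at 100 degrees; everything else starts cold.
grid = [[100.0] * 4] + [[0.0] * 4 for _ in range(3)]
sor_sweeps(grid, 1)
print(grid[1][1], grid[1][2])
```

After one time step the second interior point already includes the freshly updated value of the first one, because the update happens in place.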

1133
00:48:44,740 --> 00:48:46,608
So we would like to
run this parallel.

1134
00:48:49,620 --> 00:48:53,650
So here's my basically
data space.

1135
00:48:53,650 --> 00:48:55,270
There's my data items.

1136
00:48:55,270 --> 00:48:56,980
So here's my array,
two-dimensional array.

1137
00:48:56,980 --> 00:48:58,540
So this is how I'm
trying to update.

1138
00:48:58,540 --> 00:49:00,170
I'm reading all these five.

1139
00:49:00,170 --> 00:49:01,760
So here's my iteration space.

1140
00:49:01,760 --> 00:49:03,050
So what I have looked at this.

1141
00:49:03,050 --> 00:49:04,590
I don't want to--

1142
00:49:04,590 --> 00:49:06,400
it's hard to give you
a 3D diagram.

1143
00:49:06,400 --> 00:49:08,190
I don't have a 3D projector.

1144
00:49:08,190 --> 00:49:11,160
So what I'm showing here is
three-dimension here.

1145
00:49:11,160 --> 00:49:13,850
So this is the previous
iteration, first iteration.

1146
00:49:13,850 --> 00:49:15,990
So if I still go tij.

1147
00:49:15,990 --> 00:49:19,800
So you go through t, and then
you go through i in here, and

1148
00:49:19,800 --> 00:49:21,560
then, when you're done, you
go to the [UNINTELLIGIBLE]

1149
00:49:21,560 --> 00:49:23,440
iteration and you go this way.

1150
00:49:23,440 --> 00:49:24,820
So here's how you
would iterate.

1151
00:49:24,820 --> 00:49:27,300
So you run this one, this one,
this one, this one, this one,

1152
00:49:27,300 --> 00:49:28,530
this one, this one, this
one, this one.

1153
00:49:28,530 --> 00:49:31,500
And then increase t by
1, and go like this.

1154
00:49:31,500 --> 00:49:35,370
And right now, we are here.

1155
00:49:35,370 --> 00:49:37,140
We are trying to update
this one.

1156
00:49:37,140 --> 00:49:39,080
That's what we are
trying to do.

1157
00:49:39,080 --> 00:49:42,570
And that means we are already
finished up to this point.

1158
00:49:42,570 --> 00:49:45,760
All these points are
finished up.

1159
00:49:45,760 --> 00:49:50,160
Now, what we have to do is
figure out when I'm reading,

1160
00:49:50,160 --> 00:49:53,480
who actually wrote this value.

1161
00:49:53,480 --> 00:49:53,920
OK?

1162
00:49:53,920 --> 00:49:56,830
First of all, let's figure out
which iterations might be able

1163
00:49:56,830 --> 00:49:58,430
to write this value.

1164
00:49:58,430 --> 00:50:04,460
So if you look at
this value, this

1165
00:50:04,460 --> 00:50:06,110
relationship in between here.

1166
00:50:06,110 --> 00:50:09,060
This one, basically, is ij.

1167
00:50:09,060 --> 00:50:11,030
And this is ij, ij, ij.

1168
00:50:11,030 --> 00:50:13,770
These three iterations
can write this one.

1169
00:50:13,770 --> 00:50:17,650
So and these iterations
can write this one.

1170
00:50:17,650 --> 00:50:19,770
Let me go to this one.

1171
00:50:19,770 --> 00:50:22,480
This is a pretty darn
complicated [UNINTELLIGIBLE].

1172
00:50:22,480 --> 00:50:26,650
So what that means is in
this one, this one

1173
00:50:26,650 --> 00:50:28,870
already wrote something.

1174
00:50:28,870 --> 00:50:30,990
This is what I'm reading
in here.

1175
00:50:30,990 --> 00:50:32,500
This one already wrote
something.

1176
00:50:32,500 --> 00:50:33,355
This is what I'm reading here.

1177
00:50:33,355 --> 00:50:34,390
This iteration wrote
something.

1178
00:50:34,390 --> 00:50:35,930
I read it here.

1179
00:50:35,930 --> 00:50:36,200
OK.

1180
00:50:36,200 --> 00:50:38,850
Everybody following so far?

1181
00:50:38,850 --> 00:50:40,170
How about this, guys?

1182
00:50:40,170 --> 00:50:42,665
Who wrote the value I am reading
in these iterations?

1183
00:50:47,530 --> 00:50:50,300
In this one, I haven't
reached there yet.

1184
00:50:50,300 --> 00:50:53,130
So who has written that?

1185
00:50:53,130 --> 00:50:56,812
So I assume this
is t equals 1.

1186
00:50:56,812 --> 00:50:57,630
[UNINTELLIGIBLE]

1187
00:50:57,630 --> 00:50:59,260
somebody has to write
those things.

1188
00:50:59,260 --> 00:51:03,150
So what that means is this also
wrote all of those values

1189
00:51:03,150 --> 00:51:04,520
because I have done
those iterations.

1190
00:51:04,520 --> 00:51:07,190
But the interesting thing is
some of these values got

1191
00:51:07,190 --> 00:51:07,630
overwritten.

1192
00:51:07,630 --> 00:51:09,650
This value got overwritten,
this value got overwritten.

1193
00:51:09,650 --> 00:51:13,310
So these two values disappear.

1194
00:51:13,310 --> 00:51:15,720
This value got overwritten
by this guy.

1195
00:51:15,720 --> 00:51:20,080
This value got overwritten
by this guy.

1196
00:51:20,080 --> 00:51:20,280
OK.

1197
00:51:20,280 --> 00:51:22,400
But we haven't overwritten this
value, this value, and

1198
00:51:22,400 --> 00:51:23,310
this value yet.

1199
00:51:23,310 --> 00:51:25,850
This one, basically,
I've just updated.

1200
00:51:25,850 --> 00:51:26,620
But I [UNINTELLIGIBLE]

1201
00:51:26,620 --> 00:51:28,894
this one.

1202
00:51:28,894 --> 00:51:30,676
Do you see this?

1203
00:51:30,676 --> 00:51:33,954
Is everybody following me?

1204
00:51:33,954 --> 00:51:34,906
AUDIENCE: Once again, sir.

1205
00:51:34,906 --> 00:51:36,810
I got lost.

1206
00:51:36,810 --> 00:51:40,630
So what are [INAUDIBLE]

1207
00:51:40,630 --> 00:51:42,640
PROFESSOR: So what happens is I
am trying to update in this

1208
00:51:42,640 --> 00:51:46,780
iteration because this array
gets written multiple times.

1209
00:51:46,780 --> 00:51:50,200
But in each iteration, you are
only doing one update.

1210
00:51:50,200 --> 00:51:52,220
So I am trying to read
and write in here.

1211
00:51:52,220 --> 00:51:54,380
So I need to read all
of these five

1212
00:51:54,380 --> 00:51:56,010
elements in this iteration.

1213
00:51:56,010 --> 00:51:58,450
So I want to figure out
who wrote that.

1214
00:51:58,450 --> 00:51:59,550
OK?

1215
00:51:59,550 --> 00:52:03,180
This one can be written by this
guy and this iteration.

1216
00:52:03,180 --> 00:52:05,440
Could this iteration write
its value in here?

1217
00:52:08,110 --> 00:52:09,000
OK?

1218
00:52:09,000 --> 00:52:09,420
[UNINTELLIGIBLE]

1219
00:52:09,420 --> 00:52:11,990
This iteration write because
we see it's writing ij.

1220
00:52:11,990 --> 00:52:16,890
I mean my diagram is not that
great because I have three in

1221
00:52:16,890 --> 00:52:18,660
here and five in here.

1222
00:52:18,660 --> 00:52:19,890
So just bear with me on that.

1223
00:52:19,890 --> 00:52:21,395
So assume I am writing
ij in here.

1224
00:52:24,760 --> 00:52:28,150
So my iterations go from 1 to n,
but my data goes from 0 to

1225
00:52:28,150 --> 00:52:29,260
n plus 1, basically.

1226
00:52:29,260 --> 00:52:30,990
1 to n minus 1 iterations.

1227
00:52:30,990 --> 00:52:32,270
0 to n plus 1 data.

1228
00:52:32,270 --> 00:52:35,160
So data is bigger than iteration
space because of

1229
00:52:35,160 --> 00:52:36,290
[UNINTELLIGIBLE].

1230
00:52:36,290 --> 00:52:40,740
So what happens is when I'm in
this iteration, we'll say this

1231
00:52:40,740 --> 00:52:44,770
is 1 2 iteration.

1232
00:52:44,770 --> 00:52:47,590
I will write this value.

1233
00:52:47,590 --> 00:52:52,370
This iteration will also
write this value.

1234
00:52:52,370 --> 00:52:53,075
OK?

1235
00:52:53,075 --> 00:52:54,730
You see that?

1236
00:52:54,730 --> 00:52:56,400
All of these iterations
are the same.

1237
00:52:56,400 --> 00:52:58,170
This iteration will also
write this value.

1238
00:52:58,170 --> 00:53:00,880
But right now, who is the
last guy who wrote it?

1239
00:53:00,880 --> 00:53:02,400
The last guy that wrote
it is this guy.

1240
00:53:05,125 --> 00:53:07,910
Because this iteration wrote
it, and after that, it got

1241
00:53:07,910 --> 00:53:09,060
overwritten by this one.

1242
00:53:09,060 --> 00:53:11,330
But this one hadn't occurred yet,
so it hadn't been overwritten

1243
00:53:11,330 --> 00:53:11,900
by this guy.

1244
00:53:11,900 --> 00:53:15,670
So the last guy who wrote
it was this one.

1245
00:53:15,670 --> 00:53:16,820
So that's why I had
to eliminate this.

1246
00:53:16,820 --> 00:53:19,630
But this data value--

1247
00:53:19,630 --> 00:53:21,400
I haven't executed this
iteration yet.

1248
00:53:21,400 --> 00:53:24,260
So nobody had written this
one in this time step.

1249
00:53:24,260 --> 00:53:26,310
So it has to be from the
previous time step.

1250
00:53:26,310 --> 00:53:32,300
So I read two values from the
current time step, three

1251
00:53:32,300 --> 00:53:33,870
values from the previous
time step.

1252
00:53:33,870 --> 00:53:35,120
These three values have
to come from the

1253
00:53:35,120 --> 00:53:35,980
previous time step.

1254
00:53:35,980 --> 00:53:38,670
These two values come from
the current time step.
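
The "who wrote this value" reasoning can be replayed mechanically. The sketch below walks the t, i, j loop nest, remembers the last iteration that wrote each array element, and stops at a chosen iteration to report who produced each of the five values it reads. The grid size and the name `last_writers` are assumptions for illustration.

```python
def last_writers(n, t_now, i_now, j_now):
    """Map each of the five cells read at (t_now, i_now, j_now) to its last writer."""
    writer = {}                         # (row, col) -> (t, i, j) of latest write
    for t in range(t_now + 1):
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if (t, i, j) == (t_now, i_now, j_now):
                    reads = [(i, j), (i - 1, j), (i, j - 1),
                             (i + 1, j), (i, j + 1)]
                    return {cell: writer.get(cell) for cell in reads}
                writer[(i, j)] = (t, i, j)   # iteration (t, i, j) writes A[i][j]

deps = last_writers(n=3, t_now=1, i_now=2, j_now=2)
current = [cell for cell, w in deps.items() if w[0] == 1]
previous = [cell for cell, w in deps.items() if w[0] == 0]
print(len(current), len(previous))
```

For the centre point of a 3x3 interior at t = 1, the cells above and to the left were written in the current time step, and the point itself plus the cells below and to the right still hold values from the previous one.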

1255
00:53:38,670 --> 00:53:39,830
You see that?

1256
00:53:39,830 --> 00:53:40,105
OK.

1257
00:53:40,105 --> 00:53:41,350
Good.

1258
00:53:41,350 --> 00:53:45,250
So what that means is because
dependence means--

1259
00:53:45,250 --> 00:53:47,500
OK.

1260
00:53:47,500 --> 00:53:50,455
This line, this dark, red line.

1261
00:53:50,455 --> 00:53:51,160
See.

1262
00:53:51,160 --> 00:53:54,975
I am reading a value in a
current iteration that was

1263
00:53:54,975 --> 00:53:56,070
written by this iteration.

1264
00:53:56,070 --> 00:53:58,210
So that means I have a
dependence between these two

1265
00:53:58,210 --> 00:54:00,670
iterations.

1266
00:54:00,670 --> 00:54:00,960
OK.

1267
00:54:00,960 --> 00:54:03,930
This line, this dark,
red line.

1268
00:54:03,930 --> 00:54:06,020
I am reading a value written
by this iteration.

1269
00:54:06,020 --> 00:54:07,870
So I have a dependency
in here.

1270
00:54:07,870 --> 00:54:10,070
This line means I have a
dependence between this

1271
00:54:10,070 --> 00:54:11,750
iteration to the current one.

1272
00:54:11,750 --> 00:54:13,310
This line means I have
dependence between this

1273
00:54:13,310 --> 00:54:14,540
iteration and the current one.

1274
00:54:14,540 --> 00:54:15,980
This line means I have
dependence between this

1275
00:54:15,980 --> 00:54:18,980
iteration and the current one.

1276
00:54:18,980 --> 00:54:20,230
You see that?

1277
00:54:22,480 --> 00:54:23,320
OK.

1278
00:54:23,320 --> 00:54:28,560
So now, I want to see how we
can parallelize this loop.

1279
00:54:28,560 --> 00:54:30,380
So what can I do?

1280
00:54:30,380 --> 00:54:32,200
So I look at all this
dependence.

1281
00:54:32,200 --> 00:54:34,720
At this point, I don't have to
think about all this where who

1282
00:54:34,720 --> 00:54:34,910
wrote what.

1283
00:54:34,910 --> 00:54:37,840
I can say this is dependence.

1284
00:54:37,840 --> 00:54:43,310
In order to do this equation,
all these iterations have to

1285
00:54:43,310 --> 00:54:47,440
be done because I am using the
values produced by them.

1286
00:54:47,440 --> 00:54:49,740
So these have to be finished
before I can patch that.

1287
00:54:49,740 --> 00:54:51,760
So the parallelism
means I tried to

1288
00:54:51,760 --> 00:54:54,430
do things in parallel.

1289
00:54:54,430 --> 00:54:56,900
So can we parallelize
this loop?

1290
00:55:00,120 --> 00:55:01,630
Can we run each time
step separately?

1291
00:55:05,110 --> 00:55:08,940
No because I am using these
three values from the previous

1292
00:55:08,940 --> 00:55:10,100
time step.

1293
00:55:10,100 --> 00:55:14,030
So I can't run this time step,
t equals 1, until t

1294
00:55:14,030 --> 00:55:15,670
equals 0 is done.

1295
00:55:15,670 --> 00:55:16,460
Or t plus 2 [UNINTELLIGIBLE]

1296
00:55:16,460 --> 00:55:18,700
t plus 1 is done.

1297
00:55:18,700 --> 00:55:19,030
OK?

1298
00:55:19,030 --> 00:55:22,690
So I can't parallelize
this loop.

1299
00:55:22,690 --> 00:55:22,930
OK.

1300
00:55:22,930 --> 00:55:26,320
Can I parallelize this loop?

1301
00:55:26,320 --> 00:55:27,850
Why?

1302
00:55:27,850 --> 00:55:31,000
Will dependence stop me from
parallelizing this one?

1303
00:55:36,070 --> 00:55:37,710
So I'm looking at i.

1304
00:55:37,710 --> 00:55:40,180
This is my i dimension.

1305
00:55:40,180 --> 00:55:41,720
How many lines, at
least, tell me.

1306
00:55:41,720 --> 00:55:43,370
How many dependencies
are going to stop

1307
00:55:43,370 --> 00:55:46,850
me from doing that?

1308
00:55:46,850 --> 00:55:47,160
OK.

1309
00:55:47,160 --> 00:55:47,510
Good.

1310
00:55:47,510 --> 00:55:48,180
I have [UNINTELLIGIBLE].

1311
00:55:48,180 --> 00:55:49,290
Somebody says three.

1312
00:55:49,290 --> 00:55:50,750
Somebody says one.

1313
00:55:50,750 --> 00:55:51,030
OK.

1314
00:55:51,030 --> 00:55:51,930
Let's get a vote.

1315
00:55:51,930 --> 00:55:53,545
How many people think
it's three?

1316
00:55:56,390 --> 00:55:56,630
OK.

1317
00:55:56,630 --> 00:55:58,760
There's one vote for three.

1318
00:55:58,760 --> 00:56:00,060
How many people think
it's three?

1319
00:56:00,060 --> 00:56:01,310
How many people think
it's one?

1320
00:56:03,770 --> 00:56:04,530
Wait a minute.

1321
00:56:04,530 --> 00:56:08,100
One vote for three and
two votes for one?

1322
00:56:08,100 --> 00:56:08,350
OK.

1323
00:56:08,350 --> 00:56:10,100
Where's the rest?

1324
00:56:10,100 --> 00:56:10,930
For two?

1325
00:56:10,930 --> 00:56:12,110
For 0?

1326
00:56:12,110 --> 00:56:13,480
Can't be 0 if the
0 is parallel.

1327
00:56:13,480 --> 00:56:14,300
OK.

1328
00:56:14,300 --> 00:56:16,150
So we'll start parallelizing.

1329
00:56:16,150 --> 00:56:16,750
OK.

1330
00:56:16,750 --> 00:56:18,960
So what happens in here?

1331
00:56:18,960 --> 00:56:21,210
Right now, this is
actually one.

1332
00:56:21,210 --> 00:56:22,660
This one.

1333
00:56:22,660 --> 00:56:26,900
Because these things don't
participate because this has

1334
00:56:26,900 --> 00:56:28,350
already happened.

1335
00:56:28,350 --> 00:56:33,030
When you go to ij iterations,
these are already done.

1336
00:56:33,030 --> 00:56:34,170
So you're going from t.

1337
00:56:34,170 --> 00:56:36,160
So you're looking at the current
iterations because

1338
00:56:36,160 --> 00:56:38,680
you're in the inner two loops.

1339
00:56:38,680 --> 00:56:39,740
So the t is done.

1340
00:56:39,740 --> 00:56:41,720
So these all are already
done when you go try

1341
00:56:41,720 --> 00:56:42,490
to parallelize sides.

1342
00:56:42,490 --> 00:56:44,740
So I don't have to worry
about these three.

1343
00:56:44,740 --> 00:56:48,150
In here because actually I'm
losing t of something here, I

1344
00:56:48,150 --> 00:56:51,480
am in trouble.

1345
00:56:51,480 --> 00:56:56,410
So when you go look at this
one, I have this one.

1346
00:56:56,410 --> 00:56:59,020
So every dimension has
a dependence in here.

1347
00:56:59,020 --> 00:57:00,750
So I can't run it in parallel.

1348
00:57:00,750 --> 00:57:02,740
So does this mean that there's
no parallelism?

1349
00:57:08,160 --> 00:57:09,410
Who thinks there's
no parallelism?

1350
00:57:12,410 --> 00:57:13,340
Who thinks there is?

1351
00:57:13,340 --> 00:57:14,370
Oh, somebody thinks there's
no parallelism.

1352
00:57:14,370 --> 00:57:16,680
Who thinks there's
parallelism?

1353
00:57:16,680 --> 00:57:17,170
OK.

1354
00:57:17,170 --> 00:57:18,080
More people think there's
parallelism.

1355
00:57:18,080 --> 00:57:20,090
Let's see what we can do.

1356
00:57:20,090 --> 00:57:21,889
Question?

1357
00:57:21,889 --> 00:57:24,354
AUDIENCE: Do you really
think [INAUDIBLE]

1358
00:57:34,214 --> 00:57:36,679
I'm trying to figure out
how to word this.

1359
00:57:36,679 --> 00:57:38,158
Do you really want to have

1360
00:57:38,158 --> 00:57:39,144
dependence on the same concept?

1361
00:57:39,144 --> 00:57:40,394
[INAUDIBLE]?

1362
00:57:43,620 --> 00:57:43,880
PROFESSOR: Yeah.

1363
00:57:43,880 --> 00:57:45,730
I mean you can do--

1364
00:57:45,730 --> 00:57:48,810
this is the way this SOR
is sitting so there's a

1365
00:57:48,810 --> 00:57:50,420
dependence between time steps.

1366
00:57:50,420 --> 00:57:51,750
There's another SOR.

1367
00:57:51,750 --> 00:57:53,380
What they do is kind
of a red-black scheme.

1368
00:57:53,380 --> 00:57:56,400
So when you calculate the next
time step, you calculate it

1369
00:57:56,400 --> 00:57:57,760
into a completely
new array.

1370
00:57:57,760 --> 00:57:58,560
So there's no dependence.

1371
00:57:58,560 --> 00:58:00,900
So that's a different
algorithm.

1372
00:58:00,900 --> 00:58:03,800
This algorithm, basically,
uses some value from

1373
00:58:03,800 --> 00:58:05,730
[UNINTELLIGIBLE] because the
value-- the algorithm you're

1374
00:58:05,730 --> 00:58:06,985
talking-- you already created
the other copies.

1375
00:58:06,985 --> 00:58:07,890
You had two copies.

1376
00:58:07,890 --> 00:58:09,170
You're bouncing back
and forth.

1377
00:58:09,170 --> 00:58:09,530
Nice.

1378
00:58:09,530 --> 00:58:11,100
No real problem in here.

1379
00:58:11,100 --> 00:58:12,760
But then you had to have twice
the amount of storage.

1380
00:58:12,760 --> 00:58:14,320
Here, you are updating in place.

1381
00:58:14,320 --> 00:58:17,740
And since this is kind of
running enough iterations

1382
00:58:17,740 --> 00:58:23,190
until it converges, it doesn't
seem to matter that the

1383
00:58:23,190 --> 00:58:24,440
[UNINTELLIGIBLE PHRASE].
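[Editor's sketch] The trade-off described here, two copies with no dependence versus one array updated in place, can be illustrated as follows. The four-neighbor average is an assumed stand-in for the stencil on the slide, and the names sweep_two_copies and sweep_in_place are hypothetical.

```python
def sweep_two_copies(old):
    """One time step that reads only `old` and writes into a second array."""
    n = len(old)
    new = [row[:] for row in old]  # the second copy: twice the storage
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # Every read comes from the previous time step, so all
            # (i, j) iterations are independent of each other.
            new[i][j] = (old[i - 1][j] + old[i + 1][j]
                         + old[i][j - 1] + old[i][j + 1]) / 4.0
    return new


def sweep_in_place(a):
    """One time step that overwrites `a` while it is still being read."""
    n = len(a)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # a[i-1][j] and a[i][j-1] were already overwritten in this
            # sweep, so there is a dependence on both the i and j loops.
            a[i][j] = (a[i - 1][j] + a[i + 1][j]
                       + a[i][j - 1] + a[i][j + 1]) / 4.0
    return a
```

The two-copy form is often called a Jacobi-style update. Its interior points can be computed in any order, which is why it parallelizes directly, at the cost of the extra array.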

1384
00:58:26,520 --> 00:58:27,280
OK.

1385
00:58:27,280 --> 00:58:32,380
So we cannot find a loop,
what we call a doall loop.

1386
00:58:32,380 --> 00:58:35,380
The doall loop means there's no
loop-carried dependences.

1387
00:58:35,380 --> 00:58:36,930
It's fully parallel.

1388
00:58:36,930 --> 00:58:38,900
This is the best case.

1389
00:58:38,900 --> 00:58:41,470
So what happens is when
you get there,

1390
00:58:41,470 --> 00:58:42,620
everybody can run parallel.

1391
00:58:42,620 --> 00:58:44,770
And when you're done, you can
stop and then do that.

1392
00:58:44,770 --> 00:58:46,540
So this is the doall loop.

1393
00:58:46,540 --> 00:58:47,560
Of course, there's
no doall loop.

1394
00:58:47,560 --> 00:58:48,830
We can look at every
dimension.

1395
00:58:48,830 --> 00:58:50,290
We had some kind
of dependence.

1396
00:58:50,290 --> 00:58:53,430
So there's another choice, what
we call a doacross loop.

1397
00:58:53,430 --> 00:58:57,200
What that means is we have some
loop-carried dependence.

1398
00:58:57,200 --> 00:58:58,760
There's something I
have to use from

1399
00:58:58,760 --> 00:59:00,320
the previous iteration.

1400
00:59:00,320 --> 00:59:01,910
But it's only one thing.

1401
00:59:01,910 --> 00:59:05,150
I have a lot of other things I
can run around that only I

1402
00:59:05,150 --> 00:59:06,190
just have to wait for one thing.

1403
00:59:06,190 --> 00:59:06,820
One is done.

1404
00:59:06,820 --> 00:59:08,190
I can just keep running.

1405
00:59:08,190 --> 00:59:11,020
And if I calculate and send this
one early, then I can do

1406
00:59:11,020 --> 00:59:13,250
my other calculations later.

1407
00:59:13,250 --> 00:59:14,160
This is not that great.

1408
00:59:14,160 --> 00:59:15,820
If you look at the
difference here.

1409
00:59:15,820 --> 00:59:19,350
This definitely has very little
overhead in here.

1410
00:59:19,350 --> 00:59:21,330
This can run slow.

1411
00:59:21,330 --> 00:59:23,050
And of course, this thing
gets produced very late.

1412
00:59:23,050 --> 00:59:24,790
It's [? almost ?] sequential.

1413
00:59:24,790 --> 00:59:27,890
So I hope you can just-- if the
other guy wants something,

1414
00:59:27,890 --> 00:59:29,810
I can immediately send
it very early.

1415
00:59:29,810 --> 00:59:31,560
And then I can run there.

1416
00:59:31,560 --> 00:59:35,600
So you can get some kind of
doacross patterns in here.

1417
00:59:35,600 --> 00:59:37,690
So if you want to
do this one--

1418
00:59:37,690 --> 00:59:39,710
this is a little bit
crazy in here.

1419
00:59:39,710 --> 00:59:41,580
But they'll do it in here.

1420
00:59:41,580 --> 00:59:44,260
And so what first we have to
do is you have to say, OK.

1421
00:59:44,260 --> 00:59:44,640
Look.

1422
00:59:44,640 --> 00:59:48,570
I'm running this loop, the
i loop in parallel.

1423
00:59:48,570 --> 00:59:52,410
But I have to exchange
some data.

1424
00:59:52,410 --> 00:59:55,730
Before I want to run this one,
I have to basically get the

1425
00:59:55,730 --> 00:59:58,060
previous i value produced.

1426
00:59:58,060 --> 01:00:00,620
And when it's done, I can say
the next guy can use it.
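[Editor's sketch] The wait-then-post pattern described here can be illustrated with one thread per i iteration. This is not the code on the slide: the recurrence a[i][j] = a[i-1][j] + a[i][j-1] is an assumed stand-in for the real kernel, and doacross_rows and sequential_rows are hypothetical names. Python threads give no real speedup here; the point is only the synchronization.

```python
import threading


def sequential_rows(n, m):
    """Reference: the plain sequential loop nest."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def doacross_rows(n, m):
    """Run the i loop as a doacross loop, one thread per value of i."""
    a = [[1] * m for _ in range(n)]
    # ready[i][j] is set once a[i][j] holds its final value.
    ready = [[threading.Event() for _ in range(m)] for _ in range(n)]
    for j in range(m):
        ready[0][j].set()  # row 0 is input data, available from the start

    def run_row(i):
        for j in range(1, m):
            ready[i - 1][j].wait()  # wait for the value from iteration i-1
            a[i][j] = a[i - 1][j] + a[i][j - 1]
            ready[i][j].set()       # post: iteration i+1 may now use it

    threads = [threading.Thread(target=run_row, args=(i,)) for i in range(1, n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a
```

Each row posts a value as soon as it is computed, so row i+1 can start on column 1 while row i is still working on later columns. The iterations overlap in a pipeline rather than running fully in parallel.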

1427
01:00:00,620 --> 01:00:02,490
So this is a very
complicated one.

1428
01:00:02,490 --> 01:00:05,430
I don't want you to understand
it too well.

1429
01:00:05,430 --> 01:00:09,210
So the reason I put it is to
show that OK, if you want to

1430
01:00:09,210 --> 01:00:12,080
spend a week trying to really
code this up and understand

1431
01:00:12,080 --> 01:00:14,230
and make sure that
it works OK.

1432
01:00:14,230 --> 01:00:16,930
So you can do things
like that.

1433
01:00:16,930 --> 01:00:17,910
OK?

1434
01:00:17,910 --> 01:00:18,570
Aha.

1435
01:00:18,570 --> 01:00:21,410
So this is the true
voodooness.

1436
01:00:21,410 --> 01:00:23,170
OK.

1437
01:00:23,170 --> 01:00:28,150
AUDIENCE: So in Cilk, if you
do this with divide and

1438
01:00:28,150 --> 01:00:34,400
conquer, you can make it be what
I called in the tableau

1439
01:00:34,400 --> 01:00:35,760
construction.

1440
01:00:35,760 --> 01:00:38,820
Each layer here is basically
constructing a tableau.

1441
01:00:38,820 --> 01:00:41,078
And so if you do it with divide
and conquer, you can do

1442
01:00:41,078 --> 01:00:44,670
it with a very simple
recursive code.

1443
01:00:44,670 --> 01:00:49,160
But you can also do it with a
loop that goes diagonally.

1444
01:00:49,160 --> 01:00:49,260
AUDIENCE: [INTERPOSING VOICES]

1445
01:00:49,260 --> 01:00:49,390
PROFESSOR: Yes.

1446
01:00:49,390 --> 01:00:50,460
I'm going to get that.

1447
01:00:50,460 --> 01:00:52,690
That's next.

1448
01:00:52,690 --> 01:00:53,770
AUDIENCE: Sorry.

1449
01:00:53,770 --> 01:00:54,540
PROFESSOR: That's OK.

1450
01:00:54,540 --> 01:01:01,080
So the reason that I'm showing
that is because this class is

1451
01:01:01,080 --> 01:01:05,210
not just about how to make the
cores exactly run faster.

1452
01:01:05,210 --> 01:01:07,630
Think about algorithmic issues
and stuff like that.

1453
01:01:07,630 --> 01:01:10,670
So sometimes, when you look at
a problem, it looks crazy.

1454
01:01:10,670 --> 01:01:13,240
And there might be some changes
you can do that you

1455
01:01:13,240 --> 01:01:16,590
can get to run things
in parallel.

1456
01:01:16,590 --> 01:01:18,890
So I'm actually doing
not diagonal.

1457
01:01:18,890 --> 01:01:21,030
I'm actually doing something
very simple.

1458
01:01:21,030 --> 01:01:26,120
So what I have done here
is I have all these

1459
01:01:26,120 --> 01:01:27,120
dependences in here.

1460
01:01:27,120 --> 01:01:27,660
OK?

1461
01:01:27,660 --> 01:01:34,380
So the problem here is I can't
find a single [UNINTELLIGIBLE]

1462
01:01:34,380 --> 01:01:37,390
that basically has
no crossing.

1463
01:01:37,390 --> 01:01:39,500
But if you look at this
[UNINTELLIGIBLE]

1464
01:01:39,500 --> 01:01:41,140
diagonal here.

1465
01:01:41,140 --> 01:01:44,780
What you see is, in fact,
there's nothing that crosses

1466
01:01:44,780 --> 01:01:46,030
the diagonal.

1467
01:01:48,370 --> 01:01:48,880
OK?

1468
01:01:48,880 --> 01:01:51,600
So this one basically
doesn't depend on

1469
01:01:51,600 --> 01:01:52,790
this one or this one.

1470
01:01:52,790 --> 01:01:54,480
It only depends on
the previous one.

1471
01:01:54,480 --> 01:01:57,230
So I can run everything in
the diagonal parallel in

1472
01:01:57,230 --> 01:01:58,560
here in this one.

1473
01:01:58,560 --> 01:02:01,247
So of course, I can't write
anything [UNINTELLIGIBLE] in

1474
01:02:01,247 --> 01:02:03,260
here, but there's a cute
trick you can do.

1475
01:02:03,260 --> 01:02:06,400
What you can do is you
can take iteration

1476
01:02:06,400 --> 01:02:07,910
space and skew it.

1477
01:02:10,620 --> 01:02:15,300
So what I have done is now
instead of the same thing,

1478
01:02:15,300 --> 01:02:17,851
instead of this being a square,
now I skewed it a

1479
01:02:17,851 --> 01:02:20,250
little bit.

1480
01:02:20,250 --> 01:02:20,790
OK?

1481
01:02:20,790 --> 01:02:28,730
So what that means is when I'm
running first i, I basically

1482
01:02:28,730 --> 01:02:29,530
don't run any here.

1483
01:02:29,530 --> 01:02:32,380
Then, I run this one and
this iteration here.

1484
01:02:32,380 --> 01:02:34,220
So what I have done is I have
kind of moved my iteration

1485
01:02:34,220 --> 01:02:35,090
space around.

1486
01:02:35,090 --> 01:02:38,810
Do you see how this might be?

1487
01:02:38,810 --> 01:02:42,900
So now, the interesting thing
is when I skew, if I look at

1488
01:02:42,900 --> 01:02:56,635
this line, I can parallelize
in this one because all the

1489
01:02:56,635 --> 01:02:58,490
dependences come from the
previous iteration.

1490
01:02:58,490 --> 01:02:59,150
Am I right?

1491
01:02:59,150 --> 01:03:00,400
[UNINTELLIGIBLE]

1492
01:03:03,532 --> 01:03:04,000
Yeah.

1493
01:03:04,000 --> 01:03:05,250
I skewed it.

1494
01:03:10,864 --> 01:03:16,440
Yes, everything in here, these
ones are parallel.

1495
01:03:16,440 --> 01:03:16,990
OK?

1496
01:03:16,990 --> 01:03:21,070
And any dependence comes from
the previous iteration.

1497
01:03:21,070 --> 01:03:22,670
There's no current iteration
in here.

1498
01:03:22,670 --> 01:03:24,580
Everything in this
one is parallel.

1499
01:03:24,580 --> 01:03:26,960
So I can parallelize this.

1500
01:03:26,960 --> 01:03:29,020
So this one doesn't depend
on this one or this one.

1501
01:03:29,020 --> 01:03:32,000
So this is all parallel.
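[Editor's sketch] One way to check the claim that, after skewing, every dependence comes from the previous iteration is to transform the dependence distance vectors. The vectors (1, 0) and (0, 1) and the skew (i, j) -> (i + j, j) are assumptions chosen to match a picture where each point depends on its neighbors at i-1 and j-1. The function names are hypothetical.

```python
def skew(vec):
    """Apply the skew (i, j) -> (i + j, j) to a dependence distance vector."""
    di, dj = vec
    return (di + dj, dj)


def inner_loop_is_parallel(vectors):
    """The inner loop is parallel when every dependence is carried by the
    outer loop, i.e. the first component of each distance vector is > 0."""
    return all(d[0] > 0 for d in vectors)


# Before skewing: (0, 1) stays inside one outer iteration and ties
# consecutive inner iterations together.
original = [(1, 0), (0, 1)]

# After skewing: both vectors have a positive first component.
skewed = [skew(v) for v in original]
```

With these assumed vectors, inner_loop_is_parallel(original) is false and inner_loop_is_parallel(skewed) is true, which is the situation described: nothing depends on the current outer iteration.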

1502
01:03:32,000 --> 01:03:33,570
This is a little bit
more complicated.

1503
01:03:33,570 --> 01:03:36,990
So if you're interested
to go deep, just go

1504
01:03:36,990 --> 01:03:38,005
stare at the slides.

1505
01:03:38,005 --> 01:03:40,460
I have the slides out there to
understand how that happens.

1506
01:03:40,460 --> 01:03:43,170
So if you think about what I'm
running here in parallel is

1507
01:03:43,170 --> 01:03:47,660
the one basically this
diagonal in here.

1508
01:03:47,660 --> 01:03:50,870
So what happens is if you run
this, this, and this parallel,

1509
01:03:50,870 --> 01:03:51,690
there's no dependence.

1510
01:03:51,690 --> 01:03:54,600
I don't need this one or this
one to run this one.

1511
01:03:54,600 --> 01:03:56,550
So I can run this, this,
this, this, all

1512
01:03:56,550 --> 01:03:58,010
this diagonal in parallel.

1513
01:03:58,010 --> 01:03:59,680
But the trouble with just the
diagonal is I don't have a

1514
01:03:59,680 --> 01:04:01,850
place in here to say
[UNINTELLIGIBLE]

1515
01:04:01,850 --> 01:04:02,560
for a diagonal.

1516
01:04:02,560 --> 01:04:04,940
So I basically skewed
it and then made a

1517
01:04:04,940 --> 01:04:06,760
diagonal into one loop.

1518
01:04:06,760 --> 01:04:12,780
So then, now what happens
is basically j

1519
01:04:12,780 --> 01:04:15,850
loop I can run parallel.

1520
01:04:15,850 --> 01:04:17,800
This one.
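[Editor's sketch] A skewed loop nest of this shape might look like the following, with the outer loop walking the diagonals d = i + j and the inner j loop covering one diagonal. The recurrence is again an assumed stand-in for the slide's kernel, and the names skewed_sweep, sequential_sweep and reverse_inner are hypothetical.

```python
def sequential_sweep(n, m):
    """Reference: original loop nest, dependences on both i and j."""
    a = [[1] * m for _ in range(n)]
    for i in range(1, n):
        for j in range(1, m):
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a


def skewed_sweep(n, m, reverse_inner=False):
    """Same computation after skewing: outer loop over d = i + j."""
    a = [[1] * m for _ in range(n)]
    for d in range(2, (n - 1) + (m - 1) + 1):  # sequential: one diagonal at a time
        lo = max(1, d - (n - 1))               # keeps i = d - j <= n - 1
        hi = min(m - 1, d - 1)                 # keeps i = d - j >= 1
        js = range(lo, hi + 1)
        if reverse_inner:
            js = reversed(js)                  # any order gives the same result
        for j in js:                           # independent: could be a parallel loop
            i = d - j
            # Both operands lie on diagonal d - 1, which is already finished.
            a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a
```

Running the inner loop backwards and getting the same array is a cheap sanity check that its iterations do not depend on each other. The max and min in the bounds are the price of skewing, since the diagonals have different lengths.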

1521
01:04:17,800 --> 01:04:20,450
So I can do it four
[UNINTELLIGIBLE] four.

1522
01:04:20,450 --> 01:04:21,070
OK?

1523
01:04:21,070 --> 01:04:27,910
So here's something you found
a problem that has no nice

1524
01:04:27,910 --> 01:04:28,200
parallelism.

1525
01:04:28,200 --> 01:04:31,370
But you realize there's kind of
a what you call a wavefront

1526
01:04:31,370 --> 01:04:32,550
going on here.

1527
01:04:32,550 --> 01:04:33,530
Wave going on here.

1528
01:04:33,530 --> 01:04:36,870
So not the given dimension, but
there's another dimension

1529
01:04:36,870 --> 01:04:37,690
that you can parallelize.

1530
01:04:37,690 --> 01:04:41,090
So you kind of skewed your
space to get that nice

1531
01:04:41,090 --> 01:04:41,690
[UNINTELLIGIBLE] line.

1532
01:04:41,690 --> 01:04:44,210
And you run parallel.

1533
01:04:44,210 --> 01:04:47,100
So that's all I have for today.