1 00:00:00,030 --> 00:00:02,420 The following content is provided under a Creative 2 00:00:02,420 --> 00:00:03,850 Commons license. 3 00:00:03,850 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue to 4 00:00:06,860 --> 00:00:10,550 offer high quality educational resources for free. 5 00:00:10,550 --> 00:00:13,420 To make a donation or view additional materials from 6 00:00:13,420 --> 00:00:17,510 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,510 --> 00:00:18,760 ocw.mit.edu. 8 00:00:21,140 --> 00:00:23,450 PROFESSOR: So we'll get started. 9 00:00:23,450 --> 00:00:27,240 So today we are going to dive into some parallel 10 00:00:27,240 --> 00:00:28,610 architectures. 11 00:00:28,610 --> 00:00:36,070 So the way, if you look at the big world, is there's -- 12 00:00:36,070 --> 00:00:39,370 just counting parallelism, you can do it implicitly, either 13 00:00:39,370 --> 00:00:40,770 by hardware or the compiler. 14 00:00:40,770 --> 00:00:42,220 So the user won't see it. 15 00:00:42,220 --> 00:00:44,870 It will be done behind the user's back, but can be done 16 00:00:44,870 --> 00:00:46,070 by hardware or compiler. 17 00:00:46,070 --> 00:00:48,980 Or explicitly, visible to the user. 18 00:00:48,980 --> 00:00:53,440 So the hardware part is done in superscalar processors, and 19 00:00:53,440 --> 00:00:55,550 all those things will have explicitly parallel 20 00:00:55,550 --> 00:00:56,590 architecture. 21 00:00:56,590 --> 00:01:02,010 So what I am going to do is spend some time just talking 22 00:01:02,010 --> 00:01:05,320 about implicitly parallel superscalar processors. 23 00:01:05,320 --> 00:01:08,290 Because probably the entire time you guys were born till 24 00:01:08,290 --> 00:01:11,220 now, this has been the mainstream, people are 25 00:01:11,220 --> 00:01:13,220 building these things, and we are used to it. 26 00:01:13,220 --> 00:01:14,980 And now we are kind of doing a switch. 
27 00:01:14,980 --> 00:01:17,480 Then we'll go into explicit parallelism processors and 28 00:01:17,480 --> 00:01:20,310 kind of look at different types in there, and get a feel 29 00:01:20,310 --> 00:01:23,270 for the big picture. 30 00:01:23,270 --> 00:01:26,140 So let's start at implicitly parallel superscalar 31 00:01:26,140 --> 00:01:27,350 processors. 32 00:01:27,350 --> 00:01:29,160 So there are two types of superscalar processors. 33 00:01:29,160 --> 00:01:31,140 One is what we call statically scheduled. 34 00:01:31,140 --> 00:01:34,330 Those are kind of simpler ones, where you use compiler 35 00:01:34,330 --> 00:01:36,620 techniques to figure out where the parallelism is. 36 00:01:36,620 --> 00:01:40,280 And what happens is the computer keeps executing, 37 00:01:40,280 --> 00:01:42,600 instead of one instruction at a time, the few instructions 38 00:01:42,600 --> 00:01:44,440 next to each other in one bunch. 39 00:01:44,440 --> 00:01:47,180 Like a bundle after bundle type thing. 40 00:01:47,180 --> 00:01:49,570 On the other hand, dynamically scheduled processors -- 41 00:01:49,570 --> 00:01:52,260 things like the current Pentiums -- are a lot more 42 00:01:52,260 --> 00:01:52,850 complicated. 43 00:01:52,850 --> 00:01:55,540 They have to extract instruction level parallelism. 44 00:01:55,540 --> 00:01:58,020 ILP doesn't mean integer linear programming, it's 45 00:01:58,020 --> 00:02:00,290 instruction level parallelism. 46 00:02:00,290 --> 00:02:03,840 Schedule them as soon as operands become available, 47 00:02:03,840 --> 00:02:06,440 when the data is able to run these instructions. 48 00:02:06,440 --> 00:02:08,520 Then there's just a bunch of things that get at more 49 00:02:08,520 --> 00:02:10,610 parallelism, things like rename registers to eliminate 50 00:02:10,610 --> 00:02:11,840 some dependences. 51 00:02:11,840 --> 00:02:14,170 You execute things out of order. 
52 00:02:14,170 --> 00:02:16,580 If a later instruction's operands become available 53 00:02:16,580 --> 00:02:19,300 early, you'll get those things done instead of waiting. 54 00:02:19,300 --> 00:02:20,770 You can speculatively execute. 55 00:02:20,770 --> 00:02:23,430 I'll go through a little bit in detail to kind of explain 56 00:02:23,430 --> 00:02:24,680 what these things might be. 57 00:02:27,774 --> 00:02:29,680 Why is this not going down. 58 00:02:29,680 --> 00:02:32,180 Oops. 59 00:02:32,180 --> 00:02:36,540 So if you look at a normal pipeline. 60 00:02:36,540 --> 00:02:40,230 So this is a 004 type pipeline. 61 00:02:40,230 --> 00:02:43,790 What I have is a very simplistic four 62 00:02:43,790 --> 00:02:45,210 stage pipeline in there. 63 00:02:45,210 --> 00:02:48,050 So a normal microprocessor, a single-issue, will do 64 00:02:48,050 --> 00:02:50,080 something like this. 65 00:02:50,080 --> 00:02:52,050 And if you look at it, there's still a little bit of 66 00:02:52,050 --> 00:02:53,680 parallelism here. 67 00:02:53,680 --> 00:02:56,320 Because you don't wait till the first thing finishes to go 68 00:02:56,320 --> 00:02:58,250 to the second thing. 69 00:02:58,250 --> 00:03:01,670 If you look at a superscalar, you have something like this. 70 00:03:01,670 --> 00:03:03,760 This is an in-order superscalar. 71 00:03:03,760 --> 00:03:07,020 What happens is in every cycle instead of doing one, you 72 00:03:07,020 --> 00:03:10,360 fetch two, you decode two, you execute two, and 73 00:03:10,360 --> 00:03:12,080 so on and so forth. 74 00:03:12,080 --> 00:03:15,070 In an out-of-order superscalar, these are not 75 00:03:15,070 --> 00:03:16,670 going in these very nice boundaries. 76 00:03:16,670 --> 00:03:19,930 You have a fetch unit that fetches like hundreds ahead, 77 00:03:19,930 --> 00:03:22,580 and it keeps issuing as soon as things are fetched and 78 00:03:22,580 --> 00:03:24,540 decoded to the execute unit. 
79 00:03:24,540 --> 00:03:26,600 And it's a lot more of a complex picture in there. 80 00:03:26,600 --> 00:03:29,260 I'm not going to show too much of the picture there, because 81 00:03:29,260 --> 00:03:31,830 it's a very complicated thing. 82 00:03:31,830 --> 00:03:35,910 So the first thing the processor has to do is, it has 83 00:03:35,910 --> 00:03:38,890 to look for true data dependences. 84 00:03:38,890 --> 00:03:42,300 True data dependence says that this instruction in fact is 85 00:03:42,300 --> 00:03:45,990 using something produced by the previous guy. 86 00:03:45,990 --> 00:03:50,395 So this is important because if the two instructions are 87 00:03:50,395 --> 00:03:53,410 data dependent, they cannot be executed simultaneously. 88 00:03:53,410 --> 00:03:54,590 You have to wait till the first guy finishes to 89 00:03:54,590 --> 00:03:55,470 get the second guy. 90 00:03:55,470 --> 00:03:58,480 It cannot be completely overlapped, and you can't 91 00:03:58,480 --> 00:03:59,910 execute them out of order. 92 00:03:59,910 --> 00:04:01,930 You have to make sure the data comes in before you 93 00:04:01,930 --> 00:04:03,180 actually use it. 94 00:04:05,190 --> 00:04:09,360 In computer architecture jargon, this is called a 95 00:04:09,360 --> 00:04:10,340 pipeline hazard. 96 00:04:10,340 --> 00:04:12,120 And this is called a Read After Write 97 00:04:12,120 --> 00:04:13,750 hazard, or RAW hazard. 98 00:04:13,750 --> 00:04:18,370 What that means is that the write has to finish before you 99 00:04:18,370 --> 00:04:20,800 can do the read. 100 00:04:20,800 --> 00:04:23,660 In a microprocessor, people try very hard to minimize the 101 00:04:23,660 --> 00:04:26,490 time you have to wait to do that, and you really have to 102 00:04:26,490 --> 00:04:27,740 honor that. 103 00:04:32,780 --> 00:04:34,550 In hardware/software what you have to do is you have to 104 00:04:34,550 --> 00:04:38,330 preserve this program ordering. 
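[A read-after-write dependence like the one just described can be sketched in a few lines of Python. This is an editor's illustration, not code from the lecture; the register names are made up.]

```python
# True (read-after-write) data dependence: i2 reads r1, which i1 writes,
# so i2 cannot run before i1 or fully overlap with it.
regs = {"r2": 3, "r3": 4}

regs["r1"] = regs["r2"] + regs["r3"]   # i1: r1 = r2 + r3  (writes r1)
regs["r4"] = regs["r1"] * 2            # i2: r4 = r1 * 2   (reads r1)

# Swapping the two lines would make i2 read r1 before it exists --
# exactly the RAW hazard the hardware has to honor.
assert regs["r4"] == 14
```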
105 00:04:38,330 --> 00:04:41,820 The program has to be executed sequentially, determined by 106 00:04:41,820 --> 00:04:42,630 the source program. 107 00:04:42,630 --> 00:04:44,560 So if the source program says some order of doing things, 108 00:04:44,560 --> 00:04:44,900 you better -- 109 00:04:44,900 --> 00:04:46,330 if there's some reason for doing that, you better 110 00:04:46,330 --> 00:04:48,200 actually adhere to that order. 111 00:04:48,200 --> 00:04:51,390 You can't go and just do things in a haphazard way. 112 00:04:51,390 --> 00:04:55,410 And dependences are basically a fact of the 113 00:04:55,410 --> 00:04:56,940 program, so that's what you've got. 114 00:04:56,940 --> 00:04:58,570 If you're lucky you'll get a program without too many 115 00:04:58,570 --> 00:05:01,160 dependences, but most probably you'll get programs that have 116 00:05:01,160 --> 00:05:02,010 a lot of dependences. 117 00:05:02,010 --> 00:05:03,260 That's normal. 118 00:05:06,050 --> 00:05:08,170 There's a lot of importance of the data dependence. 119 00:05:08,170 --> 00:05:10,120 It indicates the possibility of these hazards, how these 120 00:05:10,120 --> 00:05:11,590 dependences have to work. 121 00:05:11,590 --> 00:05:14,020 And it determines the order in which the results might be 122 00:05:14,020 --> 00:05:18,755 calculated, because if you need the result of that to do 123 00:05:18,755 --> 00:05:21,700 the next, you have what you call a dependency chain. 124 00:05:21,700 --> 00:05:23,730 And you have to execute that in that order. 125 00:05:23,730 --> 00:05:26,930 And because of the dependency chain, it sets an upper bound 126 00:05:26,930 --> 00:05:29,980 of how much parallelism that can be possibly expected. 
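[The dependency-chain bound on parallelism can be made concrete with a small sketch. This is an editor's illustration with made-up instruction lists, not from the lecture: the longest chain of true dependences sets the minimum number of cycles even on a machine with unlimited issue width.]

```python
# Sketch: the critical path through the true dependences bounds the
# best case, no matter how wide the hardware is.
def critical_path(deps):
    """deps[i] = list of earlier instruction indices that i depends on.
    Returns the length of the longest dependence chain (min cycles on
    an idealized machine where every instruction takes one cycle)."""
    depth = []
    for preds in deps:
        depth.append(1 + max((depth[p] for p in preds), default=0))
    return max(depth)

# Four independent instructions: everything can run in one cycle.
assert critical_path([[], [], [], []]) == 1
# A chain i0 -> i1 -> i2 -> i3: four cycles, however wide the machine.
assert critical_path([[], [0], [1], [2]]) == 4
```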
127 00:05:29,980 --> 00:05:32,230 If you can say in all your program there's nothing 128 00:05:32,230 --> 00:05:35,190 dependent -- every instruction just can go any time -- 129 00:05:35,190 --> 00:05:38,390 then you can say the best computer will get done in one 130 00:05:38,390 --> 00:05:40,180 cycle, because everything can run. 131 00:05:40,180 --> 00:05:43,285 But if you say the next instruction is dependent on 132 00:05:43,285 --> 00:05:44,760 the previous one, the next instruction is dependent on 133 00:05:44,760 --> 00:05:46,610 the previous one, you have a chain. 134 00:05:46,610 --> 00:05:48,620 And no matter how good the hardware, you have to wait 135 00:05:48,620 --> 00:05:50,580 till that chain finishes. 136 00:05:50,580 --> 00:05:54,310 And you don't get that much parallelism. 137 00:05:54,310 --> 00:05:57,150 So the goal is to exploit parallelism by preserving the 138 00:05:57,150 --> 00:06:01,290 program order where it affects the outcome of the program. 139 00:06:01,290 --> 00:06:04,190 So if we want to have a look and feel like the program is 140 00:06:04,190 --> 00:06:07,620 run on a nice single-issue machine that does one 141 00:06:07,620 --> 00:06:09,760 instruction after another after another, that's the 142 00:06:09,760 --> 00:06:10,690 world we are looking in. 143 00:06:10,690 --> 00:06:13,730 And then we are doing all this underneath to kind of get 144 00:06:13,730 --> 00:06:17,540 performance, but give that abstraction. 145 00:06:17,540 --> 00:06:21,370 So there are other dependences that we can do better. 146 00:06:21,370 --> 00:06:23,850 There are two types of name dependences. 147 00:06:23,850 --> 00:06:29,450 That means there's no real program use of data, but there 148 00:06:29,450 --> 00:06:31,340 are limited resources in the program. 149 00:06:31,340 --> 00:06:33,130 And you have resource contentions. 150 00:06:33,130 --> 00:06:38,430 So the two types of resources are registers and memory. 
151 00:06:38,430 --> 00:06:40,610 So you have these resource contentions. 152 00:06:40,610 --> 00:06:45,830 The first name dependence is what we call anti-dependence. 153 00:06:45,830 --> 00:06:48,230 Anti-dependence means that -- 154 00:06:54,840 --> 00:06:57,640 what I need to do is, I want to write this register. 155 00:06:57,640 --> 00:06:59,180 But in the previous instruction I'm actually 156 00:06:59,180 --> 00:07:02,110 reading the register. 157 00:07:02,110 --> 00:07:03,460 Because I'm writing the next one, I'm not 158 00:07:03,460 --> 00:07:05,270 really using the value. 159 00:07:05,270 --> 00:07:08,220 But I cannot write it until I have read that value. 160 00:07:08,220 --> 00:07:10,960 Because the minute I write it, I lose the previous value. 161 00:07:10,960 --> 00:07:14,270 And if I haven't used it, I'm out of luck. 162 00:07:14,270 --> 00:07:18,070 So there might be a case that I have a register, that I'm 163 00:07:18,070 --> 00:07:20,740 reading the register and rewriting it with some new value. 164 00:07:20,740 --> 00:07:22,990 But I have to wait till the reading is done before I do 165 00:07:22,990 --> 00:07:24,240 this new write. 166 00:07:24,240 --> 00:07:26,180 And that's called anti-dependence. 167 00:07:26,180 --> 00:07:30,900 So what that means is we have to wait to run this 168 00:07:30,900 --> 00:07:33,410 instruction until this is all done -- you can't do 169 00:07:33,410 --> 00:07:36,960 it all before that. 170 00:07:36,960 --> 00:07:41,230 So this is called a Write After Read, as I said, in the 171 00:07:41,230 --> 00:07:42,460 architecture jargon. 172 00:07:42,460 --> 00:07:44,850 The other name dependence is what you call output 173 00:07:44,850 --> 00:07:46,470 dependence. 174 00:07:46,470 --> 00:07:50,550 Two guys are writing the register, and 175 00:07:50,550 --> 00:07:51,720 then I'm reading it. 176 00:07:51,720 --> 00:07:55,710 So I want to read the value the last guy wrote. 
177 00:07:55,710 --> 00:07:59,080 So if I reorder that, I get a wrong value. 178 00:07:59,080 --> 00:08:01,020 Actually you can even do better in here. 179 00:08:01,020 --> 00:08:02,806 How can you do better in here? 180 00:08:02,806 --> 00:08:03,640 AUDIENCE: You can eliminate I. 181 00:08:03,640 --> 00:08:03,812 PROFESSOR: Yeah. 182 00:08:03,812 --> 00:08:05,580 You can eliminate the first one, because nobody's using 183 00:08:05,580 --> 00:08:06,330 that value. 184 00:08:06,330 --> 00:08:09,740 So you can go even further and further, but 185 00:08:09,740 --> 00:08:10,680 this is also a hazard. 186 00:08:10,680 --> 00:08:15,730 This is called a Write After Write hazard. 187 00:08:15,730 --> 00:08:20,260 And the interesting thing is by doing what you call 188 00:08:20,260 --> 00:08:23,650 register renaming, you can eliminate these things. 189 00:08:23,650 --> 00:08:26,420 So why do both have to use the same register? 190 00:08:26,420 --> 00:08:29,050 In these two, if I use a different register I don't 191 00:08:29,050 --> 00:08:30,770 have that dependency. 192 00:08:30,770 --> 00:08:35,720 And so a lot of times in software, and also in modern 193 00:08:35,720 --> 00:08:39,200 superscalar hardware, there's this huge amount of hardware 194 00:08:39,200 --> 00:08:41,650 resources that actually do register renaming. 195 00:08:41,650 --> 00:08:43,400 So they realized that it's an anti-dependence or an output 196 00:08:43,400 --> 00:08:44,220 dependence, and said -- "Wait a minute. 197 00:08:44,220 --> 00:08:45,280 Why do I even have to do that? 198 00:08:45,280 --> 00:08:47,450 I can use a different register." So even 199 00:08:47,450 --> 00:08:49,130 though you have -- 200 00:08:49,130 --> 00:08:52,260 Intel basically [UNINTELLIGIBLE] 201 00:08:52,260 --> 00:08:54,340 only has eight registers. 202 00:08:54,340 --> 00:08:56,060 There are about 100 registers behind. 
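[The renaming idea can be sketched in a few lines of Python. This is an editor's illustration of the general technique, not the hardware's actual algorithm; the register names and instruction sequence are made up.]

```python
# Sketch: renaming architectural registers to fresh "physical" registers
# removes write-after-read and write-after-write hazards, leaving only
# the true (read-after-write) dependences.
def rename(instructions):
    """instructions: list of (dest, srcs) using architectural names."""
    mapping = {}       # architectural name -> current physical name
    counter = 0
    renamed = []
    for dest, srcs in instructions:
        # reads use the current mapping, so true dependences are preserved
        phys_srcs = [mapping.get(s, s) for s in srcs]
        # every write gets a brand-new physical register, so a later
        # write to the same architectural register no longer conflicts
        counter += 1
        phys_dest = f"p{counter}"
        mapping[dest] = phys_dest
        renamed.append((phys_dest, phys_srcs))
    return renamed

# r1 = r2 + r3 ; r2 = r1 * 2 ; r1 = r4 + r5
# (RAW on r1, WAR on r2 and r1, WAW on r1)
prog = [("r1", ["r2", "r3"]), ("r2", ["r1"]), ("r1", ["r4", "r5"])]
out = rename(prog)
assert out[0][0] != out[2][0]   # the two writes to r1 no longer collide
assert out[1][1] == ["p1"]      # the true dependence on r1 survives
```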
203 00:08:56,060 --> 00:08:58,190 Hardware registers just basically let you do this 204 00:08:58,190 --> 00:09:01,120 reordering and renaming -- register renaming. 205 00:09:03,670 --> 00:09:05,660 So the other type of dependence is control 206 00:09:05,660 --> 00:09:07,170 dependence. 207 00:09:07,170 --> 00:09:11,150 So what that means is if you have a program like this, you 208 00:09:11,150 --> 00:09:13,630 have to preserve the program ordering. 209 00:09:13,630 --> 00:09:19,300 And what that means is S1 is control dependent on p1. 210 00:09:19,300 --> 00:09:22,475 Because depending on what p1 is, it will determine whether 211 00:09:22,475 --> 00:09:23,870 this one gets executed. 212 00:09:23,870 --> 00:09:27,370 S2 is control dependent on p2, but not p1. 213 00:09:27,370 --> 00:09:32,550 So it doesn't matter what p1 does, S2 will execute only if 214 00:09:32,550 --> 00:09:34,800 p2 is true. 215 00:09:34,800 --> 00:09:36,260 So there's a control dependence in there. 216 00:09:39,880 --> 00:09:42,900 Another interesting thing is control dependence may -- you 217 00:09:42,900 --> 00:09:45,190 don't need to preserve it all the time. 218 00:09:45,190 --> 00:09:48,250 You might be able to do things out of this order. 219 00:09:48,250 --> 00:09:51,050 Basically, what you can do is if you are willing to do more 220 00:09:51,050 --> 00:09:53,440 work, you can say -- "Well, I will do this. 221 00:09:53,440 --> 00:09:55,590 I don't know that I really need it, because I don't know 222 00:09:55,590 --> 00:09:56,950 whether the p2 is true or not. 223 00:09:56,950 --> 00:09:58,210 But I'll just keep doing it. 224 00:09:58,210 --> 00:10:02,800 And then if I really wanted, I'll actually have the results 225 00:10:02,800 --> 00:10:07,170 ready for me." And that's called speculative execution. 226 00:10:07,170 --> 00:10:08,550 So you can do speculation. 227 00:10:08,550 --> 00:10:10,220 You speculatively think that you will need 228 00:10:10,220 --> 00:10:11,470 something, and go do it. 
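[The control-dependence example the lecture refers to ("a program like this") is presumably on a slide; a minimal Python rendering of it, added by the editor, would be:]

```python
# Control dependence: whether S1 runs depends only on p1, and whether
# S2 runs depends only on p2 -- matching the lecture's example.
def run(p1, p2):
    executed = []
    if p1:
        executed.append("S1")   # S1 is control dependent on p1
    if p2:
        executed.append("S2")   # S2 is control dependent on p2, not p1
    return executed

assert run(False, True) == ["S2"]         # S2 runs no matter what p1 was
assert run(True, True) == ["S1", "S2"]
```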
229 00:10:14,320 --> 00:10:18,000 Speculation provides you with a lot of increased ILP, 230 00:10:18,000 --> 00:10:21,320 because it can overcome control dependence by 231 00:10:21,320 --> 00:10:24,620 executing through branches, before even you know where the 232 00:10:24,620 --> 00:10:25,700 branch is going. 233 00:10:25,700 --> 00:10:28,120 And a lot of times you can go through both directions, and 234 00:10:28,120 --> 00:10:29,900 say -- "Wait a minute, I don't know which way I'm going. 235 00:10:29,900 --> 00:10:33,210 I'll do both sides." And I know at least one side you are 236 00:10:33,210 --> 00:10:34,230 going, and that will be useful. 237 00:10:34,230 --> 00:10:37,090 And you can go more and more, and soon you see that you are 238 00:10:37,090 --> 00:10:39,170 doing so much more work than actually will be useful. 239 00:10:41,890 --> 00:10:45,710 So the first level of speculation is -- speculation 240 00:10:45,710 --> 00:10:48,780 basically says, you go, you fetch, issue, and execute 241 00:10:48,780 --> 00:10:49,240 everything. 242 00:10:49,240 --> 00:10:52,060 You do everything to the end without committing -- you 243 00:10:52,060 --> 00:10:55,160 wait at the commit to make sure that the right thing 244 00:10:55,160 --> 00:10:56,000 actually happened. 245 00:10:56,000 --> 00:10:58,800 So this is the full speculation. 246 00:10:58,800 --> 00:11:02,140 There's a little bit of less speculation called dynamic 247 00:11:02,140 --> 00:11:02,580 scheduling. 248 00:11:02,580 --> 00:11:04,760 If you look at a microprocessor, one of the 249 00:11:04,760 --> 00:11:09,120 biggest causes of a pipeline stall is a branch. 250 00:11:09,120 --> 00:11:12,430 You can't keep even a pipeline going, even in a single-issue 251 00:11:12,430 --> 00:11:14,520 machine, if there's a branch, because the branch condition 252 00:11:14,520 --> 00:11:15,470 doesn't get resolved 253 00:11:15,470 --> 00:11:18,750 until after the next instruction has to get fetched. 
254 00:11:18,750 --> 00:11:21,100 So if you do a normal thing, you just have to 255 00:11:21,100 --> 00:11:22,870 stall the pipeline. 256 00:11:22,870 --> 00:11:29,800 So what dynamic scheduling or a branch predictor sometimes 257 00:11:29,800 --> 00:11:31,880 does is, it will say I will predict where 258 00:11:31,880 --> 00:11:33,660 the branch is going. 259 00:11:33,660 --> 00:11:35,730 So I might not have fetched both directions, but I will 260 00:11:35,730 --> 00:11:38,340 speculatively go fetch down one path, because it looks 261 00:11:38,340 --> 00:11:39,620 like that's where it's going. 262 00:11:39,620 --> 00:11:42,890 For many times, like for example in a loop, 99% of the 263 00:11:42,890 --> 00:11:44,935 time you are going on the back edge, because you don't 264 00:11:44,935 --> 00:11:45,450 fall through. 265 00:11:45,450 --> 00:11:46,750 And then if you predict that you are mostly 266 00:11:46,750 --> 00:11:47,580 [UNINTELLIGIBLE]. 267 00:11:47,580 --> 00:11:49,730 So the branch predictors are pretty good at finding these 268 00:11:49,730 --> 00:11:50,870 kind of cases. 269 00:11:50,870 --> 00:11:53,710 There are very few branches that are kind of 50-50. 270 00:11:53,710 --> 00:11:56,260 Most branches have a preferred path. 271 00:11:56,260 --> 00:11:58,780 If you find the preferred path you can go through that, and 272 00:11:58,780 --> 00:12:00,200 you don't pay any penalty. 273 00:12:00,200 --> 00:12:01,860 The penalty is if you made a mistake, you had to kind of 274 00:12:01,860 --> 00:12:03,450 back up a few times. 275 00:12:03,450 --> 00:12:05,490 So you can at least go in one direction. 276 00:12:05,490 --> 00:12:08,240 Most hardware does that, even the simplest things do that. 277 00:12:08,240 --> 00:12:10,550 But if you do good speculation you go both. 278 00:12:10,550 --> 00:12:13,150 You say -- "Eh, there's a chance if I go down that path 279 00:12:13,150 --> 00:12:13,900 I'm going to lose a lot. 
280 00:12:13,900 --> 00:12:18,920 So I'll do that, too." So that does a lot of expensive stuff. 281 00:12:18,920 --> 00:12:23,080 And basically this is moving more toward a data flow model. 282 00:12:23,080 --> 00:12:26,160 So as soon as data gets available you don't think too 283 00:12:26,160 --> 00:12:30,150 much about control, you keep firing that. 284 00:12:30,150 --> 00:12:36,780 So today's superscalar processors have a huge amount of 285 00:12:36,780 --> 00:12:37,460 speculation. 286 00:12:37,460 --> 00:12:39,290 You speculate on everything. 287 00:12:39,290 --> 00:12:40,170 Branch prediction. 288 00:12:40,170 --> 00:12:42,690 You assume all the branches -- multiple levels down you 289 00:12:42,690 --> 00:12:43,470 predict, and go with that. 290 00:12:43,470 --> 00:12:44,360 Value prediction. 291 00:12:44,360 --> 00:12:45,960 You look at it and say -- "Hey, I think it's going to be 292 00:12:45,960 --> 00:12:50,450 two." And in fact there's a paper that says about 80% of 293 00:12:50,450 --> 00:12:51,700 program values are zero. 294 00:12:55,130 --> 00:12:56,060 And then you say -- "OK. 295 00:12:56,060 --> 00:12:57,510 I'll think it's zero, and it'll go on. 296 00:12:57,510 --> 00:12:59,530 And if it is not zero, I'll have to come back and do 297 00:12:59,530 --> 00:13:00,610 that." So things like that. 298 00:13:00,610 --> 00:13:02,437 AUDIENCE: Do you know what percentage of the time it has 299 00:13:02,437 --> 00:13:03,870 to go back? 300 00:13:03,870 --> 00:13:08,350 PROFESSOR: A lot of times I think it is probably an 80-20 301 00:13:08,350 --> 00:13:11,420 type thing, but if you do too much you're always backing up. 302 00:13:11,420 --> 00:13:13,310 But you can at least do a few things down 303 00:13:13,310 --> 00:13:14,650 assuming it's zero. 304 00:13:14,650 --> 00:13:16,260 So things like that. 305 00:13:16,260 --> 00:13:21,530 People try to take advantage of the statistical nature of 306 00:13:21,530 --> 00:13:24,690 programs. And you are mining every day. 
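[The statistical bias the lecture describes -- most branches have a preferred path -- is what predictors exploit. A minimal editor's sketch of one classic scheme, a 2-bit saturating counter, under made-up loop-branch outcomes:]

```python
# A minimal 2-bit saturating-counter branch predictor. Real predictors
# index tables of such counters by branch address; this is a sketch.
class TwoBitPredictor:
    def __init__(self):
        self.state = 0          # 0,1 -> predict not-taken; 2,3 -> predict taken

    def predict(self):
        return self.state >= 2  # True means "predict taken"

    def update(self, taken):
        # saturate at 0 and 3 so one odd outcome doesn't flip the prediction
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A loop back edge: taken 99 times, then falls through once at loop exit.
predictor = TwoBitPredictor()
outcomes = [True] * 99 + [False]
correct = 0
for taken in outcomes:
    if predictor.predict() == taken:
        correct += 1
    predictor.update(taken)

assert correct == 97   # only the two warm-up guesses and the exit miss
```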
307 00:13:24,690 --> 00:13:29,160 So basically there's no -- 308 00:13:29,160 --> 00:13:30,420 it's almost at the entropy. 309 00:13:30,420 --> 00:13:33,030 So every bit of information is kind of taken advantage of 310 00:13:33,030 --> 00:13:37,370 in the program, but what that means is you are wasting a lot 311 00:13:37,370 --> 00:13:38,470 of cycles. 312 00:13:38,470 --> 00:13:40,740 So the conventional wisdom was -- 313 00:13:40,740 --> 00:13:42,610 "You have Moore's Law. 314 00:13:42,610 --> 00:13:43,920 You keep getting these transistors. 315 00:13:43,920 --> 00:13:47,680 There's nothing to do with it, so let me do more other work. 316 00:13:47,680 --> 00:13:50,080 We'll predicate, we'll do additional work, we'll go 317 00:13:50,080 --> 00:13:52,560 through multiple branches, we'll assume things are zero. 318 00:13:52,560 --> 00:13:54,110 Because what's wasted? 319 00:13:54,110 --> 00:13:57,580 Because it's extra work, if it is wrong we just give it up." 320 00:13:57,580 --> 00:14:00,380 So that's the way it went, and the thing is it's very 321 00:14:00,380 --> 00:14:00,895 inefficient. 322 00:14:00,895 --> 00:14:03,900 Because a lot of times you are doing -- think about even a 323 00:14:03,900 --> 00:14:04,960 simple cache. 324 00:14:04,960 --> 00:14:07,580 If you have a 4-way set-associative cache. 325 00:14:07,580 --> 00:14:09,700 Every cycle when you're doing a memory fetch, you are 326 00:14:09,700 --> 00:14:14,140 fetching on all four, assuming one of them will have hit. 327 00:14:14,140 --> 00:14:17,480 Even if you have a cache hit where only one bank is hit, 328 00:14:17,480 --> 00:14:19,350 and all the other three banks are not hit. 329 00:14:19,350 --> 00:14:21,750 So you are just doing a lot more extra work 330 00:14:21,750 --> 00:14:23,340 just to get one thing. 331 00:14:23,340 --> 00:14:26,580 Of course because if you wait to figure out which bank, it's 332 00:14:26,580 --> 00:14:28,000 going to add a little bit more delay. 
333 00:14:28,000 --> 00:14:28,710 So you want to do it in parallel. 334 00:14:28,710 --> 00:14:30,790 You know that it's going to be one of the lines, so you 335 00:14:30,790 --> 00:14:32,900 just go fetch everything and then later decide 336 00:14:32,900 --> 00:14:33,840 which one you want. 337 00:14:33,840 --> 00:14:38,390 So things like that really waste energy. 338 00:14:38,390 --> 00:14:41,560 And what has been happening in the last 10 years is you 339 00:14:41,560 --> 00:14:44,320 double the amount of transistors, and you add 5% 340 00:14:44,320 --> 00:14:46,060 more performance gain. 341 00:14:46,060 --> 00:14:49,470 Because statistically you have mined most of the 342 00:14:49,470 --> 00:14:51,260 low-hanging fruit, there's nothing much left. 343 00:14:51,260 --> 00:14:53,800 So you're getting to a point that has a little bit of a 344 00:14:53,800 --> 00:14:56,280 statistical significance, and you go after that. 345 00:14:56,280 --> 00:14:59,200 So of course, most of the time it's wrong. 346 00:14:59,200 --> 00:15:03,060 So this leads to this chart that actually yesterday I also 347 00:15:03,060 --> 00:15:03,730 pointed out. 348 00:15:03,730 --> 00:15:06,220 So you are going from hot plate to nuclear reactor, to 349 00:15:06,220 --> 00:15:08,790 rocket nozzle. 350 00:15:08,790 --> 00:15:10,400 We tend to be going in that direction. 351 00:15:10,400 --> 00:15:12,390 That is the path, because we are just doing all these 352 00:15:12,390 --> 00:15:14,450 wasteful things. 353 00:15:14,450 --> 00:15:18,230 And right now, the power consumption on processors is 354 00:15:18,230 --> 00:15:21,420 significant enough in both things like laptops -- 355 00:15:21,420 --> 00:15:24,110 because the battery's not getting better -- as well as 356 00:15:24,110 --> 00:15:25,220 things like Google. 357 00:15:25,220 --> 00:15:28,360 So doing this extra useless work is 358 00:15:28,360 --> 00:15:29,610 actually starting to impact. 
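[The parallel-probe trade-off in a set-associative cache can be sketched briefly. This is an editor's illustration with made-up tags and data, not from the lecture:]

```python
# Sketch of a 4-way set-associative lookup: hardware reads the tag and
# data from all four ways in parallel, then selects the hitting way.
# Three of the four reads are wasted work -- the price of low latency.
def lookup(set_ways, tag):
    # "parallel" probe: every way is read regardless of which one hits
    probes = [(way_tag == tag, data) for way_tag, data in set_ways]
    for hit, data in probes:
        if hit:
            return data
    return None   # cache miss

ways = [("t7", "A"), ("t3", "B"), ("t9", "C"), ("t1", "D")]
assert lookup(ways, "t9") == "C"
assert lookup(ways, "t5") is None
```

Serializing this (compute which way, then read only that one) would save the three wasted reads but put the way-selection delay on the critical path, which is exactly the trade-off described above.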
359 00:15:32,670 --> 00:15:34,980 So for example, if you look at something like Pentium. 360 00:15:34,980 --> 00:15:40,310 You have 11 pipeline stages. 361 00:15:40,310 --> 00:15:45,350 You can execute 3 x86 instructions per cycle. 362 00:15:45,350 --> 00:15:49,770 So you're doing this huge superscalar thing, but 363 00:15:49,770 --> 00:15:52,750 something that had been creeping in lately is also 364 00:15:52,750 --> 00:15:55,700 some amount of explicit parallelism. 365 00:15:55,700 --> 00:15:58,780 So they introduced things like MMX and SSE instructions. 366 00:15:58,780 --> 00:16:01,280 They are explicit parallelism, visible to the user. 367 00:16:01,280 --> 00:16:03,670 So it's not hiding, trying to get parallelism. 368 00:16:03,670 --> 00:16:06,980 So we have been slowly moving to this kind of model, saying 369 00:16:06,980 --> 00:16:09,670 if you want performance you have to do something manual. 370 00:16:09,670 --> 00:16:11,580 So people who really cared about performance had 371 00:16:11,580 --> 00:16:12,490 to deal with that. 372 00:16:12,490 --> 00:16:17,450 And of course, we put multiple chips together to build a 373 00:16:17,450 --> 00:16:19,250 multiprocessor -- 374 00:16:19,250 --> 00:16:22,120 it's not in a single chip -- that actually does parallel 375 00:16:22,120 --> 00:16:22,800 processing. 376 00:16:22,800 --> 00:16:28,270 So for about three, four years, if you bought a workstation it 377 00:16:28,270 --> 00:16:30,320 had two processors sitting in there. 378 00:16:30,320 --> 00:16:32,650 So dual processor, quad processor machines came about, 379 00:16:32,650 --> 00:16:33,820 and people started using that. 380 00:16:33,820 --> 00:16:37,240 So it's not like we are doing this shift abruptly, we have 381 00:16:37,240 --> 00:16:39,770 been going in that direction. 382 00:16:39,770 --> 00:16:41,880 People who really cared about performance actually 383 00:16:41,880 --> 00:16:43,770 had to deal with that and were actually using that. 
384 00:16:46,960 --> 00:16:47,580 OK. 385 00:16:47,580 --> 00:16:49,380 So let's switch gears a little bit and do explicit 386 00:16:49,380 --> 00:16:50,220 parallelism. 387 00:16:50,220 --> 00:16:51,980 So this is kind of where we are -- 388 00:16:51,980 --> 00:16:55,500 where we are today, where we are switching. 389 00:16:55,500 --> 00:17:00,740 So basically, these are the machines where parallelism is 390 00:17:00,740 --> 00:17:02,410 exposed to software -- at least to the compiler. 391 00:17:02,410 --> 00:17:05,890 So you might not see it as a user, but it's exposed to some 392 00:17:05,890 --> 00:17:07,020 layer of software. 393 00:17:07,020 --> 00:17:09,210 And there are many different forms of it. 394 00:17:09,210 --> 00:17:15,110 From very loosely coupled multiprocessors sitting on a 395 00:17:15,110 --> 00:17:19,610 board, or even sitting on multiple machines -- things 396 00:17:19,610 --> 00:17:22,460 like a cluster of workstations -- 397 00:17:22,460 --> 00:17:24,030 to very tightly coupled machines. 398 00:17:24,030 --> 00:17:26,290 So we'll go through, and figure out what are all the 399 00:17:26,290 --> 00:17:27,590 flavors of these things. 400 00:17:27,590 --> 00:17:28,625 AUDIENCE: Excuse me. 401 00:17:28,625 --> 00:17:29,142 PROFESSOR: Mhmm? 402 00:17:29,142 --> 00:17:31,830 AUDIENCE: So does it mean that, since there's this level of 403 00:17:31,830 --> 00:17:35,900 parallelism, the processor can exploit the fact that the 404 00:17:35,900 --> 00:17:37,740 compiler knows the higher level instructions? 405 00:17:37,740 --> 00:17:38,950 Does that make any difference? 406 00:17:38,950 --> 00:17:40,410 PROFESSOR: It goes both ways. 407 00:17:40,410 --> 00:17:45,730 So what the processor knows is it knows values for everything. 408 00:17:45,730 --> 00:17:49,200 So it has full exact knowledge of what's going on. 409 00:17:49,200 --> 00:17:51,620 The compiler is working on an abstraction. 410 00:17:51,620 --> 00:17:54,200 In that sense, the processor wins in those. 
411 00:17:54,200 --> 00:17:56,730 On the other hand, compile time doesn't 412 00:17:56,730 --> 00:17:58,160 affect the run time. 413 00:17:58,160 --> 00:18:00,670 So the compiler has a much bigger view of the world. 414 00:18:03,440 --> 00:18:05,750 Even the most aggressive processor can't look ahead 415 00:18:05,750 --> 00:18:07,940 more than 100 instructions. 416 00:18:07,940 --> 00:18:09,760 On the other hand, the compiler sees ahead of 417 00:18:09,760 --> 00:18:11,600 millions of instructions. 418 00:18:11,600 --> 00:18:14,280 And so the compiler has the ability to kind of get the big 419 00:18:14,280 --> 00:18:16,960 picture and do things -- global kind of things. 420 00:18:16,960 --> 00:18:19,360 But on the other hand, it loses out when it doesn't have 421 00:18:19,360 --> 00:18:20,650 information. 422 00:18:20,650 --> 00:18:23,130 Whereas when you do the hardware parallelism, you have 423 00:18:23,130 --> 00:18:23,870 full information. 424 00:18:23,870 --> 00:18:24,840 AUDIENCE: You don't have to give up one at 425 00:18:24,840 --> 00:18:27,290 the loss of the other. 426 00:18:27,290 --> 00:18:29,490 PROFESSOR: The thing is, I don't think we have a good way 427 00:18:29,490 --> 00:18:31,540 of combining both very well. 428 00:18:31,540 --> 00:18:34,350 Because the thing is, sometimes global optimization 429 00:18:34,350 --> 00:18:36,140 needs local information, and that's not 430 00:18:36,140 --> 00:18:38,150 available at run time. 431 00:18:38,150 --> 00:18:40,396 And global optimization is very costly, so you can't say 432 00:18:40,396 --> 00:18:43,860 -- "OK, I'm going to do it any time." So I think it's kind of 433 00:18:43,860 --> 00:18:45,720 even hybrid things. 434 00:18:45,720 --> 00:18:47,800 There's no nice mesh in there. 435 00:18:51,960 --> 00:18:55,540 So if you think a little bit about parallelism, one 436 00:18:55,540 --> 00:18:58,100 interesting thing is this Little's Law. 
437 00:18:58,100 --> 00:19:05,500 Little's Law says parallelism is the product of 438 00:19:05,500 --> 00:19:07,020 throughput and latency. 439 00:19:09,840 --> 00:19:14,980 So the way to think about that is the parallelism is dictated 440 00:19:14,980 --> 00:19:16,500 by the program in some sense. 441 00:19:16,500 --> 00:19:19,610 The program has a certain amount of parallelism. 442 00:19:19,610 --> 00:19:22,735 So if you have a thing that has a lot of latency to get to 443 00:19:22,735 --> 00:19:27,870 the result, what that means is there's a certain amount of 444 00:19:27,870 --> 00:19:30,850 throughput you can satisfy. 445 00:19:30,850 --> 00:19:34,380 Whereas if you have a thing that has a very low latency 446 00:19:34,380 --> 00:19:37,910 operation, you can go much wider. 447 00:19:37,910 --> 00:19:40,460 So if you look at Intel processors, what they have 448 00:19:40,460 --> 00:19:42,450 done is the superscalars -- 449 00:19:42,450 --> 00:19:45,320 they have actually, to get things faster, a very 450 00:19:45,320 --> 00:19:46,380 long latency. 451 00:19:46,380 --> 00:19:48,210 Because they know they couldn't go more than 452 00:19:48,210 --> 00:19:49,980 three or four wide. 453 00:19:49,980 --> 00:19:52,630 So they went to something like a 55-stage pipeline, three wide. 454 00:19:55,130 --> 00:19:58,140 Because you can go fast, so they assume the 455 00:19:58,140 --> 00:19:59,400 parallelism fits here. 456 00:19:59,400 --> 00:20:00,510 So still you need a lot of parallelism. 457 00:20:00,510 --> 00:20:01,870 So you say -- "Three, why should [UNINTELLIGIBLE] 458 00:20:01,870 --> 00:20:02,940 issue machine. 459 00:20:02,940 --> 00:20:05,600 [UNINTELLIGIBLE] three it's no big deal." But no, if you have 460 00:20:05,600 --> 00:20:12,210 a 55-stage pipeline you need to have 165 parallel instructions 461 00:20:12,210 --> 00:20:14,560 on the fly at any given time. 462 00:20:14,560 --> 00:20:15,410 So that's the thing.
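[Editor's note: the Little's Law arithmetic above can be checked with a minimal sketch. The 3-wide and 55-stage figures are the ones quoted in the lecture; the function name is ours.]

```python
# Little's Law applied to processors: the number of independent
# instructions that must be in flight equals throughput times latency.
# Here throughput is the issue width (instructions per cycle) and
# latency is the pipeline depth (cycles).

def instructions_in_flight(issue_width, pipeline_depth):
    """Little's Law: N = throughput * latency."""
    return issue_width * pipeline_depth

# The numbers quoted above: a 3-wide machine with a 55-stage pipeline
# needs 165 parallel instructions in flight at any given time.
print(instructions_in_flight(3, 55))   # 165
```

So widening the issue or deepening the pipeline both raise the amount of parallelism the program has to supply.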
463 00:20:15,410 --> 00:20:17,960 Even in the modern machine, there are hundreds of 464 00:20:17,960 --> 00:20:19,180 instructions on the fly, because the 465 00:20:19,180 --> 00:20:22,230 pipeline is so large. 466 00:20:22,230 --> 00:20:24,250 So if you said 3-issue, it's not just that. 467 00:20:24,250 --> 00:20:25,890 I mean, this happens in there. 468 00:20:25,890 --> 00:20:29,280 So this gives designers a lot of flexibility in where you 469 00:20:29,280 --> 00:20:30,540 are expanding. 470 00:20:30,540 --> 00:20:34,380 And in some ways you can have a lot -- 471 00:20:34,380 --> 00:20:36,930 there are some machines that are a lot wider, but the 472 00:20:36,930 --> 00:20:38,070 latency is -- 473 00:20:38,070 --> 00:20:41,020 For example, if you look at an Itanium. 474 00:20:41,020 --> 00:20:46,290 Its clock cycle is about half the time of the Pentium, 475 00:20:46,290 --> 00:20:51,160 because it has a lot less latency, but it's a lot wider. 476 00:20:51,160 --> 00:20:52,580 So you can do these kinds of tradeoffs. 477 00:20:55,240 --> 00:20:57,690 Types of parallelism. 478 00:20:57,690 --> 00:21:00,750 There are four categorizations here. 479 00:21:00,750 --> 00:21:03,800 So one categorization is, you have pipelining. 480 00:21:03,800 --> 00:21:07,620 You do the same thing in a pipelined fashion. 481 00:21:07,620 --> 00:21:09,310 So you do the same instruction. 482 00:21:09,310 --> 00:21:12,450 You do a little bit, and you start another copy of another 483 00:21:12,450 --> 00:21:13,250 copy of another copy. 484 00:21:13,250 --> 00:21:15,710 So you kind of pipeline the same thing down here. 485 00:21:15,710 --> 00:21:17,550 Kind of a vector machine -- we'll go through categories 486 00:21:17,550 --> 00:21:19,840 that kind of fit in here. 487 00:21:19,840 --> 00:21:22,390 Another category is data-level parallelism.
488 00:21:22,390 --> 00:21:29,130 What that means is, in a given cycle you do the same thing 489 00:21:29,130 --> 00:21:30,980 many many many many -- 490 00:21:30,980 --> 00:21:33,610 the same instruction for many many things. 491 00:21:33,610 --> 00:21:35,620 And then next cycle you do something 492 00:21:35,620 --> 00:21:37,133 different, stuff like that. 493 00:21:37,133 --> 00:21:39,320 Thread-level parallelism breaks it the other way. 494 00:21:39,320 --> 00:21:41,360 Thread-level parallelism says -- 495 00:21:41,360 --> 00:21:43,980 "I am not connecting the cycles, they are independent. 496 00:21:43,980 --> 00:21:48,590 Each thread can go do something different." 497 00:21:48,590 --> 00:21:50,470 And instruction-level parallelism is kind of a 498 00:21:50,470 --> 00:21:51,280 combination. 499 00:21:51,280 --> 00:21:54,865 What you are doing is, you are doing cycle by cycle -- they 500 00:21:54,865 --> 00:21:57,820 are connected -- and each cycle you do some kind of a 501 00:21:57,820 --> 00:21:59,320 combination of operations. 502 00:21:59,320 --> 00:22:01,170 So if you look at this closely. 503 00:22:01,170 --> 00:22:03,090 So pipelining hits here. 504 00:22:03,090 --> 00:22:05,590 Data parallel execution, things like SIMD 505 00:22:05,590 --> 00:22:06,870 execution, hits here. 506 00:22:06,870 --> 00:22:08,110 Thread-level parallelism. 507 00:22:08,110 --> 00:22:09,520 Instruction-level parallelism. 508 00:22:09,520 --> 00:22:11,530 So these four models of parallelism, what software 509 00:22:11,530 --> 00:22:18,390 people see, kind of fit also into this architecture picture. 510 00:22:18,390 --> 00:22:21,440 So when you are designing a parallel machine, what do you 511 00:22:21,440 --> 00:22:22,800 have to worry about? 512 00:22:22,800 --> 00:22:24,700 The first thing is communication. 513 00:22:24,700 --> 00:22:26,060 That's the big -- 514 00:22:26,060 --> 00:22:27,140 the problem in here.
515 00:22:27,140 --> 00:22:30,930 How do parallel operations communicate the data results? 516 00:22:30,930 --> 00:22:33,490 Because it's not only an issue of bandwidth, 517 00:22:33,490 --> 00:22:35,550 it's an issue of latency. 518 00:22:35,550 --> 00:22:38,300 The thing about bandwidth is that it has been increasing with 519 00:22:38,300 --> 00:22:38,990 Moore's Law. 520 00:22:38,990 --> 00:22:40,600 Latency, speed of light. 521 00:22:40,600 --> 00:22:42,770 So as I pointed out, there's no Moore's Law on the speed of 522 00:22:42,770 --> 00:22:46,540 light, and you have to deal with that. 523 00:22:46,540 --> 00:22:47,650 Synchronization. 524 00:22:47,650 --> 00:22:50,510 So when people do different things, how do you synchronize 525 00:22:50,510 --> 00:22:50,990 at some point? 526 00:22:50,990 --> 00:22:53,550 Because you can't keep going on different paths, at some 527 00:22:53,550 --> 00:22:54,680 point you have to come together. 528 00:22:54,680 --> 00:22:56,270 What's the cost? 529 00:22:56,270 --> 00:22:57,700 What's the process of doing it? 530 00:22:57,700 --> 00:23:01,670 In some stuff it's very explicit -- 531 00:23:01,670 --> 00:23:03,160 you have to deal with that. 532 00:23:03,160 --> 00:23:06,680 In some machines it's built in, so every cycle you are 533 00:23:06,680 --> 00:23:08,840 synchronizing. 534 00:23:08,840 --> 00:23:14,300 So sometimes it makes it easier for you, sometimes it 535 00:23:14,300 --> 00:23:15,940 makes it more inefficient. 536 00:23:15,940 --> 00:23:18,500 So you have to figure out what is in here. 537 00:23:18,500 --> 00:23:20,932 Resource management. 538 00:23:20,932 --> 00:23:23,920 The thing about parallelism is you have a lot of things going 539 00:23:23,920 --> 00:23:28,480 on, and managing that is a very important issue. 540 00:23:28,480 --> 00:23:33,970 Because sometimes if you put things in the wrong place, the 541 00:23:33,970 --> 00:23:36,260 cost of doing that might be much higher.
542 00:23:36,260 --> 00:23:40,890 That really reduces the benefit of doing that. 543 00:23:40,890 --> 00:23:43,010 And finally, scalability. 544 00:23:43,010 --> 00:23:48,070 How do you build processors that not only can do 2x 545 00:23:48,070 --> 00:23:50,110 parallelism, but can do a thousand? 546 00:23:50,110 --> 00:23:52,750 How can you keep growing with Moore's Law? 547 00:23:52,750 --> 00:23:55,960 So there are some ways you can get really good numbers, small 548 00:23:55,960 --> 00:23:58,240 numbers, but as you go bigger and bigger you can't scale. 549 00:24:01,880 --> 00:24:02,850 So here's a classic 550 00:24:02,850 --> 00:24:04,340 classification of parallel machines. 551 00:24:04,340 --> 00:24:07,610 This has been [? divided ?] up by Mike Flynn in 1966. 552 00:24:07,610 --> 00:24:10,310 So he came up with four ways of classifying a machine. 553 00:24:10,310 --> 00:24:12,560 First he looked at how 554 00:24:12,560 --> 00:24:15,100 instructions and data are issued. 555 00:24:15,100 --> 00:24:18,240 So one thing is single instruction, single data. 556 00:24:18,240 --> 00:24:21,080 So there's a single instruction given each cycle, and it 557 00:24:21,080 --> 00:24:22,040 affects single data. 558 00:24:22,040 --> 00:24:25,560 This is your conventional uniprocessor. 559 00:24:25,560 --> 00:24:28,360 Then came the SIMD machine -- single 560 00:24:28,360 --> 00:24:30,160 instruction, multiple data. 561 00:24:30,160 --> 00:24:32,520 So what that means is a given instruction affects 562 00:24:32,520 --> 00:24:34,700 multiple data in here. 563 00:24:34,700 --> 00:24:38,640 So things like -- there are two types, distributed memory 564 00:24:38,640 --> 00:24:39,390 and shared memory. 565 00:24:39,390 --> 00:24:41,270 I'll get to this distinction later. 566 00:24:41,270 --> 00:24:43,120 So there are a bunch of machines.
567 00:24:43,120 --> 00:24:46,010 In the good old times this was a useful trick, because the 568 00:24:46,010 --> 00:24:48,620 sequencer -- or what ran the instructions -- was a pretty 569 00:24:48,620 --> 00:24:50,930 substantial piece of hardware. 570 00:24:50,930 --> 00:24:54,780 So you build one of them and use it for many, many data. 571 00:24:54,780 --> 00:24:57,030 Even today, in a Pentium, if you are doing a SIMD 572 00:24:57,030 --> 00:24:59,640 instruction, you just issue one instruction, it affects 573 00:24:59,640 --> 00:25:04,400 multiple data, and you can get a nice reuse 574 00:25:04,400 --> 00:25:06,190 of instruction decoding. 575 00:25:06,190 --> 00:25:10,920 You reduce the instruction bandwidth by doing SIMD. 576 00:25:10,920 --> 00:25:13,600 Then you go to MIMD, which is Multiple 577 00:25:13,600 --> 00:25:14,920 Instruction, Multiple Data. 578 00:25:14,920 --> 00:25:17,600 So we have multiple instruction streams, each 579 00:25:17,600 --> 00:25:20,510 affecting its own data. 580 00:25:20,510 --> 00:25:23,240 So each instruction stream affects its own data stream 581 00:25:23,240 --> 00:25:23,820 separately. 582 00:25:23,820 --> 00:25:27,250 So things like message passing machines, coherent and 583 00:25:27,250 --> 00:25:28,390 non-coherent shared memory. 584 00:25:28,390 --> 00:25:30,060 I'll go into details of coherence and 585 00:25:30,060 --> 00:25:31,180 non-coherence later. 586 00:25:31,180 --> 00:25:35,060 There are multiple categories within that too. 587 00:25:35,060 --> 00:25:38,090 And then finally, there's kind of a misnomer, MISD. 588 00:25:38,090 --> 00:25:39,520 There hasn't been a single such machine. 589 00:25:39,520 --> 00:25:41,595 It doesn't make sense to have multiple instructions work on 590 00:25:41,595 --> 00:25:42,400 single data. 591 00:25:42,400 --> 00:25:46,000 So this classification, right now -- question?
592 00:25:46,000 --> 00:25:49,070 AUDIENCE: I've heard that [INAUDIBLE] 593 00:25:49,070 --> 00:25:51,040 PROFESSOR: Multiple instruction, single data? 594 00:25:51,040 --> 00:25:53,140 I don't know. 595 00:25:53,140 --> 00:25:55,490 You can try to fit something there just to have something, 596 00:25:55,490 --> 00:26:00,340 but it doesn't fit really well into this kind of thinking. 597 00:26:00,340 --> 00:26:02,340 So I don't like that thinking. 598 00:26:02,340 --> 00:26:04,640 I was thinking how should I do it, so I came up with a new 599 00:26:04,640 --> 00:26:05,830 way of classifying. 600 00:26:05,830 --> 00:26:09,390 So my classification is, what's the last 601 00:26:09,390 --> 00:26:10,350 thing you are sharing? 602 00:26:10,350 --> 00:26:13,440 Because when you are running something, if it is on some 603 00:26:13,440 --> 00:26:16,150 single machine, something has to be shared, and some things 604 00:26:16,150 --> 00:26:17,170 have to be separated. 605 00:26:17,170 --> 00:26:19,830 So are you sharing instructions, are you sharing 606 00:26:19,830 --> 00:26:22,740 the sequencer, are you sharing the memory, are you sharing 607 00:26:22,740 --> 00:26:23,310 the network? 608 00:26:23,310 --> 00:26:27,290 So this kind of fits many things nicely into this model. 609 00:26:27,290 --> 00:26:29,670 So let's go through this model and see 610 00:26:29,670 --> 00:26:30,920 different things in there. 611 00:26:34,960 --> 00:26:38,630 So let's look at shared instruction processors. 612 00:26:38,630 --> 00:26:43,130 So there had been a lot of work in the good old days. 613 00:26:43,130 --> 00:26:48,290 Did anybody know Goodyear actually made supercomputers? 614 00:26:48,290 --> 00:26:50,260 Not only did they make tires, for a long time they were 615 00:26:50,260 --> 00:26:53,390 actually making processors. 616 00:26:53,390 --> 00:26:56,460 GE made processors, stuff like that.
617 00:26:56,460 --> 00:27:00,550 And so a long time ago this was a very interesting 618 00:27:00,550 --> 00:27:04,090 proposition, because there was a huge amount of hardware that 619 00:27:04,090 --> 00:27:08,150 had to be dedicated to doing the sequencing and running the 620 00:27:08,150 --> 00:27:08,830 instruction. 621 00:27:08,830 --> 00:27:11,840 So just to share that was a really interesting concept. 622 00:27:11,840 --> 00:27:14,470 So people built machines that basically -- 623 00:27:14,470 --> 00:27:17,090 single instruction stream affecting 624 00:27:17,090 --> 00:27:18,340 multiple data in there. 625 00:27:18,340 --> 00:27:21,900 I think very well-known machines are things like the 626 00:27:21,900 --> 00:27:26,720 Thinking Machines CM-1, MasPar MP-1 -- 627 00:27:26,720 --> 00:27:31,100 which had 16,000 processors. 628 00:27:31,100 --> 00:27:32,310 Small processors -- 629 00:27:32,310 --> 00:27:35,410 4-bit processors, you can only do 4-bit computation. 630 00:27:35,410 --> 00:27:39,100 And then every cycle you can do 16,000 of them, 4-bit 631 00:27:39,100 --> 00:27:40,640 things in here. 632 00:27:40,640 --> 00:27:43,250 It really fits into the kind of things they could build in 633 00:27:43,250 --> 00:27:45,430 hardware those days. 634 00:27:45,430 --> 00:27:47,400 And there's one controller in there. 635 00:27:47,400 --> 00:27:49,230 So it is just a neat thing, because you can do a lot of 636 00:27:49,230 --> 00:27:51,710 work if you actually can match it in that form. 637 00:27:55,660 --> 00:27:58,760 So the way you look at that is, to run this array you have 638 00:27:58,760 --> 00:28:00,570 this array controller. 639 00:28:00,570 --> 00:28:04,040 And then you have processing elements, a 640 00:28:04,040 --> 00:28:04,750 huge amount of them. 641 00:28:04,750 --> 00:28:07,125 And each processor mainly had distributed memory 642 00:28:07,125 --> 00:28:08,790 -- each has its own memory.
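[Editor's note: the array-controller organization just described can be sketched as a toy model. The class and function names here are made up; real machines like the CM-1 did this in hardware with thousands of processing elements.]

```python
# Toy SIMD array: one controller broadcasts an instruction each cycle,
# and every processing element (PE) executes it on its OWN local memory.

class PE:
    def __init__(self, local_mem):
        self.mem = list(local_mem)   # distributed memory: private per PE

def broadcast(pes, op, dst, a, b):
    """One cycle: every PE does the same operation on its own data."""
    for pe in pes:
        if op == "add":
            pe.mem[dst] = pe.mem[a] + pe.mem[b]

# Four PEs standing in for the 16,000 of a CM-1 / MP-1 class machine.
pes = [PE([i, 10 * i, 0]) for i in range(4)]
broadcast(pes, "add", dst=2, a=0, b=1)   # everybody adds, same cycle
print([pe.mem[2] for pe in pes])         # [0, 11, 22, 33]
```

The sequencer is built once and amortized over all the processing elements, which is exactly the sharing argument made above.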
643 00:28:08,790 --> 00:28:12,150 And so, given an instruction, everybody did the same thing 644 00:28:12,150 --> 00:28:15,310 to memory or arithmetic in there. 645 00:28:15,310 --> 00:28:18,000 And then you had also an interconnect network, so you 646 00:28:18,000 --> 00:28:20,350 can actually send data. 647 00:28:20,350 --> 00:28:21,670 A lot of these things have nearest-neighbor 648 00:28:21,670 --> 00:28:22,580 communication. 649 00:28:22,580 --> 00:28:24,860 You can send data to your nearest neighbor, so everybody kind of 650 00:28:24,860 --> 00:28:29,900 shifts in the 2-D or some kind of torus mapping in there. 651 00:28:29,900 --> 00:28:33,240 And if you can program that, you can get really good 652 00:28:33,240 --> 00:28:34,810 performance in there. 653 00:28:38,150 --> 00:28:39,880 And each cycle, it's very synchronous. 654 00:28:39,880 --> 00:28:42,110 So each cycle everybody does the same thing -- go to the 655 00:28:42,110 --> 00:28:43,360 next thing, do the same thing. 656 00:28:45,860 --> 00:28:49,840 So the next very interesting machine is the Cray-1. 657 00:28:49,840 --> 00:28:51,710 I think this is one of the first successful 658 00:28:51,710 --> 00:28:53,400 supercomputers out there. 659 00:28:53,400 --> 00:28:57,250 So here's the Cray-1, it is this kind of round seat type 660 00:28:57,250 --> 00:28:59,340 thing sitting in here. 661 00:28:59,340 --> 00:29:01,914 Everybody know what was under the seat? 662 00:29:01,914 --> 00:29:03,330 AUDIENCE: Cooling. 663 00:29:03,330 --> 00:29:04,030 PROFESSOR: Cooling. 664 00:29:04,030 --> 00:29:05,520 So here's a photo. 665 00:29:05,520 --> 00:29:08,120 I don't think you can see that -- you can probably look at it 666 00:29:08,120 --> 00:29:09,700 when I put this on the web. This was the 667 00:29:09,700 --> 00:29:10,880 entire cooling mechanism.
668 00:29:10,880 --> 00:29:14,350 In fact Seymour Cray at one time said one of his most 669 00:29:14,350 --> 00:29:16,505 important innovations in this machine was 670 00:29:16,505 --> 00:29:19,000 how to cool the thing. 671 00:29:19,000 --> 00:29:20,270 And this is a generation, again, where 672 00:29:20,270 --> 00:29:22,420 power was a big thing. 673 00:29:22,420 --> 00:29:25,590 So each of these columns had this huge amount of boards 674 00:29:25,590 --> 00:29:28,990 going, and in the middle had all the wiring going. 675 00:29:28,990 --> 00:29:31,900 So we had this huge mess of wiring in the middle -- 676 00:29:31,900 --> 00:29:32,490 [UNINTELLIGIBLE] 677 00:29:32,490 --> 00:29:33,845 -- and then you had all these boards in 678 00:29:33,845 --> 00:29:35,130 there in each of these. 679 00:29:35,130 --> 00:29:37,580 So this is the Cray-1 processor. 680 00:29:37,580 --> 00:29:39,190 AUDIENCE: Do you know your little -- 681 00:29:39,190 --> 00:29:43,630 your laptop is way faster than that Cray -- 682 00:29:43,630 --> 00:29:45,010 PROFESSOR: Yeah. 683 00:29:45,010 --> 00:29:48,655 Did you have the clock speed in here? 684 00:29:48,655 --> 00:29:49,690 [INTERPOSING VOICES] 685 00:29:49,690 --> 00:29:51,930 AUDIENCE: 80 MHz. 686 00:29:51,930 --> 00:29:54,510 PROFESSOR: So, yeah. 687 00:29:54,510 --> 00:29:58,520 And that cost like $10 million or something like 688 00:29:58,520 --> 00:30:01,470 that at that time. 689 00:30:01,470 --> 00:30:03,360 Moore's Law, it's just amazing. 690 00:30:03,360 --> 00:30:05,640 If you think about it, if you apply Moore's Law to any other thing 691 00:30:05,640 --> 00:30:08,330 we have, nothing else compares. 692 00:30:08,330 --> 00:30:11,290 We are very fortunate to be part of that generation. 693 00:30:11,290 --> 00:30:13,040 AUDIENCE: But did it have PowerPoint? 694 00:30:13,040 --> 00:30:16,580 PROFESSOR: So what it had, was it had these 695 00:30:16,580 --> 00:30:17,690 three types of registers.
696 00:30:17,690 --> 00:30:19,550 It had scalar registers, address 697 00:30:19,550 --> 00:30:21,470 registers, and vector registers. 698 00:30:21,470 --> 00:30:23,880 The key thing there is the vector register. 699 00:30:23,880 --> 00:30:28,160 So if you want to do things fast -- 700 00:30:28,160 --> 00:30:29,250 no, fast is not the word. 701 00:30:29,250 --> 00:30:32,510 You can do a lot of computation in a short amount 702 00:30:32,510 --> 00:30:35,840 of time by using the vector registers. 703 00:30:35,840 --> 00:30:40,210 So the way to look at that is, normally when you go to the 704 00:30:40,210 --> 00:30:42,350 execute stage you do one thing. 705 00:30:42,350 --> 00:30:44,670 With a vector register, what happens is it gets pipelined. 706 00:30:44,670 --> 00:30:47,790 So the execute stage happened one word, then the next, next, next. 707 00:30:47,790 --> 00:30:51,660 You can do up to 64 or even bigger. 708 00:30:51,660 --> 00:30:53,380 I think it was length 64 -- length-64 things. 709 00:30:53,380 --> 00:30:55,360 So you can do that with one instruction. 710 00:30:55,360 --> 00:30:58,640 So you do a few of these, and then this stage keeps going on 711 00:30:58,640 --> 00:31:00,590 and on and on, for 64. 712 00:31:00,590 --> 00:31:02,920 And then you can pipeline in the way that you can start 713 00:31:02,920 --> 00:31:04,170 another one. 714 00:31:06,080 --> 00:31:08,220 Actually, this will use the same execution unit, so you have 715 00:31:08,220 --> 00:31:12,230 to wait till that finishes to start. 716 00:31:12,230 --> 00:31:15,430 So you can pipeline to get a huge amount of things going 717 00:31:15,430 --> 00:31:17,200 through the pipeline. 718 00:31:17,200 --> 00:31:20,750 And so each cycle you can graduate many, many 719 00:31:20,750 --> 00:31:21,300 things. 720 00:31:21,300 --> 00:31:22,873 AUDIENCE: Can I ask you a quick question? 721 00:31:22,873 --> 00:31:24,446 Something I'm trying to get straight in my head.
722 00:31:24,446 --> 00:31:26,960 My notion -- and I don't think I'm right on this, that's why 723 00:31:26,960 --> 00:31:31,305 I'm asking you -- is machines like the Cray, I know you were 724 00:31:31,305 --> 00:31:34,285 talking about some of the vector operations, those were 725 00:31:34,285 --> 00:31:36,980 by and large a relatively small set of operations. 726 00:31:36,980 --> 00:31:39,840 Like dot products, and vector times scalar. 727 00:31:39,840 --> 00:31:41,514 On the other hand, when you look at the SIMD machines, 728 00:31:41,514 --> 00:31:43,860 they had a much richer set of operations. 729 00:31:43,860 --> 00:31:49,110 PROFESSOR: I think with scatter-gather and things like 730 00:31:49,110 --> 00:31:53,560 conditional execution, I think vector machines could be a 731 00:31:53,560 --> 00:31:54,670 fairly large -- 732 00:31:54,670 --> 00:31:58,298 I mean it's painful. 733 00:31:58,298 --> 00:32:01,230 AUDIENCE: [INAUDIBLE] 734 00:32:01,230 --> 00:32:03,660 PROFESSOR: The SIMD instructions in the Pentium -- 735 00:32:03,660 --> 00:32:08,410 I think those are mainly targeting signal processing 736 00:32:08,410 --> 00:32:09,660 type stuff. 737 00:32:14,050 --> 00:32:15,260 They don't have real scatter-gather. 738 00:32:15,260 --> 00:32:17,460 AUDIENCE: And the Cell processor? 739 00:32:17,460 --> 00:32:20,370 PROFESSOR: Cell is distributed memory. 740 00:32:20,370 --> 00:32:22,946 AUDIENCE: Yeah, but on one the -- what do they 741 00:32:22,946 --> 00:32:23,490 call them, the -- 742 00:32:23,490 --> 00:32:26,140 PROFESSOR: I don't think you can scatter-gather either. 743 00:32:26,140 --> 00:32:31,260 It's just basically, you have to have aligned words in, words out. 744 00:32:31,260 --> 00:32:33,770 IBM is always about doing aligned access. 745 00:32:33,770 --> 00:32:37,030 So even in AltiVec, you can't do unaligned access. 746 00:32:37,030 --> 00:32:38,000 You had to do aligned access.
747 00:32:38,000 --> 00:32:40,795 So if it is not aligned, you had to pay a 748 00:32:40,795 --> 00:32:43,700 big penalty in there. 749 00:32:43,700 --> 00:32:46,140 So let's look at how this happens. 750 00:32:46,140 --> 00:32:49,320 So you have this entire pipeline thing. 751 00:32:49,320 --> 00:32:52,830 When things get started, the first value at this point is 752 00:32:52,830 --> 00:32:54,200 done in one clock cycle. 753 00:32:54,200 --> 00:32:56,250 The next value is halfway through that. 754 00:32:56,250 --> 00:32:58,460 Another value is in some part of a -- 755 00:32:58,460 --> 00:33:00,550 is also pipelined, the alias pipeline. 756 00:33:00,550 --> 00:33:03,840 And other values are kind of feeding nicely into that. 757 00:33:03,840 --> 00:33:06,940 So if you have one -- this is called one lane. 758 00:33:06,940 --> 00:33:10,520 You can have multiple lanes, and then what you can do is 759 00:33:10,520 --> 00:33:13,230 each cycle you get 40 [UNINTELLIGIBLE] 760 00:33:13,230 --> 00:33:15,000 And the next ones are in the middle of that, 761 00:33:15,000 --> 00:33:16,020 next ones are in the middle. 762 00:33:16,020 --> 00:33:19,310 So what you have is a very pipelined machine, so you can 763 00:33:19,310 --> 00:33:21,290 kind of pipeline things in there. 764 00:33:21,290 --> 00:33:23,290 So you can have either one lane, or multiple lanes 765 00:33:23,290 --> 00:33:25,090 pipelined coming out. 766 00:33:25,090 --> 00:33:27,720 So if you look at the architecture, what you have is 767 00:33:27,720 --> 00:33:30,230 some kind of vector registers feeding into these 768 00:33:30,230 --> 00:33:32,220 kinds of functional units. 769 00:33:32,220 --> 00:33:34,910 So at a given time, in this one you might be able to get 770 00:33:34,910 --> 00:33:38,030 eight results out, because everything gets pipelined. 771 00:33:38,030 --> 00:33:42,330 But the same thing is happening in there. 772 00:33:42,330 --> 00:33:44,720 Clear how vector machines work?
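[Editor's note: a rough timing model for the pipelining just described -- once the pipeline fills, one result streams out per cycle per lane, so a length-64 vector costs roughly depth + 64 - 1 cycles rather than depth x 64. The pipeline depth of 6 here is a made-up illustrative number, not a Cray-1 specification.]

```python
# Rough cycle count for one vector operation on a pipelined functional
# unit: after the pipeline fills (depth cycles), each lane retires one
# element per cycle, so the elements stream out back to back.
import math

def vector_op_cycles(n, depth, lanes=1):
    """Cycles for an n-element vector op on a unit with the given
    pipeline depth, with `lanes` copies of the unit side by side."""
    return depth + math.ceil(n / lanes) - 1

print(vector_op_cycles(64, depth=6))           # pipelined: 69 cycles
print(64 * 6)                                  # unpipelined: 384 cycles
print(vector_op_cycles(64, depth=6, lanes=4))  # four lanes: 21 cycles
```

The startup latency is paid once per vector instruction, which is why long vectors amortize it so well.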
773 00:33:44,720 --> 00:33:46,880 So it's not really parallelism, it's basically -- 774 00:33:46,880 --> 00:33:50,780 especially if you have one lane -- it's a superpipelined thing. 775 00:33:50,780 --> 00:33:53,740 But given one instruction, it will crank out many, many, 776 00:33:53,740 --> 00:33:57,960 many things for that instruction. 777 00:33:57,960 --> 00:34:00,220 And doing parallelism is easy in here too, because it's the 778 00:34:00,220 --> 00:34:02,750 same thing happening to very regular data sets. 779 00:34:02,750 --> 00:34:05,230 So there's no notion of asynchronization and all 780 00:34:05,230 --> 00:34:06,160 these weird things. 781 00:34:06,160 --> 00:34:08,980 It's just a very simple pattern. 782 00:34:08,980 --> 00:34:13,030 So the next thing is the shared sequencer processor. 783 00:34:13,030 --> 00:34:16,990 So here it's similar to the vector machines, because each 784 00:34:16,990 --> 00:34:20,840 cycle you issue a single instruction. 785 00:34:20,840 --> 00:34:24,560 But the instruction is a wide instruction. 786 00:34:24,560 --> 00:34:28,410 It has multiple operations in the same instruction. 787 00:34:28,410 --> 00:34:29,490 So what it says is -- 788 00:34:29,490 --> 00:34:32,190 "I have multiple execution units, I have memory in a 789 00:34:32,190 --> 00:34:35,280 separate unit, and each instruction I will tell each 790 00:34:35,280 --> 00:34:40,060 unit what to do." And so you might have -- 791 00:34:40,060 --> 00:34:43,450 two integer units, two memory/load store units, two 792 00:34:43,450 --> 00:34:44,360 floating-point units. 793 00:34:44,360 --> 00:34:47,190 Each cycle you tell each of them what to do. 794 00:34:47,190 --> 00:34:49,210 So you just kind of keep issuing an instruction that 795 00:34:49,210 --> 00:34:50,330 affects many of them.
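[Editor's note: the wide-instruction idea can be sketched as a bundle of slots, one per functional unit, all issued in the same cycle. This is a toy encoding -- the unit names and operations are made up for illustration, matching the two-integer, two-memory, two-floating-point example above.]

```python
# Toy VLIW: each wide instruction is a bundle with one slot per
# functional unit -- two integer units, two load/store units, two FP
# units. The whole bundle issues in a single cycle; empty slots are
# nops the compiler left unfilled.

UNITS = ["int0", "int1", "mem0", "mem1", "fp0", "fp1"]

def issue(bundle, cycle):
    """Dispatch one wide instruction: every unit gets its slot's
    operation (or a nop) in the same cycle."""
    for unit in UNITS:
        op = bundle.get(unit, "nop")
        print(f"cycle {cycle}: {unit} <- {op}")

issue({"int0": "add r1,r2,r3",
       "mem0": "load r4,(r5)",
       "fp0": "fmul f1,f2,f3"}, cycle=0)
```

The hardware stays simple because the compiler, not the processor, decided which operations can share a cycle.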
796 00:34:50,330 --> 00:34:54,430 So sometimes what happens is, if this one has a latency of four, 797 00:34:54,430 --> 00:34:56,590 you might have to wait till this is done to do the next 798 00:34:56,590 --> 00:34:56,940 instruction. 799 00:34:56,940 --> 00:34:59,560 So if one guy takes long, everybody has to kind 800 00:34:59,560 --> 00:35:00,700 of wait for that. 801 00:35:00,700 --> 00:35:02,180 So it's very synchronous. 802 00:35:02,180 --> 00:35:04,150 So things like synchronization were 803 00:35:04,150 --> 00:35:05,401 not an issue in here. 804 00:35:09,250 --> 00:35:12,430 So if you look at the pipeline, this is what happens. 805 00:35:12,430 --> 00:35:13,970 So you have this instruction. 806 00:35:13,970 --> 00:35:16,800 It's one instruction, but you are fetching a wide 807 00:35:16,800 --> 00:35:17,120 instruction. 808 00:35:17,120 --> 00:35:18,430 You are not just fetching a single instruction. 809 00:35:18,430 --> 00:35:20,630 You decode the entire thing, but you can decode it 810 00:35:20,630 --> 00:35:20,980 separately. 811 00:35:20,980 --> 00:35:23,984 And then you go execute on each execution unit. 812 00:35:26,770 --> 00:35:28,980 One interesting problem here was this 813 00:35:28,980 --> 00:35:31,410 was not really scalable. 814 00:35:31,410 --> 00:35:36,530 What happened here is each functional unit, if you had 815 00:35:36,530 --> 00:35:40,020 one single register file, has to access the register file. 816 00:35:40,020 --> 00:35:42,670 So each functional unit would say -- "I am using register R1," "I 817 00:35:42,670 --> 00:35:46,060 am using R3," "I am using R5." So what has to happen is the 818 00:35:46,060 --> 00:35:48,990 register file has to have -- 819 00:35:48,990 --> 00:35:53,450 basically, if you have eight functional units, 16 outports 820 00:35:53,450 --> 00:35:55,190 and 8 inports coming in.
821 00:35:55,190 --> 00:35:57,270 And then of course, when you build a register file at that 822 00:35:57,270 --> 00:36:01,880 scale, it has huge scalability issues. 823 00:36:01,880 --> 00:36:04,960 So the register file scales quadratically. 824 00:36:04,960 --> 00:36:05,476 Question? 825 00:36:05,476 --> 00:36:07,540 AUDIENCE: The sequencer [INAUDIBLE PHRASE] 826 00:36:10,120 --> 00:36:11,370 PROFESSOR: Yeah. 827 00:36:13,270 --> 00:36:15,820 Basically you had to wait till everybody's done; there's 828 00:36:15,820 --> 00:36:17,820 nothing going out of order. 829 00:36:17,820 --> 00:36:19,150 And memory also. 830 00:36:19,150 --> 00:36:21,950 Since everybody's going to memory, this is not scalable. 831 00:36:21,950 --> 00:36:26,880 So people tried to build -- you can do four, eight wide, but 832 00:36:26,880 --> 00:36:30,760 beyond that this register and memory interconnect became a 833 00:36:30,760 --> 00:36:32,770 big mess to build. 834 00:36:32,770 --> 00:36:36,830 And so one kind of modification people did 835 00:36:36,830 --> 00:36:39,690 was called Clustered VLIW. 836 00:36:39,690 --> 00:36:43,560 So what happens is you have a very wide instruction in here. 837 00:36:43,560 --> 00:36:46,730 It goes to not one cluster, but different clusters. 838 00:36:46,730 --> 00:36:49,940 Each cluster has its own register file, its own kind of 839 00:36:49,940 --> 00:36:52,160 memory interconnect going on there. 840 00:36:52,160 --> 00:36:55,750 And what that means is if you want to do intercluster 841 00:36:55,750 --> 00:36:58,000 communication, you have to go through a very special 842 00:36:58,000 --> 00:37:00,060 communication network. 843 00:37:00,060 --> 00:37:03,000 So you don't have this bandwidth expansion in the register file. 844 00:37:03,000 --> 00:37:06,180 So you only have, we'll say, two execution units, so you 845 00:37:06,180 --> 00:37:10,430 only have to have four out and one in to the 846 00:37:10,430 --> 00:37:11,900 register file each cycle.
847 00:37:11,900 --> 00:37:15,030 And then if you want other communication, you have a much 848 00:37:15,030 --> 00:37:17,600 lower bandwidth interconnect that you'll have 849 00:37:17,600 --> 00:37:18,640 to go through. 850 00:37:18,640 --> 00:37:23,070 So what this does is expose more complexity to 851 00:37:23,070 --> 00:37:28,110 the compiler and software, and the rationale here is most 852 00:37:28,110 --> 00:37:31,380 programs have locality. 853 00:37:31,380 --> 00:37:33,210 It's not like everybody always wants to communicate with 854 00:37:33,210 --> 00:37:35,670 everybody else -- there is some locality in here. 855 00:37:35,670 --> 00:37:38,610 So you can basically cluster things that are local together 856 00:37:38,610 --> 00:37:41,360 and put them in here, and then when other things have to be 857 00:37:41,360 --> 00:37:43,880 communicated you can use this communication network and go about 858 00:37:43,880 --> 00:37:44,210 doing that. 859 00:37:44,210 --> 00:37:48,540 So this is kind of the state of the art in this technology. 860 00:37:48,540 --> 00:37:49,510 And something like -- 861 00:37:49,510 --> 00:37:50,410 what I didn't put -- 862 00:37:50,410 --> 00:37:52,710 Itanium kind of fits in here. 863 00:37:52,710 --> 00:37:55,830 The Itanium processor. 864 00:37:55,830 --> 00:37:59,810 So then we go to shared network. 865 00:37:59,810 --> 00:38:01,570 There has been a lot of work in here. 866 00:38:01,570 --> 00:38:05,410 People have been building multiprocessors for a long 867 00:38:05,410 --> 00:38:07,000 time, because it's a very easy thing to build. 868 00:38:07,000 --> 00:38:09,870 So what you do is -- 869 00:38:09,870 --> 00:38:13,490 if you look at it, you have a processor unit that connects 870 00:38:13,490 --> 00:38:15,000 to its own memory.
871 00:38:15,000 --> 00:38:16,340 And it's like a multiple [UNINTELLIGIBLE] 872 00:38:16,340 --> 00:38:19,840 Then it has a very tightly connected network interface 873 00:38:19,840 --> 00:38:21,820 that goes to an interconnect network. 874 00:38:21,820 --> 00:38:26,170 So we can even think about a workstation farm as this type 875 00:38:26,170 --> 00:38:27,110 of a machine. 876 00:38:27,110 --> 00:38:33,200 But of course, the network is a pretty slow one that requires 877 00:38:33,200 --> 00:38:34,180 an ethernet connection. 878 00:38:34,180 --> 00:38:35,930 But people build things that have much 879 00:38:35,930 --> 00:38:39,060 faster networks in there. 880 00:38:39,060 --> 00:38:41,890 This was designed in a way that you can build hundreds and 881 00:38:41,890 --> 00:38:43,580 thousands of these things -- 882 00:38:43,580 --> 00:38:44,610 nodes in here. 883 00:38:44,610 --> 00:38:48,760 So today if you look at the top 500 supercomputers, a 884 00:38:48,760 --> 00:38:51,530 bunch of them fit into this category because it's very 885 00:38:51,530 --> 00:38:54,510 easy to scale and build very large. 886 00:38:54,510 --> 00:38:56,647 AUDIENCE: Are you doing SMPs in this list, 887 00:38:56,647 --> 00:38:57,670 or some other place? 888 00:38:57,670 --> 00:39:00,020 PROFESSOR: SMP is mostly shared 889 00:39:00,020 --> 00:39:01,750 memory, not shared network. 890 00:39:01,750 --> 00:39:03,000 I'll do shared memory next. 891 00:39:06,500 --> 00:39:09,180 But there are problems with it. 892 00:39:09,180 --> 00:39:12,860 All the data layout has to be handled by software, or by the 893 00:39:12,860 --> 00:39:15,670 programmer basically. 894 00:39:15,670 --> 00:39:18,100 If you want something outside your memory, you have to do 895 00:39:18,100 --> 00:39:19,310 very explicit communication. 896 00:39:19,310 --> 00:39:21,470 Not only you, the other guy who has the data actually has 897 00:39:21,470 --> 00:39:23,420 to cooperate to send it to you.
898 00:39:23,420 --> 00:39:26,320 And he needs to know that now you have the data. 899 00:39:26,320 --> 00:39:29,480 All of that management is your problem. 900 00:39:29,480 --> 00:39:34,020 And that makes programming these kinds of things very 901 00:39:34,020 --> 00:39:36,040 difficult, which you'll probably figure out by the 902 00:39:36,040 --> 00:39:37,080 time you're done with Cell. 903 00:39:37,080 --> 00:39:41,930 So Cell has a lot of these issues, too. 904 00:39:41,930 --> 00:39:45,980 The problem here is not dealing with most of the data, 905 00:39:45,980 --> 00:39:48,200 but the kind of corner cases that you don't 906 00:39:48,200 --> 00:39:49,520 know about that much. 907 00:39:49,520 --> 00:39:51,695 There's no nice safe way of saying -- "I don't know 908 00:39:51,695 --> 00:39:52,850 who's going to access it. 909 00:39:52,850 --> 00:39:54,430 I'll let the hardware take care of it." There's no 910 00:39:54,430 --> 00:39:58,160 hardware, you have to take care of it yourself. 911 00:39:58,160 --> 00:40:02,060 And also message passing has a very high overhead. 912 00:40:02,060 --> 00:40:04,980 Most of the time in order to send a message, you have to invoke 913 00:40:04,980 --> 00:40:06,130 some kind of a kernel thing. 914 00:40:06,130 --> 00:40:08,240 You have to actually do a kernel switch that will call 915 00:40:08,240 --> 00:40:09,400 the network -- 916 00:40:09,400 --> 00:40:11,990 the operating system is involved in the process, basically, of getting a 917 00:40:11,990 --> 00:40:13,850 message in there. 918 00:40:13,850 --> 00:40:16,250 And also when you get a message you have to do 919 00:40:16,250 --> 00:40:21,110 some kind of interrupt or polling, and that's a bunch of 920 00:40:21,110 --> 00:40:22,140 copies out of the kernel. 921 00:40:22,140 --> 00:40:25,040 And this became a pretty expensive proposition.
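The overhead arithmetic here can be put in a one-line cost model. The numbers are purely illustrative -- the per-message overhead is a made-up constant standing in for the kernel switch and copies, not a measurement from the lecture.

```python
def send_cost(n_messages, bytes_each, per_message_overhead=1000, per_byte=1):
    """Cost model: every message pays a fixed kernel/interrupt overhead
    plus a small per-byte cost for the payload itself."""
    return n_messages * (per_message_overhead + bytes_each * per_byte)

naive = send_cost(100, 8)    # 100 tiny messages: fixed overhead dominates
batched = send_cost(1, 800)  # same 800 bytes of payload in one message
print(naive, batched)  # -> 100800 1800
```

Same payload, roughly 50x cheaper once batched, which is why the amortization the lecture describes next is unavoidable on these machines.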
922 00:40:25,040 --> 00:40:27,800 So you can't send messages the size of one [UNINTELLIGIBLE] 923 00:40:27,800 --> 00:40:29,970 so you had to accumulate a huge amount of things to send 924 00:40:29,970 --> 00:40:31,730 out to amortize the cost of doing that. 925 00:40:37,430 --> 00:40:39,590 Sending can be somewhat cheap, but receiving 926 00:40:39,590 --> 00:40:41,180 is a lot more expensive. 927 00:40:41,180 --> 00:40:42,690 Because when receiving you have to multiplex. 928 00:40:42,690 --> 00:40:44,280 You have no idea who it's coming to. 929 00:40:44,280 --> 00:40:46,070 So when you receive, you have to figure out who is 930 00:40:46,070 --> 00:40:47,380 supposed to get it. 931 00:40:47,380 --> 00:40:49,455 Especially if you are running multiple applications, it 932 00:40:49,455 --> 00:40:50,570 might be for someone else's application. 933 00:40:50,570 --> 00:40:51,810 You had to contact [UNINTELLIGIBLE] 934 00:40:51,810 --> 00:40:53,060 So it's a big mess. 935 00:40:55,640 --> 00:40:58,800 That is why people went to shared memory processors, 936 00:40:58,800 --> 00:41:02,040 because it became an easier method to use. 937 00:41:02,040 --> 00:41:05,480 So that is basically the SMPs Alan was talking about. 938 00:41:09,350 --> 00:41:12,160 The nice thing is it will work with any data placement. 939 00:41:12,160 --> 00:41:15,390 It might work very slowly, but at least it will work. 940 00:41:15,390 --> 00:41:18,860 So it makes it very easy to take your existing application 941 00:41:18,860 --> 00:41:21,200 and first get it working, because it's 942 00:41:21,200 --> 00:41:22,880 just working there. 943 00:41:22,880 --> 00:41:25,700 You can choose to optimize only critical sections. 944 00:41:25,700 --> 00:41:27,210 You can say -- "OK, this section I 945 00:41:27,210 --> 00:41:28,290 know is very important. 946 00:41:28,290 --> 00:41:30,380 I will do the right thing, I will place everything 947 00:41:30,380 --> 00:41:33,320 properly."
And the rest of it I can just leave alone, and 948 00:41:33,320 --> 00:41:35,730 it will go and get the data and do it right. 949 00:41:35,730 --> 00:41:38,020 You can run it sequentially, of course, but at least the 950 00:41:38,020 --> 00:41:39,390 memory part I don't have to deal with. 951 00:41:39,390 --> 00:41:43,090 If some other processor just once in a while accesses that data 952 00:41:43,090 --> 00:41:44,940 that you have actually parallelized, it 953 00:41:44,940 --> 00:41:46,010 will actually work. 954 00:41:46,010 --> 00:41:47,690 So you only have to worry about the [UNINTELLIGIBLE] 955 00:41:47,690 --> 00:41:48,940 that you are parallelizing. 956 00:41:51,130 --> 00:41:54,470 And you can communicate using load store instructions. 957 00:41:54,470 --> 00:41:56,710 You don't have to get the operating system involved in order to do that. 958 00:41:56,710 --> 00:41:57,970 And it's a lot lower overhead. 959 00:41:57,970 --> 00:42:02,000 So 5 to 10 cycles, instead of hundreds to thousands of cycles 960 00:42:02,000 --> 00:42:03,030 to do that. 961 00:42:03,030 --> 00:42:05,840 And most of these machines actually supply some 962 00:42:05,840 --> 00:42:08,230 instructions to do this communication very fast. 963 00:42:08,230 --> 00:42:10,430 There's a thing called fetch&op, and a thing called 964 00:42:10,430 --> 00:42:12,580 load linked/store conditional operations. 965 00:42:12,580 --> 00:42:16,125 There are these very special operations where if you are 966 00:42:16,125 --> 00:42:19,760 waiting for somebody else, you can do it very fast, if two 967 00:42:19,760 --> 00:42:21,430 people are communicating. 968 00:42:21,430 --> 00:42:24,550 So people came up with these very fast operations that are 969 00:42:24,550 --> 00:42:26,320 low cost -- at least 970 00:42:26,320 --> 00:42:28,230 if the data's available it will happen very fast. 971 00:42:28,230 --> 00:42:29,480 Synchronization.
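Load linked/store conditional is easiest to see as a pair where the store succeeds only if nothing intervened since the load. Here is a minimal single-threaded Python model of the semantics -- the `Location` class and its version counter are illustrative, not how any real ISA exposes these instructions.

```python
class Location:
    """Models one memory word with load-linked/store-conditional semantics."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0  # bumped on every successful store

    def load_linked(self):
        return self.value, self.version

    def store_conditional(self, new_value, seen_version):
        # Fails if any store happened since the matching load_linked.
        if self.version != seen_version:
            return False
        self.value = new_value
        self.version += 1
        return True

def atomic_add(loc, delta):
    """Retry loop: the classic way a fetch&op is built from LL/SC."""
    while True:
        v, tag = loc.load_linked()
        if loc.store_conditional(v + delta, tag):
            return

counter = Location(0)
for _ in range(10):
    atomic_add(counter, 1)
print(counter.value)  # -> 10
```

The retry loop is the whole trick: in the common uncontended case it takes one pass, which is where the "if the data's available it will happen very fast" claim comes from.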
972 00:42:31,260 --> 00:42:34,820 And when you are starting to build a large system, you can 973 00:42:34,820 --> 00:42:37,820 actually give a logically shared view of memory, but the 974 00:42:37,820 --> 00:42:41,120 underlying hardware can still be distributed memory. 975 00:42:41,120 --> 00:42:42,260 So there's a thing called -- 976 00:42:42,260 --> 00:42:45,060 I will get into it when we do synchronization -- 977 00:42:45,060 --> 00:42:46,290 directory-based cache coherence. 978 00:42:46,290 --> 00:42:48,630 So you give a nice, simple view of memory. 979 00:42:48,630 --> 00:42:50,250 But of course memory is really distributed. 980 00:42:50,250 --> 00:42:52,790 So that kind of gives the best of both worlds. 981 00:42:52,790 --> 00:42:55,150 So you can keep scaling and build large machines, but the 982 00:42:55,150 --> 00:42:59,450 view is a very simple view of the machine. 983 00:42:59,450 --> 00:43:00,920 So there are two categories in here. 984 00:43:00,920 --> 00:43:03,660 One is non-cache coherent, and then hardware cache coherence. 985 00:43:03,660 --> 00:43:08,450 So non-cache coherent kind of gives a view of memory as a 986 00:43:08,450 --> 00:43:10,260 single address space. 987 00:43:10,260 --> 00:43:13,020 But you have to deal with the fact that if you write something that has to get 988 00:43:13,020 --> 00:43:14,510 to me, you have to explicitly say -- 989 00:43:14,510 --> 00:43:17,580 "Now send it to that person." But we're still in a single 990 00:43:17,580 --> 00:43:19,380 address space. 991 00:43:19,380 --> 00:43:21,790 It doesn't give the full benefits of a 992 00:43:21,790 --> 00:43:22,600 shared memory machine. 993 00:43:22,600 --> 00:43:24,610 It's kind of in between this and distributed memory. 994 00:43:24,610 --> 00:43:26,100 In distributed memory basically everybody's in a 995 00:43:26,100 --> 00:43:27,830 different address space, so you have to do it 996 00:43:27,830 --> 00:43:28,760 by sending a message.
997 00:43:28,760 --> 00:43:30,550 Here, you just say I have to flush and send it 998 00:43:30,550 --> 00:43:31,800 to the other guy. 999 00:43:36,360 --> 00:43:39,080 Some of the early machines, as well as some big machines, 1000 00:43:39,080 --> 00:43:42,070 had no hardware cache coherence. 1001 00:43:42,070 --> 00:43:44,440 Things like supercomputers were built in this way because 1002 00:43:44,440 --> 00:43:45,980 it's very easy to build. 1003 00:43:45,980 --> 00:43:49,900 And the nice thing here is if you know your applications 1004 00:43:49,900 --> 00:43:54,280 well, if you are running good parallel large applications, 1005 00:43:54,280 --> 00:43:55,980 and you actually know what the communication 1006 00:43:55,980 --> 00:43:57,760 patterns are -- you can actually do it. 1007 00:43:57,760 --> 00:44:00,430 And you don't have to pay the hardware overhead to have this 1008 00:44:00,430 --> 00:44:02,470 nice hardware support in there. 1009 00:44:02,470 --> 00:44:07,230 However, a lot of small scale machines -- for example, most 1010 00:44:07,230 --> 00:44:12,360 people's workstations, probably now two 1011 00:44:12,360 --> 00:44:14,240 quad Pentium machines -- 1012 00:44:14,240 --> 00:44:15,430 are actually shared memory. 1013 00:44:15,430 --> 00:44:20,430 Because if you are just starting out, it's 1014 00:44:20,430 --> 00:44:21,540 much easier to do shared memory. 1015 00:44:21,540 --> 00:44:24,840 And also it's easier to build small shared memory machines. 1016 00:44:24,840 --> 00:44:32,480 And people talk about using a bus-based machine, and also 1017 00:44:32,480 --> 00:44:33,560 using a large scale 1018 00:44:33,560 --> 00:44:34,818 directory-based machine in here. 1019 00:44:38,170 --> 00:44:42,540 So for bus-based machines, how do you do shared memory? 1020 00:44:42,540 --> 00:44:46,880 So there's a protocol, what we call a snoopy cache protocol.
1021 00:44:46,880 --> 00:44:51,050 What that means is, every time you modify a location 1022 00:44:51,050 --> 00:44:54,120 somewhere -- so of course you have it in your cache -- 1023 00:44:54,120 --> 00:44:57,070 you tell everybody in the world who's using the bus, "I 1024 00:44:57,070 --> 00:45:03,460 modified that." And then if somebody else also has that 1025 00:45:03,460 --> 00:45:04,470 memory location, 1026 00:45:04,470 --> 00:45:06,390 that person says, "Oops, he modified it." Either he 1027 00:45:06,390 --> 00:45:09,160 invalidates it or gets the modified copy. 1028 00:45:09,160 --> 00:45:12,340 If you are using something new, you have to go and snoop. 1029 00:45:12,340 --> 00:45:15,040 And you can ask everybody and say -- "Wait a minute, does 1030 00:45:15,040 --> 00:45:19,160 anybody have a copy of this?" And some more complicated 1031 00:45:19,160 --> 00:45:22,680 protocols have states saying -- "I don't have any," or "I have a copy 1032 00:45:22,680 --> 00:45:24,540 but it's only read-only. 1033 00:45:24,540 --> 00:45:26,470 So I'm just reading it, I'm not modifying it." Then 1034 00:45:26,470 --> 00:45:28,940 multiple people can have the same copy, because everybody's 1035 00:45:28,940 --> 00:45:29,830 reading and it's OK. 1036 00:45:29,830 --> 00:45:31,840 And then there's the next thing -- "OK, I am actually 1037 00:45:31,840 --> 00:45:33,550 trying to modify this thing." And then only I 1038 00:45:33,550 --> 00:45:35,080 can have the copy. 1039 00:45:35,080 --> 00:45:37,830 So some data you can give to multiple people as a read 1040 00:45:37,830 --> 00:45:40,380 copy, and then when you are trying to write, everybody else gets 1041 00:45:40,380 --> 00:45:42,140 invalidated; only the person who is writing 1042 00:45:42,140 --> 00:45:43,090 has access to it.
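Those read-shared/write-exclusive rules are essentially the MSI invariant: many Shared copies, or exactly one Modified copy, never both. A toy simulation of a snoopy bus -- a sketch only, since real protocols also handle write-back to memory, the Exclusive state, and so on:

```python
M, S, I = "Modified", "Shared", "Invalid"

class SnoopyBus:
    def __init__(self, n_caches):
        # Each cache maps address -> state; a missing address means Invalid.
        self.caches = [dict() for _ in range(n_caches)]

    def read(self, who, addr):
        # Any writer elsewhere must supply the data and downgrade to Shared.
        for i, c in enumerate(self.caches):
            if i != who and c.get(addr) == M:
                c[addr] = S
        self.caches[who][addr] = S

    def write(self, who, addr):
        # Everyone else is invalidated before we take the exclusive copy.
        for i, c in enumerate(self.caches):
            if i != who and c.get(addr) in (M, S):
                c[addr] = I
        self.caches[who][addr] = M

    def holders(self, addr, state):
        return [i for i, c in enumerate(self.caches) if c.get(addr) == state]

bus = SnoopyBus(3)
bus.read(0, 0x40); bus.read(1, 0x40)   # two read-only copies are fine
bus.write(2, 0x40)                     # writing invalidates the readers
print(bus.holders(0x40, M), bus.holders(0x40, S))  # -> [2] []
```

Every `read` and `write` here is the "tell everybody on the bus" broadcast from the lecture, which is also exactly why a single bus stops scaling.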
1043 00:45:43,090 --> 00:45:45,315 And there are a lot of complicated protocols for how, if 1044 00:45:45,315 --> 00:45:46,870 you write it, and then somebody else wants to write 1045 00:45:46,870 --> 00:45:48,680 it, you get it to that person. 1046 00:45:48,680 --> 00:45:50,990 And of course you have to keep it consistent with memory. 1047 00:45:50,990 --> 00:45:53,420 So there is a lot of work in how to get these things all 1048 00:45:53,420 --> 00:45:55,720 working, but that's the kind of basic idea. 1049 00:45:59,300 --> 00:46:01,730 So directory-based machines are very different. 1050 00:46:01,730 --> 00:46:05,060 In directory-based machines mainly there's a 1051 00:46:05,060 --> 00:46:06,820 notion of a home node. 1052 00:46:06,820 --> 00:46:10,540 So everybody has local memory; you keep some part 1053 00:46:10,540 --> 00:46:10,820 of the memory. 1054 00:46:10,820 --> 00:46:12,720 And of course you have a cache also. 1055 00:46:12,720 --> 00:46:16,130 So you have a notion that this memory belongs to you. 1056 00:46:16,130 --> 00:46:18,470 And every time I want to do something with that memory I 1057 00:46:18,470 --> 00:46:19,390 have to ask you. 1058 00:46:19,390 --> 00:46:20,380 I have to get your permission. 1059 00:46:20,380 --> 00:46:22,560 "I want that memory, can you give it to me?" 1060 00:46:22,560 --> 00:46:24,610 And so there are two things. 1061 00:46:24,610 --> 00:46:26,670 That person has a directory [UNINTELLIGIBLE] say -- "OK, 1062 00:46:26,670 --> 00:46:28,150 this memory is in me. 1063 00:46:28,150 --> 00:46:31,480 I am the one who right now owns it, and I have the copy." 1064 00:46:31,480 --> 00:46:32,420 Or it will say -- 1065 00:46:32,420 --> 00:46:36,120 "I gave a copy of that memory to this other guy to write, 1066 00:46:36,120 --> 00:46:38,380 and here is that person's address or that machine's 1067 00:46:38,380 --> 00:46:41,650 name."
Or if multiple people have taken this copy and are 1068 00:46:41,650 --> 00:46:42,730 reading it. 1069 00:46:42,730 --> 00:46:45,240 So when somebody asks me for a copy -- 1070 00:46:45,240 --> 00:46:49,220 assume you ask to read this copy. 1071 00:46:49,220 --> 00:46:52,890 If I have given it to nobody to read, or if I have 1072 00:46:52,890 --> 00:46:54,410 given it to other people to read, I say -- 1073 00:46:54,410 --> 00:46:55,330 "OK, here's a copy. 1074 00:46:55,330 --> 00:46:58,610 Go read." And I note that that person is reading it, and I 1075 00:46:58,610 --> 00:47:00,190 keep that in my directory. 1076 00:47:00,190 --> 00:47:01,910 Or if somebody's writing it, 1077 00:47:01,910 --> 00:47:04,010 I say -- "Sorry, I can't give it to you to read because somebody's 1078 00:47:04,010 --> 00:47:05,750 writing it." So I can do two things. 1079 00:47:05,750 --> 00:47:07,750 I can tell that person, saying -- 1080 00:47:07,750 --> 00:47:11,350 "You have to get it from the person who's writing. 1081 00:47:11,350 --> 00:47:12,860 So go directly get it from there. 1082 00:47:12,860 --> 00:47:16,190 And I will mark that now you own it as a read copy." Or, I 1083 00:47:16,190 --> 00:47:17,630 can tell the person who's writing -- 1084 00:47:17,630 --> 00:47:19,400 "Look, you have to give up your write privilege. 1085 00:47:19,400 --> 00:47:21,990 If you have modified it, give me the data back." And that 1086 00:47:21,990 --> 00:47:23,950 person goes back to read or no 1087 00:47:23,950 --> 00:47:25,330 privileges on that data. 1088 00:47:25,330 --> 00:47:26,860 When I get that data, I'll send it back to this 1089 00:47:26,860 --> 00:47:27,240 person and say -- 1090 00:47:27,240 --> 00:47:29,600 "Here, you can read." And the same thing if you ask for 1091 00:47:29,600 --> 00:47:30,690 write permission.
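The home node's bookkeeping just described -- a sharer list plus at most one owner per address -- can be sketched as follows. This is simplified (no write-backs, forwarding, or acknowledgment counting), and the class names are made up for illustration; the write-permission case the lecture turns to next follows the same pattern.

```python
class DirEntry:
    def __init__(self):
        self.sharers = set()  # nodes holding read-only copies
        self.owner = None     # the single node with write permission, if any

class HomeNode:
    """Tracks who may read or write each address that lives here."""
    def __init__(self):
        self.directory = {}

    def request_read(self, node, addr):
        e = self.directory.setdefault(addr, DirEntry())
        if e.owner is not None:
            # Writer gives up its privilege and drops to a read copy.
            e.sharers.add(e.owner)
            e.owner = None
        e.sharers.add(node)

    def request_write(self, node, addr):
        e = self.directory.setdefault(addr, DirEntry())
        # Invalidate every other copy, then grant exclusive ownership.
        e.sharers.clear()
        e.owner = node

home = HomeNode()
home.request_read(1, 0x80); home.request_read(2, 0x80)
home.request_write(3, 0x80)
e = home.directory[0x80]
print(e.owner, sorted(e.sharers))  # -> 3 []
```

Because each home node only tracks its own addresses, there is no broadcast anywhere, which is what lets this scheme scale where the bus cannot.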
1092 00:47:30,690 --> 00:47:33,090 If anybody has [UNINTELLIGIBLE] 1093 00:47:33,090 --> 00:47:34,010 I have to tell everybody -- 1094 00:47:34,010 --> 00:47:35,250 "Now you can't read it anymore. 1095 00:47:35,250 --> 00:47:37,760 Go invalidate, because somebody's about to write." 1096 00:47:37,760 --> 00:47:39,825 I get the invalidate acknowledgments coming back, and then when 1097 00:47:39,825 --> 00:47:42,250 that's done I say, "OK, you can write that." So 1098 00:47:42,250 --> 00:47:45,000 everybody keeps part of the memory, and the 1099 00:47:45,000 --> 00:47:45,720 directory state for all of that in there. 1100 00:47:45,720 --> 00:47:48,762 So because of that you can really scale this thing. 1101 00:47:52,860 --> 00:47:54,700 So let's look at a bus-based machine. 1102 00:47:54,700 --> 00:47:55,930 This is kind of the way it looks. 1103 00:47:55,930 --> 00:47:59,410 You have a cache in here, a microprocessor, central 1104 00:47:59,410 --> 00:48:01,120 memory, and you have a bus in here. 1105 00:48:01,120 --> 00:48:04,560 And a lot of small machines, including most people's 1106 00:48:04,560 --> 00:48:06,770 desktops, basically fit in this category. 1107 00:48:06,770 --> 00:48:09,040 And you have a snoopy bus in here. 1108 00:48:09,040 --> 00:48:10,200 So a little bit of a bigger machine, 1109 00:48:10,200 --> 00:48:12,730 something like a Sun Starfire. 1110 00:48:12,730 --> 00:48:17,230 Basically it had four processors on the board, four 1111 00:48:17,230 --> 00:48:20,250 caches, and had an interconnect that actually has 1112 00:48:20,250 --> 00:48:21,560 multiple buses going. 1113 00:48:21,560 --> 00:48:23,450 So it can actually get a little bit of scalability, 1114 00:48:23,450 --> 00:48:24,290 because here's the bottleneck. 1115 00:48:24,290 --> 00:48:25,780 The bus becomes the bottleneck. 1116 00:48:25,780 --> 00:48:27,400 Everybody has to go through the bus.
1117 00:48:27,400 --> 00:48:29,570 And so you actually get multiple buses to relieve the 1118 00:48:29,570 --> 00:48:32,810 bottleneck, and it actually had some distributed memory 1119 00:48:32,810 --> 00:48:35,160 going through a crossbar here. 1120 00:48:35,160 --> 00:48:36,583 So the cache coherence protocol has 1121 00:48:36,583 --> 00:48:38,400 to deal with that. 1122 00:48:38,400 --> 00:48:41,100 And going to the other extreme, 1123 00:48:41,100 --> 00:48:43,310 something like the SGI Origin. 1124 00:48:46,930 --> 00:48:50,170 In this machine there are two processors, and it had 1125 00:48:50,170 --> 00:48:52,090 actually a little bit of processing and a lot of memory 1126 00:48:52,090 --> 00:48:52,830 dealing with the directory. 1127 00:48:52,830 --> 00:48:55,040 So you keep the data, and you actually keep all the 1128 00:48:55,040 --> 00:48:56,550 directory information in there -- 1129 00:48:56,550 --> 00:48:57,070 in this. 1130 00:48:57,070 --> 00:48:58,850 And then it goes -- 1131 00:48:58,850 --> 00:49:02,740 then after that it almost uses a normal message passing type 1132 00:49:02,740 --> 00:49:05,420 network to communicate with that. 1133 00:49:05,420 --> 00:49:07,520 And they used the CrayLink interconnect, so we can 1134 00:49:07,520 --> 00:49:09,660 have a very large machine built out of that. 1135 00:49:12,720 --> 00:49:14,450 So now let's switch to multicore processors. 1136 00:49:18,200 --> 00:49:21,930 If you look at the way we have been dealing with VLSI, every 1137 00:49:21,930 --> 00:49:24,920 generation we are getting more and more transistors. 1138 00:49:24,920 --> 00:49:27,470 So at the beginning when you had enough transistors to 1139 00:49:27,470 --> 00:49:29,860 deal with, people actually started dealing with bit-level 1140 00:49:29,860 --> 00:49:30,960 parallelism. 1141 00:49:30,960 --> 00:49:35,270 So you can do 16-bit, 32-bit machines.
1142 00:49:35,270 --> 00:49:36,990 You can do wider machines, because you have enough 1143 00:49:36,990 --> 00:49:37,850 transistors. 1144 00:49:37,850 --> 00:49:39,610 Because at the beginning you had like 8-bit processors, 1145 00:49:39,610 --> 00:49:41,110 then 16-bit, 32-bit. 1146 00:49:41,110 --> 00:49:43,790 And then at some point, when I have still more transistors, I 1147 00:49:43,790 --> 00:49:47,660 start doing instruction-level parallelism in a die. 1148 00:49:47,660 --> 00:49:50,080 So even with bit-level parallelism, in order to get 1149 00:49:50,080 --> 00:49:53,830 64 bits you actually had to have multiple chips. 1150 00:49:53,830 --> 00:49:57,135 So in this regime in order to get parallelism, you need to 1151 00:49:57,135 --> 00:49:58,150 have multiple processors -- 1152 00:49:58,150 --> 00:49:59,370 multiprocessors. 1153 00:49:59,370 --> 00:50:02,860 So in the good old days you actually built a processor, 1154 00:50:02,860 --> 00:50:03,950 things like a minicomputer. 1155 00:50:03,950 --> 00:50:06,620 Basically you had one chip dealing 1156 00:50:06,620 --> 00:50:07,380 with a 1-bit slice. 1157 00:50:07,380 --> 00:50:10,700 Or a 4-bit slice -- dealing with that amount, you could 1158 00:50:10,700 --> 00:50:12,230 fit in a chip. 1159 00:50:12,230 --> 00:50:14,550 And a multichip made a single processor. 1160 00:50:14,550 --> 00:50:17,870 Here a multichip made a multiprocessor. 1161 00:50:17,870 --> 00:50:20,510 We are hitting a regime where a multichip -- 1162 00:50:20,510 --> 00:50:22,870 what would be a multiprocessor -- now fits in 1163 00:50:22,870 --> 00:50:26,030 one piece of silicon, because you have more transistors. 1164 00:50:26,030 --> 00:50:29,560 So we are going into a time where multicore is basically 1165 00:50:29,560 --> 00:50:31,630 multiple processors on a die -- 1166 00:50:31,630 --> 00:50:33,790 on a chip. 1167 00:50:33,790 --> 00:50:35,140 So I showed this slide.
1168 00:50:35,140 --> 00:50:39,650 We are getting there, and it's getting pretty fast. You had 1169 00:50:39,650 --> 00:50:41,450 something like this, and suddenly we accelerated. 1170 00:50:41,450 --> 00:50:46,530 We added more and more cores on a die. 1171 00:50:46,530 --> 00:50:50,000 So I categorized multicores also the way I categorized 1172 00:50:50,000 --> 00:50:51,020 them previously. 1173 00:50:51,020 --> 00:50:54,850 There are shared memory multicores. 1174 00:50:54,850 --> 00:50:56,180 Here are some examples. 1175 00:50:56,180 --> 00:50:59,100 Then there are shared network multicores. 1176 00:50:59,100 --> 00:51:01,930 The Cell processor is one, and at MIT we were 1177 00:51:01,930 --> 00:51:04,440 also building the Raw processor. 1178 00:51:04,440 --> 00:51:07,700 And there is another part, what they call crippled or 1179 00:51:07,700 --> 00:51:08,550 mini-cores. 1180 00:51:08,550 --> 00:51:15,000 So the reason in this graph you can have 512 is because 1181 00:51:15,000 --> 00:51:17,130 it's not Pentium-sized things sitting in there. 1182 00:51:17,130 --> 00:51:20,940 You are putting in very simple small cores, and a 1183 00:51:20,940 --> 00:51:21,940 huge amount of them. 1184 00:51:21,940 --> 00:51:24,890 So for some class of applications, that's also useful. 1185 00:51:24,890 --> 00:51:29,120 So if you look at shared memory multicores, basically 1186 00:51:29,120 --> 00:51:32,730 this is an evolution path for current processors. 1187 00:51:32,730 --> 00:51:35,890 So if you look at it, what they did was they took their 1188 00:51:35,890 --> 00:51:38,160 years' worth and billions of dollars' worth of 1189 00:51:38,160 --> 00:51:42,880 engineering building a single superscalar processor. 1190 00:51:42,880 --> 00:51:45,456 Then they slapped a few of them on the same die, and said 1191 00:51:45,456 --> 00:51:48,390 -- "Hey, we've got a multicore." And of course they 1192 00:51:48,390 --> 00:51:54,450 were always doing shared memory at the network level.
1193 00:51:54,450 --> 00:51:56,220 They said -- "OK, I'll put the shared memory bus also into 1194 00:51:56,220 --> 00:51:58,340 the same die, and I've got a multicore." So this is 1195 00:51:58,340 --> 00:52:00,440 basically what all these things are all about. 1196 00:52:00,440 --> 00:52:03,170 So this is kind of gluing these things together; it's a 1197 00:52:03,170 --> 00:52:04,240 first generation. 1198 00:52:04,240 --> 00:52:07,740 However, you didn't build a core completely from scratch. 1199 00:52:07,740 --> 00:52:11,330 You just kind of integrated what we had in multiple chips 1200 00:52:11,330 --> 00:52:15,880 into one chip, and basically got that. 1201 00:52:15,880 --> 00:52:19,640 So to go a little bit beyond, I think you can do better. 1202 00:52:19,640 --> 00:52:24,260 So for example, this AMD multicore. 1203 00:52:24,260 --> 00:52:31,240 Basically you have CPUs in there that actually have a full 1204 00:52:31,240 --> 00:52:34,400 snoopy controller in there, and can have some other 1205 00:52:34,400 --> 00:52:35,280 interfaces with that. 1206 00:52:35,280 --> 00:52:38,900 So you can actually start building the uni 1207 00:52:38,900 --> 00:52:41,440 CPU knowing that you're building a multicore. 1208 00:52:41,440 --> 00:52:43,745 Instead of saying, "I had this thing on my shelf, I'm going 1209 00:52:43,745 --> 00:52:45,480 to plop it here, and then kind of [INAUDIBLE] 1210 00:52:45,480 --> 00:52:46,950 And you'll see, I think, a lot of 1211 00:52:46,950 --> 00:52:48,100 interesting things happening. 1212 00:52:48,100 --> 00:52:52,310 Because now as they're connected closely in the same 1213 00:52:52,310 --> 00:52:56,170 die, you can do more things than what you could do in a 1214 00:52:56,170 --> 00:52:57,000 multiprocessor. 1215 00:52:57,000 --> 00:52:59,300 So in the last lecture we talked a little bit about what 1216 00:52:59,300 --> 00:53:01,530 the future could be in this kind of regime. 1217 00:53:10,040 --> 00:53:11,290 Come on.
1218 00:53:13,930 --> 00:53:14,500 OK. 1219 00:53:14,500 --> 00:53:18,560 So one thing we have been doing at MIT -- the 1220 00:53:18,560 --> 00:53:23,190 project has now ended; we started about eight years ago -- is to 1221 00:53:23,190 --> 00:53:28,050 figure out, when you have all this silicon, how you can build 1222 00:53:28,050 --> 00:53:30,460 a multicore if you start from scratch. 1223 00:53:30,460 --> 00:53:33,120 So we built this Raw processor where -- 1224 00:53:33,120 --> 00:53:37,100 we have 16 of these small cores, identical ones, in here. 1225 00:53:37,100 --> 00:53:40,260 And the interesting thing is what we said was, we have all 1226 00:53:40,260 --> 00:53:41,500 this bandwidth. 1227 00:53:41,500 --> 00:53:44,060 It's not just going from pins to memory; we have all this 1228 00:53:44,060 --> 00:53:45,580 bandwidth sitting next to each other. 1229 00:53:45,580 --> 00:53:48,990 So can we really take advantage of that to do a lot 1230 00:53:48,990 --> 00:53:50,240 of communication? 1231 00:53:50,240 --> 00:53:52,300 And also the other thing is that to build something like a 1232 00:53:52,300 --> 00:53:54,850 bus, you need a lot of long wires. 1233 00:53:54,850 --> 00:53:56,940 And it's really hard to build long wires. 1234 00:53:56,940 --> 00:54:00,770 So in the Raw processor, a 1235 00:54:00,770 --> 00:54:05,430 large part of each tile goes into these eight 32-bit buses. 1236 00:54:05,430 --> 00:54:06,940 So you have a huge amount of communication 1237 00:54:06,940 --> 00:54:07,950 next to each other. 1238 00:54:07,950 --> 00:54:10,320 And we don't have any kind of global memory, because that 1239 00:54:10,320 --> 00:54:12,400 requires either doing a directory, which we didn't 1240 00:54:12,400 --> 00:54:15,750 want to build, or having a bus, which would require long wires. 1241 00:54:15,750 --> 00:54:19,570 So we did it in a way that there are no wires longer than 1242 00:54:19,570 --> 00:54:22,830 one of the cores.
1243 00:54:22,830 --> 00:54:25,980 So we can do short wires, but we came up with a lot of 1244 00:54:25,980 --> 00:54:29,380 communication so each of these -- what we called tiles 1245 00:54:29,380 --> 00:54:32,170 in those days -- is very tightly coupled. 1246 00:54:32,170 --> 00:54:35,730 So this is kind of a direction where people perhaps might go, 1247 00:54:35,730 --> 00:54:39,580 because now we have all this bandwidth in here. 1248 00:54:39,580 --> 00:54:41,260 And how would you take advantage of that bandwidth? 1249 00:54:41,260 --> 00:54:43,720 So this is a different way of looking at that. 1250 00:54:43,720 --> 00:54:47,970 And in some sense the Cell fits somewhere in this regime. 1251 00:54:47,970 --> 00:54:51,070 Because what Cell did was it said, "I'm 1252 00:54:51,070 --> 00:54:52,300 not building a bus, I am actually 1253 00:54:52,300 --> 00:54:53,750 building a ring network. 1254 00:54:53,750 --> 00:54:57,000 I'm keeping distributed memory, and giving Cell a 1255 00:54:57,000 --> 00:54:58,910 ring." I'm not going to go through Cell, because actually 1256 00:54:58,910 --> 00:55:03,457 you had a full lecture the day before yesterday on this. 1257 00:55:03,457 --> 00:55:04,888 AUDIENCE: Saman, can I ask you a question? 1258 00:55:04,888 --> 00:55:07,325 Is there a conclusion that I should be reaching when I 1259 00:55:07,325 --> 00:55:09,405 look at the multicores you can buy today, which are still by and 1260 00:55:09,405 --> 00:55:11,085 large two and four processors? 1261 00:55:11,085 --> 00:55:12,280 There are people that have done more. 1262 00:55:12,280 --> 00:55:15,480 The Verano has 16 and the Dell has 8. 1263 00:55:15,480 --> 00:55:19,530 And the conclusion that I want to reach is that as an 1264 00:55:19,530 --> 00:55:21,635 engineering tradeoff, if you throw away the shared memory 1265 00:55:21,635 --> 00:55:23,070 you can add processors. 1266 00:55:23,070 --> 00:55:24,120 Is that a straightforward tradeoff?
1267 00:55:24,120 --> 00:55:26,140 PROFESSOR: I don't think it's the shared memory. 1268 00:55:26,140 --> 00:55:29,600 You can still have things like directory-based 1269 00:55:29,600 --> 00:55:32,200 cache coherence. 1270 00:55:32,200 --> 00:55:34,940 What's missing right now is -- what people have done is just 1271 00:55:34,940 --> 00:55:37,570 basically took parts off their shelves, and kind of put them 1272 00:55:37,570 --> 00:55:39,230 into the chip. 1273 00:55:39,230 --> 00:55:43,830 If you look at it, if you put two chips next to each other 1274 00:55:43,830 --> 00:55:46,370 on a board, there's a certain amount of communication 1275 00:55:46,370 --> 00:55:48,020 bandwidth going here. 1276 00:55:48,020 --> 00:55:51,640 And if you put those things onto the same die, there's 1277 00:55:51,640 --> 00:55:55,430 about five orders of magnitude more ability to communicate. 1278 00:55:55,430 --> 00:55:58,080 We haven't figured out how to take advantage of that. 1279 00:55:58,080 --> 00:56:00,770 In some sense, we can almost say I want to copy the entire 1280 00:56:00,770 --> 00:56:04,180 cache from this machine to another machine in a cycle. 1281 00:56:04,180 --> 00:56:06,440 I don't think you even would want to do that, but you can 1282 00:56:06,440 --> 00:56:09,280 have that level of huge amount of communication. 1283 00:56:09,280 --> 00:56:11,530 We are still kind of doing this evolutionary path in 1284 00:56:11,530 --> 00:56:15,600 there [UNINTELLIGIBLE] but I don't think we know what cool 1285 00:56:15,600 --> 00:56:16,660 things we can do with that. 1286 00:56:16,660 --> 00:56:19,050 There's a lot of opportunity in that in some sense. 1287 00:56:19,050 --> 00:56:20,760 AUDIENCE: [INAUDIBLE] 1288 00:56:20,760 --> 00:56:23,240 PROFESSOR: Yeah, because the interesting thing is -- 1289 00:56:23,240 --> 00:56:26,920 the way I would say it is, in the good old days 1290 00:56:26,920 --> 00:56:29,190 parallelization sometimes was a scary prospect.
1291 00:56:29,190 --> 00:56:31,510 Because the minute you distribute data, if you don't 1292 00:56:31,510 --> 00:56:35,610 do it right it's a lot slower than sequential execution. 1293 00:56:35,610 --> 00:56:39,100 Because your access time becomes so large, and you're 1294 00:56:39,100 --> 00:56:40,540 basically dead in the water. 1295 00:56:40,540 --> 00:56:42,610 In this kind of machine you don't have to. 1296 00:56:42,610 --> 00:56:44,950 There's so much bandwidth in here. 1297 00:56:44,950 --> 00:56:47,130 Latency was still -- latency would be better than going to 1298 00:56:47,130 --> 00:56:49,800 the outside memory. 1299 00:56:49,800 --> 00:56:51,610 And we don't know how to take advantage of 1300 00:56:51,610 --> 00:56:53,040 that bandwidth yet. 1301 00:56:53,040 --> 00:56:57,310 And my feeling is as we go about trying to rebuild 1302 00:56:57,310 --> 00:57:02,440 multicore processors from scratch, we'll try to figure out 1303 00:57:02,440 --> 00:57:03,060 different ways. 1304 00:57:03,060 --> 00:57:10,510 So for example, people are coming up with much richer 1305 00:57:10,510 --> 00:57:14,860 semantics for speculation and stuff like that, and we can 1306 00:57:14,860 --> 00:57:16,580 take advantage of that. 1307 00:57:16,580 --> 00:57:20,980 So I think there's a lot of interesting hardware, 1308 00:57:20,980 --> 00:57:24,910 microprocessor, and then kind of programming research now. 1309 00:57:24,910 --> 00:57:27,770 Because I don't think anybody had anything in there saying, 1310 00:57:27,770 --> 00:57:30,130 "Here's how we would use all of this bandwidth." I 1311 00:57:30,130 --> 00:57:31,810 think that'll happen. 1312 00:57:31,810 --> 00:57:35,480 Now the next [? thing ?] is these mini-cores. 1313 00:57:35,480 --> 00:57:38,070 So for example, this PicoChip has an array of 1314 00:57:38,070 --> 00:57:39,720 322 processing elements. 1315 00:57:39,720 --> 00:57:43,010 They are 16-bit RISC, so it's not even 32-bit.
1316 00:57:43,010 --> 00:57:44,950 Piddling little things, 3-way issue. 1317 00:57:44,950 --> 00:57:48,980 And they had like 240 standard -- 1318 00:57:48,980 --> 00:57:50,370 basically, nothing more than just a 1319 00:57:50,370 --> 00:57:52,850 multiplier and an adder in there. 1320 00:57:52,850 --> 00:57:56,880 64 memory tiles, full control, and some 14 special 1321 00:57:56,880 --> 00:57:58,480 [UNINTELLIGIBLE] function accelerators. 1322 00:57:58,480 --> 00:58:03,240 So this is kind of what people call heterogeneous systems. 1323 00:58:03,240 --> 00:58:05,505 The idea is -- you have all these cores, so why do you 1324 00:58:05,505 --> 00:58:07,160 make everything the same? 1325 00:58:07,160 --> 00:58:09,450 I can make something that's good at doing graphics, something 1326 00:58:09,450 --> 00:58:11,110 that's good at doing networking. 1327 00:58:11,110 --> 00:58:13,540 So I can kind of customize these things. 1328 00:58:13,540 --> 00:58:15,350 Because what we have in excess is silicon. 1329 00:58:15,350 --> 00:58:17,080 We don't have power in excess. 1330 00:58:17,080 --> 00:58:21,250 So in the future you can't assume everything is working 1331 00:58:21,250 --> 00:58:22,600 all the time, because that will still 1332 00:58:22,600 --> 00:58:24,310 create too much heat. 1333 00:58:24,310 --> 00:58:27,710 So you kind of say -- for the best efficiency, for each type of 1334 00:58:27,710 --> 00:58:30,170 computation you have a few special-purpose units. 1335 00:58:30,170 --> 00:58:34,680 So we kind of say if I'm doing graphics, I go to my graphics 1336 00:58:34,680 --> 00:58:35,500 optimized unit. 1337 00:58:35,500 --> 00:58:36,190 So I will do that. 1338 00:58:36,190 --> 00:58:38,570 And the minute I want to do a little bit of arithmetic I'll 1339 00:58:38,570 --> 00:58:39,620 switch to that. 1340 00:58:39,620 --> 00:58:43,190 And sometimes I am doing TCP, I'll switch to my TCP offload. 1341 00:58:43,190 --> 00:58:43,770 Stuff like that.
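[The dispatch pattern the professor describes -- send each kind of work to the unit that is best at it -- can be sketched in C with a function-pointer table. The task kinds and handler bodies below are invented stand-ins for illustration, not any real chip's API.]

```c
/* Hypothetical task kinds a heterogeneous chip might accelerate;
 * names and handlers are invented for illustration only. */
typedef enum { TASK_GRAPHICS, TASK_ARITHMETIC, TASK_TCP, TASK_KINDS } task_kind;

/* Each specialized unit is modeled as a handler function. */
typedef int (*unit_fn)(int payload);

static int graphics_unit(int p)    { return p * 2; }  /* stand-in for graphics work */
static int arithmetic_unit(int p)  { return p + 1; }  /* stand-in for ALU work      */
static int tcp_offload_unit(int p) { return p ^ 1; }  /* stand-in for TCP offload   */

/* Dispatch table: the "switch to that unit" step from the lecture. */
static unit_fn units[TASK_KINDS] = {
    graphics_unit, arithmetic_unit, tcp_offload_unit
};

static int run_task(task_kind k, int payload) {
    return units[k](payload);
}
```

[In a real heterogeneous system the "handlers" would be drivers for physical units, and the hard part -- as noted below -- is knowing the mix of task kinds your workload actually contains.]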
1342 00:58:43,770 --> 00:58:46,040 Can you do some kind of mix in there? 1343 00:58:46,040 --> 00:58:48,880 The problem there is you need to understand what the mix is. 1344 00:58:48,880 --> 00:58:50,600 So we need to have a good understanding of 1345 00:58:50,600 --> 00:58:51,880 what that mix is. 1346 00:58:51,880 --> 00:58:54,360 The advantage is it will be a lot more memory efficient. 1347 00:58:54,360 --> 00:58:56,930 So this is kind of going in that direction. 1348 00:58:56,930 --> 00:59:00,550 And so in some sense, if you want to communicate you have 1349 00:59:00,550 --> 00:59:03,280 these special communication elements. 1350 00:59:03,280 --> 00:59:04,280 You have to go through that. 1351 00:59:04,280 --> 00:59:06,540 And the processor can do some work, and there are some 1352 00:59:06,540 --> 00:59:07,340 memory elements. 1353 00:59:07,340 --> 00:59:08,630 So on and so forth. 1354 00:59:08,630 --> 00:59:11,950 So that's one push -- people are pushing more for embedded, very 1355 00:59:11,950 --> 00:59:13,120 low power designs. 1356 00:59:13,120 --> 00:59:15,770 AUDIENCE: Is this starting to look more and more like an FPGA, 1357 00:59:15,770 --> 00:59:16,830 which is [UNINTELLIGIBLE] 1358 00:59:16,830 --> 00:59:20,660 PROFESSOR: Yeah, it's kind of a combination. 1359 00:59:20,660 --> 00:59:25,300 Because the thing about an FPGA is, it's all done at the 1-bit level. 1360 00:59:25,300 --> 00:59:27,950 It doesn't make sense to do any arithmetic that way. 1361 00:59:27,950 --> 00:59:30,550 So this is saying -- "OK, instead of 1 bit I am doing 16 1362 00:59:30,550 --> 00:59:34,660 bits." Because then I can very efficiently build 1363 00:59:34,660 --> 00:59:35,760 [UNINTELLIGIBLE] 1364 00:59:35,760 --> 00:59:36,960 Because I don't have to build [UNINTELLIGIBLE] 1365 00:59:36,960 --> 00:59:38,890 from scratch. 1366 00:59:38,890 --> 00:59:42,140 So I think that an interesting convergence is happening.
1367 00:59:42,140 --> 00:59:45,930 Because what happened, I think, for a long time was 1368 00:59:45,930 --> 00:59:47,860 things like architecture and programming languages, and 1369 00:59:47,860 --> 00:59:50,220 stuff like that, kind of got stuck in a rut. 1370 00:59:50,220 --> 00:59:52,320 Because things there are so efficient and 1371 00:59:52,320 --> 00:59:56,270 incremental -- it's like doing research on airplanes. 1372 00:59:56,270 --> 00:59:58,760 Things are so efficient, so complex. 1373 00:59:58,760 --> 01:00:05,020 Here AeroAstro can't build an airplane, because it's a $9 1374 01:00:05,020 --> 01:00:10,000 billion job to build a good airplane in there. 1375 01:00:10,000 --> 01:00:11,380 And it became like that. 1376 01:00:11,380 --> 01:00:13,350 Universities could not build it because if you want to 1377 01:00:13,350 --> 01:00:16,610 build a superscalar it's, again, a $9 billion type 1378 01:00:16,610 --> 01:00:19,130 endeavor to do that -- thousands of people, very, 1379 01:00:19,130 --> 01:00:20,020 very customized. 1380 01:00:20,020 --> 01:00:22,670 But now it's kind of hitting the end of the road. 1381 01:00:22,670 --> 01:00:24,562 Everybody's going back and saying -- "Jeez, what's the 1382 01:00:24,562 --> 01:00:26,090 new thing?" And I think there's a lot of opportunity 1383 01:00:26,090 --> 01:00:29,270 to kind of figure out whether there's some radically different thing 1384 01:00:29,270 --> 01:00:30,340 you can do. 1385 01:00:30,340 --> 01:00:33,640 So this is what I have for my first lecture. 1386 01:00:33,640 --> 01:00:35,130 Some conclusions basically. 1387 01:00:35,130 --> 01:00:38,530 I think for a lot of people who are programmers, there was 1388 01:00:38,530 --> 01:00:42,210 a time that you never cared about what's under the hood. 1389 01:00:42,210 --> 01:00:44,200 You knew it was going to go fast, and the next 1390 01:00:44,200 --> 01:00:45,290 year it would go faster.
1391 01:00:45,290 --> 01:00:47,420 I think that's kind of coming to an end. 1392 01:00:47,420 --> 01:00:49,480 And there are a lot of variations and choices in 1393 01:00:49,480 --> 01:00:51,900 hardware, and I think software people should understand and 1394 01:00:51,900 --> 01:00:54,970 know what they can choose in here. 1395 01:00:54,970 --> 01:00:57,630 And many have performance implications. 1396 01:00:57,630 --> 01:01:01,710 And if you know these things you will be able to get 1397 01:01:01,710 --> 01:01:03,070 high-performance software built easily. 1398 01:01:03,070 --> 01:01:05,570 You can't do high-performance software without knowing what 1399 01:01:05,570 --> 01:01:07,190 it's running on. 1400 01:01:07,190 --> 01:01:09,860 However, there's a note of caution. 1401 01:01:09,860 --> 01:01:13,550 If you become too attached to your hardware, we 1402 01:01:13,550 --> 01:01:16,270 go back to the old days of assembly language programming. 1403 01:01:16,270 --> 01:01:19,910 So you say -- "I got every bit of performance out of it." Now 1404 01:01:19,910 --> 01:01:24,090 the Cell says you have seven SPEs. So in two years, they 1405 01:01:24,090 --> 01:01:25,290 come out with 16 SPEs. 1406 01:01:25,290 --> 01:01:26,080 And what's going to happen? 1407 01:01:26,080 --> 01:01:28,920 Your thing is still working on seven SPEs very well, but it 1408 01:01:28,920 --> 01:01:31,020 might not work well on 16 SPEs, even with that. 1409 01:01:31,020 --> 01:01:33,700 And of course, you really customized for Cell too. 1410 01:01:33,700 --> 01:01:36,780 And I guarantee it will not run well on the Intel -- 1411 01:01:36,780 --> 01:01:39,670 probably a quad-core Xeon processor -- because it will be doing 1412 01:01:39,670 --> 01:01:41,040 something very different. 1413 01:01:41,040 --> 01:01:44,950 And so there's this tension that's coming back again.
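[One concrete way to avoid hardcoding a count like "seven SPEs" is to ask the operating system at runtime how much parallelism is available. A minimal sketch, assuming a Unix-like system with the widely supported `_SC_NPROCESSORS_ONLN` extension (glibc, BSD, macOS -- it is not strict POSIX):]

```c
#include <unistd.h>

/* Query how many processors are online instead of baking in a
 * fixed count; size thread pools or work partitions from this. */
static long worker_count(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n > 0) ? n : 1;  /* fall back to one worker on error */
}
```

[Code written this way at least scales its parallelism when the core count grows, though as the lecture notes, tuning for one machine's memory system may still not carry over to another.]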
1414 01:01:44,950 --> 01:01:48,540 How do you do something that is general, portable, malleable, 1415 01:01:48,540 --> 01:01:52,255 and at the same time get good performance with the hardware 1416 01:01:52,255 --> 01:01:52,770 being exposed? 1417 01:01:52,770 --> 01:01:54,020 I don't think there's an answer for that. 1418 01:01:54,020 --> 01:01:55,870 And in this class we are going to go to one extreme. 1419 01:01:55,870 --> 01:01:58,710 We are going to go low level and really understand the 1420 01:01:58,710 --> 01:02:01,540 hardware, and take advantage of that. 1421 01:02:01,540 --> 01:02:04,340 But at some point we probably have to come out of that and 1422 01:02:04,340 --> 01:02:06,420 figure out how to be, again, high level. 1423 01:02:06,420 --> 01:02:09,137 And I think these are open questions. 1424 01:02:09,137 --> 01:02:10,965 AUDIENCE: Do you have any thoughts, and this may be 1425 01:02:10,965 --> 01:02:15,620 unanswerable, but how could Cell really [INAUDIBLE]. 1426 01:02:15,620 --> 01:02:18,970 And not just Cell, but some of these other ones that are out 1427 01:02:18,970 --> 01:02:22,870 there today, given how hard they are to program. 1428 01:02:22,870 --> 01:02:25,200 PROFESSOR: So I have this talk that I'm 1429 01:02:25,200 --> 01:02:25,860 giving all over the place. 1430 01:02:25,860 --> 01:02:28,320 I said the third software crisis is due 1431 01:02:28,320 --> 01:02:30,340 to the multicore menace. 1432 01:02:30,340 --> 01:02:35,090 I termed it a menace, because it will create this situation where 1433 01:02:35,090 --> 01:02:36,000 people will have to change. 1434 01:02:36,000 --> 01:02:38,410 Something has to change, something has to give. 1435 01:02:38,410 --> 01:02:40,300 I don't know who's going to give. 1436 01:02:40,300 --> 01:02:42,560 Either people will say -- "This is too complicated, I am 1437 01:02:42,560 --> 01:02:44,050 happy with the current performance.
1438 01:02:44,050 --> 01:02:46,550 I will live for the next 20 years at today's level of 1439 01:02:46,550 --> 01:02:51,070 performance." I doubt that will happen. 1440 01:02:51,070 --> 01:02:53,290 The other end is saying -- "Jeez, you know, I am going to 1441 01:02:53,290 --> 01:02:56,410 learn parallel programming, and I will deal with locks and 1442 01:02:56,410 --> 01:02:58,060 semaphores, and all those things. 1443 01:02:58,060 --> 01:03:00,080 And I am going to jump in there." That's not going to 1444 01:03:00,080 --> 01:03:01,040 happen either. 1445 01:03:01,040 --> 01:03:02,790 So there has to be something in the middle. 1446 01:03:02,790 --> 01:03:04,380 And the neat thing is, I don't think anybody 1447 01:03:04,380 --> 01:03:07,650 knows what it is. 1448 01:03:07,650 --> 01:03:12,120 For people in industry, it's terrifying, because they 1449 01:03:12,120 --> 01:03:13,190 have no idea what's happening. 1450 01:03:13,190 --> 01:03:14,360 But in a university, it's a fun time. 1451 01:03:14,360 --> 01:03:17,220 [LAUGHTER] 1452 01:03:17,220 --> 01:03:18,650 AUDIENCE: Good question. 1453 01:03:18,650 --> 01:03:18,890 PROFESSOR: OK. 1454 01:03:18,890 --> 01:03:21,850 So we'll take about a five-minute break, and switch 1455 01:03:21,850 --> 01:03:24,490 gears into concurrent programming.