The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR RABBAH: OK, so today, in the last lecture, we're going to talk about the Raw architecture. This is a processor that was built here at MIT and essentially trailblazed a lot of the research in parallel architectures for multicores, compilation for multicores, programming languages, and so on. So you've heard some things about Raw and the parallelizing technology in terms of StreamIt. We're going to cover some of that again here today, just briefly, and give you a little bit more insight into what went into the design of the Raw architecture.

So these are Raw chips; they were delivered in October of 2002. Each one of these has 16 processors on it. I'm going to show you a diagram on the next slide. It's really a tiled microprocessor.
We'll get into what that means, and what a tiled microprocessor gives you that makes it an attractive design point in the architecture space. Each of the Raw tiles-- you can sort of see the outline here of what replicates-- is four millimeters square. It's a single-issue 8-stage pipeline. It has local memory, so there's a 32K cache. And the unique aspect of the Raw processor is that it has a lot of on-chip networks that you can use to orchestrate communication between processors. So there are two operand networks-- I'm going to get into what that means and what they're used for, but these essentially allow you to do point-to-point communication between tiles with very low latency. And then there's a network that essentially allows you to handle cache misses and input and output, and one for message passing-- a more dynamic style of messaging, something similar to what you're accustomed to on Cell, for example, with DMA transfers. This was built in 180 nanometer ASIC technology by IBM. It's got 100 million transistors.
It was designed here by MIT grad students. It's got something like a million gates on it. Three to four years of development time. And what was really interesting here is that, because of the tiled nature of the architecture, you could just design one tile, and once you have one tile, you essentially just plop down more and more and more of them. So you have one, and you scale it out to 16 tiles. And the design came back without any bugs when the first chip was delivered. The core frequency was expected to be-- I think lower than 425 megahertz.

AUDIENCE: Designed for 250?

PROFESSOR RABBAH: 250 megahertz, and it came back and ran at 425 megahertz. And it's been clocked as high as 500 megahertz at 2.2 volts. The chip isn't really designed for low power, but the tile abstraction is really nice for power consumption, because if you're not using tiles you can essentially just shut them down. So it allows you to have a power-efficient design just by the nature of the architecture.
But when you're using all the tiles, all the memories, all the networks, in a non-optimized design, you consume about 18 watts of power.

So how do you use this tiled processor? Here's one particular example. The nice thing about a tiled architecture is that you can let applications consume as many tiles as they need. If you have an application with a lot of parallelism, then you give it a lot of tiles. If you have an application that doesn't need a lot of parallelism, then you don't give it a lot of tiles. So it allows you to really exploit the mapping of your application down to the architecture, and gives you ASIC-like behavior-- application-specific processing technology. So one example is you have some video that you're recording and you want to encode it and stream it across the web, or display it on your monitor, or whatever else. So you can have some logic that you map down. If your chips are here, you do some computation. You have memories sprinkled across the tiles that you're going to use for local store. So you can parallelize, for example, the motion estimation for encoding the temporal redundancy in a video stream.
You can have another application, completely independent, running on another part of the chip. So here's an application that's using four different tiles, and it's really isolated. It doesn't affect what's going on in these tiles. You can have another application that's running something like MPI, where you're doing dynamic messaging, an httpd server, and this tile is maybe not used, so it's just sleeping or it's idle. You can have memories connected off the chip, I/O devices. So it's really interesting in the sense that-- probably the most interesting aspect of it is that you just allow the tiles to be used as your fundamental resource. And you can scale them up as your application parallelism scales.

This is a picture of the Raw board-- the Raw motherboard. You can actually see it in the Stata Center in the Raw Lab. This is the Raw chip. A lot of the peripheral devices, firmware, and interconnect for dealing with a lot of devices off the chip are implemented in these FPGAs, so these are Xilinx chips. There's DRAM. You have a connection to a PCI card, USB stick.
A network interface, so you can actually log into this machine and use it. And there's a real compiler. It can run real applications. There's actually a bigger system that we built where we take four of these Raw chips and sort of scale them up. So rather than having 16 tiles on your motherboard, you can have four Raw chips. That gives you 64 tiles. You can scale this up to a thousand tiles or so. Just because of the tiled nature, everything is symmetric, homogeneous, so you can really scale it up really big.

So what is the performance of Raw? Looking at overall application performance, we've done a lot of benchmarking. These are numbers from a paper that was published in 2004, where we took a lot of applications-- some are well-known and used in standard benchmark suites-- and compiled them for Raw using the various Raw compilers that we built in-house. And we've compared them against the Pentium 3. So the Pentium 3 is sort of a unique comparison point because it roughly matches Raw in terms of the technology that was used to fabricate the two.
And what you're seeing here-- this is a log scale-- is the speedup of the application running on Raw compared to the application running on a P3. So the higher you get, the better the performance is. These applications are grouped into a few classes. The first class is what we call ILP applications. These are applications that have essentially instruction-level parallelism. I'm going to talk a little bit more about that and sort of explain it. But you've seen this early on in the lectures-- in some of Saman's lectures. So here you're trying to exploit inherent instruction-level parallelism in the applications. And if you have lots of ILP, then you map it to a lot of tiles, and you can get parallelism that way and you get better performance. These applications here are what we call the streaming applications. You saw some of these in the StreamIt lecture and the StreamIt parallelizing compiler lecture. Some of those numbers were generated on a Raw-like architecture. And then you have the server applications, or the sort of more traditional applications that you expect to run in a server style, or throughput-oriented.
And then finally you have bit-level applications: doing things at the very lowest level of computation, where you're doing a lot of bit manipulation. What's interesting to note here is that as you get into more applications that have a lot of inherent parallelism in them-- where you can extract a lot of parallelism because of the explicit nature of the applications-- you can map those really well to the architecture. And because of the communication capabilities of the architecture-- being able to stream data from one tile to another really fast-- you can get really high on-chip bandwidth, and that gives you really high performance, especially for these kinds of applications.

There are other applications that we've done, that some of the students have worked on in the Raw group. So an MPEG-2 encoder, where you're essentially trying to do real-time encoding of a video stream at different resolutions-- 352 by 240 or 720 by 480-- where you're compiling down to a number of tiles: 1, 4, 8, 16-- 1 and 16 are somehow missing, I'm not sure why.
And what are you looking for here? Sort of the scalability of the algorithm. As you add more tiles, are you getting more and more performance, are you getting better and better throughput? So you can encode more frames per second, for example. If you're doing HDTV-- it's 1080p-- then there's a lot of compute power that you need. And so as you add more tiles, maybe you can get to the throughput that you need for HDTV. So this is something that might be interesting for some of your projects as well. And we've talked about this before: on Cell, as you're using more and more SPEs, can you accelerate the performance of your application? Can you sort of show that, if you're doing some visual aspect, and demonstrate it? So there's a demo that is set up in the lab where you can crank up the number of tiles that you're using, and you get better performance from the MPEG encoder. And just looking at the number of frames per second that you can get: with 64 tiles-- so the Raw chip is 16 tiles, but you can scale it up by having more chips-- you can get about 51 frames per second.
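To make the scalability question concrete, here is a minimal Python sketch of how you might compute speedup and parallel efficiency from frames-per-second measurements at different tile counts. The measurements below are hypothetical placeholders for illustration, not the actual Raw numbers.

```python
# Sketch: given throughput (frames/sec) at several tile counts,
# compute speedup and parallel efficiency relative to the smallest
# configuration. The measurements are hypothetical.

def scalability(fps_by_tiles):
    """Return {tiles: (speedup, efficiency)} relative to the smallest config."""
    base_tiles = min(fps_by_tiles)
    base_fps = fps_by_tiles[base_tiles]
    result = {}
    for tiles, fps in sorted(fps_by_tiles.items()):
        speedup = fps / base_fps
        # Efficiency: how much of the ideal linear speedup was achieved.
        efficiency = speedup / (tiles / base_tiles)
        result[tiles] = (speedup, efficiency)
    return result

# Hypothetical encoder throughput; scaling is sublinear, as is typical.
measurements = {1: 1.0, 4: 3.2, 8: 5.5, 16: 9.0, 64: 51.0}
for tiles, (s, e) in scalability(measurements).items():
    print(f"{tiles:3d} tiles: speedup {s:5.1f}x, efficiency {e:4.0%}")
```

A curve like this is exactly what the demo lets you see live as you crank up the tile count.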
These numbers have been improved since, and there are different ways of optimizing this performance. At 352 by 240, the estimated data rate-- the estimated throughput-- is almost 160 frames per second. So this is really high bandwidth.

Another interesting thing that we've done with the Raw chip is taking a look at graphics pipelines, and at whether there's anything we can do to exploit the inherent tiled architecture of the Raw chip. So here's a screenshot from Counter-Strike, and a simplified graphics pipeline where you have some input for the screen you want to render. You do some vertex shading-- so these are triangles where you want to figure out what colors to make them, what colors to paint them. Then the triangle setup and the pixel stage. And in this screen you'll notice that there are two different things that you're rendering. There's essentially this part of the screen, which has a lot of triangles that span a relatively not-so-complex image. And then you have these guys here, that have fewer triangles spanning a smaller region of the frame.
And what you might want to do is allocate more compute power to the pixel stage and less compute power to the vertex stage. So that's analogous to saying: I want more tiles for one stage of the pipeline and fewer tiles for another. Or maybe I want to be able to dynamically change how many tiles I'm allocating to different stages of the pipeline, so that as the screens you're rendering change in terms of their complexity, you can maintain the good visual illusion transparently, without compromising the utilization of the chip.

So, some demos that were done with the graphics group at MIT-- Fredo Durand's group. Phong shading: you have 132 vertices with 1 light source. So this is what you're trying to shade. You have a lot of regions that are black.
So if you're looking at a fixed pipeline where the vertex shader is taking 6 tiles-- this is on a 64-tile chip-- the rasterizer is taking 15 tiles, the pixel processor has 15 tiles, and the alpha buffer operations have 15 tiles, then you might not get the best utilization, because for that entire region that you're rendering where it's black, there's nothing really interesting happening. You want to shift those tiles to another stage of the pipeline. Or, if you can't really utilize them, then you're just wasting power, wasting energy, and so you might just want to shut them down and not use them at all. So with a fixed pipeline versus a reconfigurable pipeline-- where I can change the number of tiles allocated to different stages of the pipeline-- I can get better utilization. And, in some cases, better performance. So here, fuller bars, and you're finishing faster in time. So this is indicative also of what's going on in the graphics industry.
The graphics card used to have fixed resources allocated to different stages, which is essentially what we're trying to model in this part of the experiment. Whereas more and more now you have unified shaders that you can use for both the pixel shading and the vertex shading. So you're getting into more of that programmable aspect, precisely because you want to be able to do this kind of load balancing and exploit the dynamism that you see in the different things that you're trying to render.

Another example: shadow volumes. You have 4 triangles, one light source. And this was rendered in three passes. So pass 1, pass 2, pass 3 would essentially take the same amount of time, because you're doing the same computation mapped to a fixed number of resources. But if I can change the number of resources that I use for different passes-- the rasterizer, for example, and the alpha buffer operations are really where you need a lot of power-- so if you go from 15 tiles for each to 20 tiles for each, you get better execution time, because you were able to exploit parallelism, or match parallelism better to the application.
And so you get 40% faster in this particular case.

And another interesting application: this is the largest microphone array in the world. It's actually in the Guinness Book of Records. It was built in the lab. Each of these little boards has two microphones on it. And what you can use this for is eavesdropping, for example. Or you can carry this around if you want-- pack it in the car and do some spying. But somewhat more interesting demos that were done with this, at smaller scales, were that in a noisy room, for example, you might want to sort of hone in. Let's say everybody here was speaking, but for the camera they want to record only my voice. They can have a microphone array in the back that focuses on just my voice. And the way it's done is you can measure the distance, from the time it takes for the sound wave to reach each of these different microphones, and you can focus in on a particular source of sound and be able to just highlight that.
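The focusing trick just described is delay-and-sum beamforming, and the core of it fits in a few lines of Python. This is a simplified sketch, not the array's actual code: it assumes the per-microphone arrival delays of the target source (in whole samples) are already known from the geometry.

```python
# Minimal delay-and-sum beamformer sketch (illustrative, stdlib only).
# The target source reaches each microphone with a known delay; shifting
# each channel back by its delay aligns the target coherently, while
# sound from other directions averages toward zero.

def delay_and_sum(channels, delays):
    """channels: list of equal-length sample lists.
    delays: per-channel arrival delay of the target, in samples."""
    n = len(channels[0])
    out = []
    for t in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            j = t + d  # where the target's wavefront was on this channel
            if 0 <= j < n:
                acc += ch[j]
        out.append(acc / len(channels))
    return out

# Example: the same pulse reaches mic 0 at sample 2 and mic 1 at sample 4.
mic0 = [0, 0, 1.0, 0, 0, 0]
mic1 = [0, 0, 0, 0, 1.0, 0]
aligned = delay_and_sum([mic0, mic1], delays=[2, 4])
print(aligned)  # the pulse from both mics now adds up at sample 0
```

In the real array the delays are recomputed as the target moves, which is what lets it track a speaker around the room.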
So there's this demo where there's a noisy room-- I probably should have had these in here, in retrospect-- there's a noisy room, lots of people are talking, then you turn on the microphone array and you can hear that one particular source, and it's a lot clearer. You can also have applications where you're tracking a person in a room with video as well, so you can sort of follow him around. So it's a very interesting application. And now I regret not having the video demo in here. Actually, should I do it? It's on the Web. OK.

So, a case study using the beamformer. What's being done in the microphone array is beamforming. You're trying to figure out what are the different beams that are reaching the microphones, and you want to be able to amplify one of them. So looking at the application written natively in C, running on a 1 gigahertz Pentium, what is the operation throughput? You're getting about 240 MegaFLOPS. And if you go down to the same code, but running on a single-tile Raw chip, you get about 19 MegaFLOPS.
So not very good performance. But here, what you really want to do-- you have a lot of parallelism, because each of those beams that's reaching the individual microphones can be processed in parallel. So you have a lot of parallelism in that application. So taking the C program, reimplementing it in StreamIt, which you've seen in previous lectures, and not really optimizing it-- not doing a lot of the optimizations you saw in the parallelizing compiler talk-- you get about 640 MegaFLOPS. So already you're beating the C program running on a pretty fast superscalar machine. And if you really optimize the StreamIt code-- doing the fission and fusion, increasing the parallelism, doing better load balancing automatically-- you can get up to 1.4 GigaFLOPS. So really good performance, and really matching the inherent parallelism to the architecture.

So that was just a big overview of the Raw chip and what we've done with it in the lab. There's more in here than I've talked about.
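As a quick arithmetic check, the throughput figures just quoted work out to the following ratios. Only the quoted FLOPS numbers go in; the labels are shorthand.

```python
# Ratios implied by the beamformer throughput figures quoted above.
P3_C = 240e6          # C on the ~1 GHz Pentium, FLOP/s
RAW_1TILE_C = 19e6    # same C code on a single Raw tile
STREAMIT = 640e6      # unoptimized StreamIt on Raw
STREAMIT_OPT = 1.4e9  # optimized StreamIt (fission/fusion, load balancing)

print(f"StreamIt vs C on the Pentium:       {STREAMIT / P3_C:.1f}x")
print(f"Optimized StreamIt vs the Pentium:  {STREAMIT_OPT / P3_C:.1f}x")
print(f"Optimized StreamIt vs one tile:     {STREAMIT_OPT / RAW_1TILE_C:.0f}x")
```

So even the unoptimized StreamIt version is roughly 2.7x the superscalar machine, and the optimized version close to 6x.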
But what I'm going to do next is give you some insight into the design philosophy that went into the Raw architecture-- why it was designed the way it was. And then I'm going to talk a little bit about the Raw parallelizing compiler. And while the StreamIt language and compiler also have a back end for the Raw architecture, we've seen that in previous lectures, so I'm not going to talk about that here. I'm just going to focus on the first two bullets.

A few years ago, when the project got started, the insight was that wide-issue processors-- the design philosophy that was being followed in industry, building wider superscalars, faster superscalars-- were really going to come to a halt, largely because you have scalability issues. So if you look at a simplified illustration of a wide-issue microprocessor, you have your program counter that fetches instructions. They go into some control logic. The control logic is then going to run; you're going to read some values from the register file. You'll have a big crossbar in the middle that routes operands to ALUs.
And then you operate on those, and you have to send the results back to the register file. Plus you have this really big problem with the network. So if you're doing some computation-- sorry, I rearranged these slides. So if you have n ALUs, then the complexity of your crossbar increases as n squared, because you essentially have to have everybody talking to each other. And in terms of the number of wires that you need out of the register file to support everybody being able to talk to anybody else very efficiently, the number of ports, the number of wires, increases as n cubed. So that's a problem, because you can't clock all those wires fast enough. The frequency becomes limited-- it grows even less than linearly. And this is a problem because operand routing is global. So if I'm doing some operations-- say there's an add, and the result of this add is fed to another operation, a shift, and these are going to execute on two different ALUs-- what's going to happen? I do the add operation. It's going to produce a result.
436 00:19:26,100 --> 00:19:30,100 But there's no direct path for this ALU to send this result 437 00:19:30,100 --> 00:19:30,560 to this ALU. 438 00:19:30,560 --> 00:19:33,530 So instead what happens is the operand has to travel 439 00:19:33,530 --> 00:19:36,195 all the way back around through the crossbar and then 440 00:19:36,195 --> 00:19:37,445 back to this ALU. 441 00:19:39,700 --> 00:19:43,210 So that's really just going to take a long time and it's not 442 00:19:43,210 --> 00:19:44,300 necessarily very efficient. 443 00:19:44,300 --> 00:19:48,140 And if you're doing this for a lot of ALU operations-- you 444 00:19:48,140 --> 00:19:49,780 have a lot of parallelism in your application, 445 00:19:49,780 --> 00:19:51,750 instruction-level parallelism-- that's just 446 00:19:51,750 --> 00:19:53,300 creating a lot of communication. 447 00:19:53,300 --> 00:19:55,170 But you're not really exploiting the locality of the 448 00:19:55,170 --> 00:19:56,830 computation. 449 00:19:56,830 --> 00:19:59,440 If two instructions are really close together, you want to be 450 00:19:59,440 --> 00:20:01,890 able to just have a point-to-point path, for 451 00:20:01,890 --> 00:20:05,050 example, or a shorter path that allows you to exploit 452 00:20:05,050 --> 00:20:07,920 where instructions are in space. 453 00:20:07,920 --> 00:20:11,220 And so this was the driving insight for the architecture: 454 00:20:11,220 --> 00:20:14,850 you want to make operand routing local. 455 00:20:14,850 --> 00:20:18,110 So the idea is to essentially exploit this locality by 456 00:20:18,110 --> 00:20:19,730 distributing the ALUs. 457 00:20:19,730 --> 00:20:22,570 And rather than having that massive crossbar, what you 458 00:20:22,570 --> 00:20:25,660 want to do is have an on-chip mesh network. 459 00:20:25,660 --> 00:20:28,393 So rather than one big crossbar, you have lots of 460 00:20:28,393 --> 00:20:29,060 smaller ones.
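[The scaling argument above can be made concrete with a toy back-of-the-envelope calculation. This is my own illustration, not from the lecture; the function names and the exact constants are assumptions -- only the growth rates (n squared for the crossbar, n cubed for the register file wiring) come from the lecture.]

```python
def crossbar_paths(n):
    """Point-to-point paths a full crossbar must provide so that
    every one of n ALUs can talk to every other: grows as n^2."""
    return n * n

def regfile_wires(n):
    """Rough register-file wiring cost to feed n ALUs at full
    rate (ports x width x readers): grows as n^3."""
    return n ** 3

# Doubling the issue width from 8 to 16 ALUs quadruples the
# crossbar and multiplies the register-file wiring by eight.
for n in (4, 8, 16):
    print(n, crossbar_paths(n), regfile_wires(n))
```

[The point of the sketch is just that the centralized structures grow superlinearly, so a tiled design with per-tile resources wins as n gets large.]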
461 00:20:29,060 --> 00:20:31,040 So these become switch processors. 462 00:20:31,040 --> 00:20:34,750 So I can output a value from this ALU here and then have that 463 00:20:34,750 --> 00:20:37,350 value routed to any other ALU. 464 00:20:37,350 --> 00:20:39,990 Maybe that just costs me more in terms of instructions that 465 00:20:39,990 --> 00:20:42,770 say where this operand is going. 466 00:20:42,770 --> 00:20:44,320 We'll get into that. 467 00:20:44,320 --> 00:20:46,580 But here, what this allows me to do is exploit 468 00:20:46,580 --> 00:20:47,790 that locality better. 469 00:20:47,790 --> 00:20:51,650 Same instruction chain: I can put the first operation on one 470 00:20:51,650 --> 00:20:55,950 ALU, I can put the other operation on the second ALU. 471 00:20:55,950 --> 00:20:58,240 And here, rather than putting it for example here, which 472 00:20:58,240 --> 00:21:01,230 would send the operand really far across the chip, what I want 473 00:21:01,230 --> 00:21:03,660 to do is recognize that there's a producer/consumer 474 00:21:03,660 --> 00:21:04,800 relationship here. 475 00:21:04,800 --> 00:21:07,245 I want to exploit that locality and have them close 476 00:21:07,245 --> 00:21:11,260 in space so that the routes remain fairly short. 477 00:21:11,260 --> 00:21:13,890 You know, what I can also do is pipeline this network 478 00:21:13,890 --> 00:21:16,770 so that I can have the hardware essentially match the 479 00:21:16,770 --> 00:21:18,530 computation flow. 480 00:21:18,530 --> 00:21:22,900 If one ALU is producing results at a lot faster 481 00:21:22,900 --> 00:21:25,680 rate than, for example, this instruction can consume them, 482 00:21:25,680 --> 00:21:29,470 then the hardware can take care of, for example, blocking 483 00:21:29,470 --> 00:21:32,680 or stalling the producing processor so it doesn't get 484 00:21:32,680 --> 00:21:33,650 too far ahead.
485 00:21:33,650 --> 00:21:36,490 It gives you a natural mechanism for regulating the 486 00:21:36,490 --> 00:21:39,940 flow of data on the chip. 487 00:21:39,940 --> 00:21:44,680 Well, this is better than what we saw before because with the 488 00:21:44,680 --> 00:21:47,260 crossbar you're not really getting any scalability in 489 00:21:47,260 --> 00:21:50,670 terms of your latency transporting operands from one 490 00:21:50,670 --> 00:21:53,380 ALU to another. 491 00:21:53,380 --> 00:21:56,790 Whereas with an on-chip network, if you've taken routing 492 00:21:56,790 --> 00:21:59,340 classes, you know that there exist algorithms that 493 00:21:59,340 --> 00:22:03,030 allow you to route things in time proportional to the square root of n, 494 00:22:03,030 --> 00:22:05,170 where n is the number of things that are communicating 495 00:22:05,170 --> 00:22:06,380 in your network. 496 00:22:06,380 --> 00:22:08,560 But if you're doing locality-driven placement then it's 497 00:22:08,560 --> 00:22:10,040 essentially constant time. 498 00:22:10,040 --> 00:22:12,190 And in a raw chip, it's in fact three cycles. 499 00:22:12,190 --> 00:22:15,220 So you can send one operand from one tile to another in 500 00:22:15,220 --> 00:22:15,780 three cycles. 501 00:22:15,780 --> 00:22:18,730 And we'll get into how that number comes about. 502 00:22:18,730 --> 00:22:19,830 So this is much better. 503 00:22:19,830 --> 00:22:21,450 But what it does is increase the 504 00:22:21,450 --> 00:22:22,750 complexity of the compiler. 505 00:22:22,750 --> 00:22:25,780 It says, this is my computation, how do you map it 506 00:22:25,780 --> 00:22:28,960 efficiently so that things are clustered well in space, so 507 00:22:28,960 --> 00:22:33,880 that I don't have these really long routes for communication? 508 00:22:33,880 --> 00:22:36,190 But then we can look at what else we can distribute. 509 00:22:36,190 --> 00:22:38,640 Well, we have the register file.
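[The stall-on-full flow control described above -- the network blocking a producer that gets too far ahead of its consumer -- can be sketched as a bounded FIFO. This is a minimal model of my own; the class name, capacity, and try_send/try_recv interface are illustrative assumptions, not raw's actual hardware interface.]

```python
from collections import deque

class OperandLink:
    """Tiny model of a pipelined on-chip link with backpressure:
    a producer that outruns its consumer is forced to stall."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.fifo = deque()

    def try_send(self, value):
        """Producer side: returns False (stall) when the link is full."""
        if len(self.fifo) >= self.capacity:
            return False
        self.fifo.append(value)
        return True

    def try_recv(self):
        """Consumer side: returns None (stall) when the link is empty."""
        return self.fifo.popleft() if self.fifo else None

link = OperandLink(capacity=2)
assert link.try_send(1) and link.try_send(2)
assert not link.try_send(3)   # producer stalls: consumer too far behind
assert link.try_recv() == 1   # consumer drains one value
assert link.try_send(3)       # producer can proceed again
```

[The hardware analogue is that the producing tile's pipeline simply does not advance while the link is full, which is the "natural mechanism for regulating the flow of data" the lecture refers to.]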
510 00:22:38,640 --> 00:22:41,240 We can distribute that across all the ALUs. 511 00:22:41,240 --> 00:22:44,500 And that essentially reduces that n cubed relationship 512 00:22:44,500 --> 00:22:47,980 between ALUs and register file ports to something that's a 513 00:22:47,980 --> 00:22:49,130 lot more tractable, 514 00:22:49,130 --> 00:22:54,170 where it's one small register file per ALU. 515 00:22:54,170 --> 00:22:57,370 And this is better in terms of scalability, but we haven't 516 00:22:57,370 --> 00:22:59,870 solved the entire problem in that we still have one global 517 00:22:59,870 --> 00:23:03,390 program counter, one global instruction fetch unit, 518 00:23:03,390 --> 00:23:07,240 one global control unit, a unified load/store queue for 519 00:23:07,240 --> 00:23:08,600 communicating with memory. 520 00:23:08,600 --> 00:23:13,850 And those all have scalability problems. So whereas we fixed 521 00:23:13,850 --> 00:23:15,360 the problem with the crossbar-- 522 00:23:15,360 --> 00:23:17,840 that becomes more scalable-- 523 00:23:17,840 --> 00:23:19,940 we haven't really fixed the problems with the others. 524 00:23:19,940 --> 00:23:22,530 So what's the natural solution here? 525 00:23:22,530 --> 00:23:26,250 Well, we'll just distribute everything else. 526 00:23:26,250 --> 00:23:30,090 And so you end up with each ALU here now having its own 527 00:23:30,090 --> 00:23:32,610 program counter, its own instruction cache, its own 528 00:23:32,610 --> 00:23:33,540 data cache. 529 00:23:33,540 --> 00:23:37,610 And it has its register file and ALU, and that same 530 00:23:37,610 --> 00:23:40,560 design pattern is repeated for each 531 00:23:40,560 --> 00:23:41,840 one of those ALUs. 532 00:23:41,840 --> 00:23:44,340 So now it looks a lot more scalable. 533 00:23:44,340 --> 00:23:46,320 I don't have any global wires. 534 00:23:46,320 --> 00:23:49,100 There's no global centralized data structure.
535 00:23:49,100 --> 00:23:52,220 And all of that means I can do things 536 00:23:52,220 --> 00:23:55,600 faster and more efficiently. 537 00:23:55,600 --> 00:23:58,990 And what you start seeing here is this sort of tiled processor 538 00:23:58,990 --> 00:24:00,110 coming about. 539 00:24:00,110 --> 00:24:03,920 So each one of those things was exactly the same. 540 00:24:03,920 --> 00:24:06,470 And what was done in the raw processor is that none of 541 00:24:06,470 --> 00:24:09,880 those tiles was larger than the distance you can communicate in one 542 00:24:09,880 --> 00:24:10,710 clock cycle. 543 00:24:10,710 --> 00:24:14,850 So this solved essentially the wire delay problem as well. 544 00:24:14,850 --> 00:24:17,600 So if this is the distance that a wire-- 545 00:24:17,600 --> 00:24:19,340 that a signal can travel in one clock 546 00:24:19,340 --> 00:24:21,970 cycle, the tile is smaller. 547 00:24:21,970 --> 00:24:23,810 It can fit within this circle. 548 00:24:23,810 --> 00:24:26,820 So that means that you're guaranteed-- 549 00:24:26,820 --> 00:24:29,200 you have better scalability properties. You're solving the 550 00:24:29,200 --> 00:24:32,860 issues that people are facing with wire delay. 551 00:24:32,860 --> 00:24:36,940 And in terms of the tiled processor abstraction, Michael 552 00:24:36,940 --> 00:24:41,680 Taylor, who was a PhD student in the raw group, in his thesis 553 00:24:41,680 --> 00:24:45,780 identified the tiled processor approach and the 554 00:24:45,780 --> 00:24:48,000 aspect of the tiled processor approach that makes it 555 00:24:48,000 --> 00:24:50,880 attractive, the SON, 556 00:24:50,880 --> 00:24:52,990 which is the scalar operand network. 557 00:24:52,990 --> 00:24:57,080 And the next two slides, the next part of the lecture, is 558 00:24:57,080 --> 00:25:00,120 going to really focus on what that means.
559 00:25:00,120 --> 00:25:02,170 He argues why the tiled 560 00:25:02,170 --> 00:25:05,340 processor approach is scalable. 561 00:25:05,340 --> 00:25:07,160 And it's scalable for the same reasons as multicores. 562 00:25:07,160 --> 00:25:09,350 You just add more and more cores on a chip. 563 00:25:09,350 --> 00:25:13,910 But the intrinsic difference between the multicores that you 564 00:25:13,910 --> 00:25:16,580 see today and the raw architecture is the scalar 565 00:25:16,580 --> 00:25:18,150 operand network. 566 00:25:18,150 --> 00:25:20,960 So I'm going to ask you questions about 567 00:25:20,960 --> 00:25:22,690 this in a few slides. 568 00:25:22,690 --> 00:25:25,620 But really what you're getting here is the ability to 569 00:25:25,620 --> 00:25:28,980 communicate from one processor to another very efficiently. 570 00:25:28,980 --> 00:25:31,990 And the way you do this on raw is you have your instruction 571 00:25:31,990 --> 00:25:35,340 fetch, decode, register file read stage, ALU-- 572 00:25:35,340 --> 00:25:38,090 your computation pipeline. 573 00:25:38,090 --> 00:25:41,430 But part of the register file-- so registers 24 574 00:25:41,430 --> 00:25:43,960 through 27-- are network mapped. 575 00:25:43,960 --> 00:25:46,890 So what that means is, if one of the 576 00:25:46,890 --> 00:25:51,800 operations that I have in my computation has a destination 577 00:25:51,800 --> 00:25:56,480 register that's 24, 25, 26 or 27, that value automatically 578 00:25:56,480 --> 00:25:59,360 gets sent to the output network. 579 00:25:59,360 --> 00:26:01,150 And if I have a value-- 580 00:26:01,150 --> 00:26:04,960 if one of my source operands is register 24, 25, 26 or 581 00:26:04,960 --> 00:26:08,340 27, implicitly that means get that value off the network.
582 00:26:12,010 --> 00:26:15,780 And so I can have add $25-- 583 00:26:15,780 --> 00:26:18,560 an add to register 25-- so this is one of the network-mapped 584 00:26:18,560 --> 00:26:20,760 ports, summing two operands. 585 00:26:20,760 --> 00:26:23,150 So this is a picture of the raw chip. 586 00:26:23,150 --> 00:26:25,100 This is one tile. 587 00:26:25,100 --> 00:26:26,760 This is the other tile. 588 00:26:26,760 --> 00:26:30,250 So you can sort of see the compute processor and the network 589 00:26:30,250 --> 00:26:32,110 switch processor here. 590 00:26:32,110 --> 00:26:36,340 So the operand flows into the network and then gets 591 00:26:36,340 --> 00:26:39,360 transported across from one tile to the other. 592 00:26:39,360 --> 00:26:40,800 And then it gets injected into the other 593 00:26:40,800 --> 00:26:43,270 tile's compute pipeline. 594 00:26:43,270 --> 00:26:46,700 And here this instruction has a source operand that's 595 00:26:46,700 --> 00:26:48,250 a register-mapped operand. 596 00:26:48,250 --> 00:26:49,730 So it knows where to get its value from. 597 00:26:49,730 --> 00:26:51,830 And then you can do the computation. 598 00:26:51,830 --> 00:26:55,200 An interesting aspect here is that while you've seen 599 00:26:55,200 --> 00:26:58,080 instructions like this, just normal instructions, here you 600 00:26:58,080 --> 00:27:02,220 also have explicit routing instructions that are executed 601 00:27:02,220 --> 00:27:04,330 on the switch processor. 602 00:27:04,330 --> 00:27:06,960 So the switch processor here says take the value that's 603 00:27:06,960 --> 00:27:11,990 coming from my processor and send it east. So each 604 00:27:11,990 --> 00:27:15,360 processor can send values east, west, north or south. 605 00:27:15,360 --> 00:27:17,950 So it can go to the tile above it, the tile below it, the 606 00:27:17,950 --> 00:27:20,650 tile to the left of it or the tile to the right of it. 607 00:27:20,650 --> 00:27:24,290 And so sending it east sends it along this wire here.
608 00:27:24,290 --> 00:27:27,120 And then this particular switch processor says get a 609 00:27:27,120 --> 00:27:30,910 value from the west port and send it to my processor. 610 00:27:30,910 --> 00:27:33,970 Now, this switch could also say, this 611 00:27:33,970 --> 00:27:37,060 value is not for me, so I want to just pass it through to some 612 00:27:37,060 --> 00:27:37,980 other processor. 613 00:27:37,980 --> 00:27:40,770 So you can pass it from the west port to the south port or 614 00:27:40,770 --> 00:27:44,170 to the north port, or just pass it through laterally to the 615 00:27:44,170 --> 00:27:46,530 east port. 616 00:27:46,530 --> 00:27:48,000 So this essentially gives you an 617 00:27:48,000 --> 00:27:50,480 on-chip network. Now, you can imagine 618 00:27:50,480 --> 00:27:55,040 having an operand that's a data packet with a header that 619 00:27:55,040 --> 00:27:58,540 says, I'm going to tile 10, and the switches know 620 00:27:58,540 --> 00:27:59,510 which way to send it. 621 00:27:59,510 --> 00:28:01,700 But the interesting aspect here is that the compiler 622 00:28:01,700 --> 00:28:04,060 actually orchestrates the communication, so you don't 623 00:28:04,060 --> 00:28:06,612 need that extra header that says, I'm going to tile 10. 624 00:28:06,612 --> 00:28:09,380 You just have to generate a schedule of how to route that 625 00:28:09,380 --> 00:28:11,250 data through. 626 00:28:11,250 --> 00:28:13,170 So we'll get into what that means for the compiler in 627 00:28:13,170 --> 00:28:16,140 terms of that added complexity. 628 00:28:16,140 --> 00:28:19,630 So communication on multicores is expensive for 629 00:28:19,630 --> 00:28:20,640 the following reasons. 630 00:28:20,640 --> 00:28:24,400 And this is really going to contrast, or going to put, 631 00:28:24,400 --> 00:28:26,360 the scalar operand network into slightly more 632 00:28:26,360 --> 00:28:27,450 perspective.
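[The register-mapped send/receive and the compiler-scheduled switch route described above can be sketched in a few lines. This is my own toy model, not raw's ISA or hardware: the Tile class, the route_east helper, and the use of Python deques for the network ports are all illustrative assumptions. Only the convention that registers 24 through 27 are network mapped comes from the lecture.]

```python
from collections import deque

# Registers 24-27 are network mapped: writing one injects an operand
# into the switch; reading one pulls an operand off the switch.
NETWORK_REGS = {24, 25, 26, 27}

class Tile:
    def __init__(self):
        self.regs = [0] * 32
        self.out = deque()   # operands headed out to this tile's switch
        self.inq = deque()   # operands arriving from the switch

    def write(self, reg, value):
        if reg in NETWORK_REGS:
            self.out.append(value)     # implicit send to the network
        else:
            self.regs[reg] = value

    def read(self, reg):
        if reg in NETWORK_REGS:
            return self.inq.popleft()  # implicit receive from the network
        return self.regs[reg]

def route_east(src, dst):
    """One compiler-scheduled switch instruction: move the operand
    from src's output port east into dst's input port."""
    dst.inq.append(src.out.popleft())

a, b = Tile(), Tile()
a.write(25, 40 + 2)      # like 'add $25, ...' on tile A: result goes out
route_east(a, b)         # the switch schedule, not a packet header
assert b.read(25) == 42  # like 'add ..., $25' on tile B: operand comes in
```

[Note that route_east carries no destination header: the route exists only because the compiler emitted that switch instruction, which is exactly the point the lecture makes about compile-time orchestration.]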
633 00:28:27,450 --> 00:28:31,480 But first, how do you communicate between cores 634 00:28:31,480 --> 00:28:32,650 on cell? 635 00:28:32,650 --> 00:28:36,510 You have the DMA transfers from one SPE to another. 636 00:28:36,510 --> 00:28:39,570 You can't really ship a single operand value. 637 00:28:39,570 --> 00:28:43,030 So if I write the value x, and I want to send x from one SPE 638 00:28:43,030 --> 00:28:46,790 to another, I can't really do that very efficiently, right? 639 00:28:46,790 --> 00:28:52,140 So this is essentially the contrast between 640 00:28:52,140 --> 00:28:55,320 the multicore processors that largely exist today and the 641 00:28:55,320 --> 00:28:56,350 raw processor. 642 00:28:56,350 --> 00:29:00,210 So I've shown you an empirical-- a quantitative-- 643 00:29:00,210 --> 00:29:04,170 an analytical model for communication costs in 644 00:29:04,170 --> 00:29:06,380 earlier slides. 645 00:29:06,380 --> 00:29:08,740 This is an illustration of that concept. 646 00:29:08,740 --> 00:29:12,370 So if I have a processor that's talking to another, 647 00:29:12,370 --> 00:29:16,230 that value has to travel across some network and 648 00:29:16,230 --> 00:29:18,940 there's some transport cost associated with that. 649 00:29:18,940 --> 00:29:20,590 But there are also some added complexities. 650 00:29:20,590 --> 00:29:22,760 So there were lots of terms, if you remember, in that 651 00:29:22,760 --> 00:29:25,730 really big equation I showed before. 652 00:29:25,730 --> 00:29:29,100 You have some overhead in terms of packaging the data. 653 00:29:29,100 --> 00:29:32,040 And you have some overhead in terms of unpacking the data. 654 00:29:32,040 --> 00:29:33,420 So what does that look like? 655 00:29:33,420 --> 00:29:36,580 Well, there are two components we're going to break this down 656 00:29:36,580 --> 00:29:39,020 into: the send occupancy and the send latency. 657 00:29:39,020 --> 00:29:40,530 And I'm going to talk about each of those.
658 00:29:40,530 --> 00:29:43,050 And similarly on the receive side, you have the receive 659 00:29:43,050 --> 00:29:45,640 latency and the receive occupancy. 660 00:29:45,640 --> 00:29:50,400 So bear in mind, the lifetime of a message essentially has 661 00:29:50,400 --> 00:29:52,820 to flow through these five components. 662 00:29:52,820 --> 00:29:55,810 It has to go through the send occupancy stage, then there's 663 00:29:55,810 --> 00:29:59,810 the send latency, transport, receive latency and receive 664 00:29:59,810 --> 00:30:04,830 occupancy before you can actually use it to compute on. 665 00:30:04,830 --> 00:30:06,670 So what are some things that you do here? 666 00:30:06,670 --> 00:30:09,900 Well, it's things that you've done on cell for getting DMA 667 00:30:09,900 --> 00:30:10,890 transfers to work. 668 00:30:10,890 --> 00:30:14,040 You have to figure out who the destination is, what is the 669 00:30:14,040 --> 00:30:17,800 value, maybe you have an ID associated with it, a tag, 670 00:30:17,800 --> 00:30:18,630 things of that sort. 671 00:30:18,630 --> 00:30:20,120 And you have to essentially inject that 672 00:30:20,120 --> 00:30:22,530 message into the network. 673 00:30:22,530 --> 00:30:24,210 So there's some latency associated with that. 674 00:30:24,210 --> 00:30:26,370 Maybe your-- 675 00:30:26,370 --> 00:30:31,480 on cell you have a DMA engine which essentially hides this 676 00:30:31,480 --> 00:30:32,510 latency for you. 677 00:30:32,510 --> 00:30:34,520 Because you can essentially just send the message to the 678 00:30:34,520 --> 00:30:36,110 DMA, right into its queue. 679 00:30:36,110 --> 00:30:39,530 And you can essentially forget about it unless it stalls 680 00:30:39,530 --> 00:30:43,340 because the DMA list is full. 681 00:30:43,340 --> 00:30:45,890 On the receive side, you sort of have a similar thing.
682 00:30:45,890 --> 00:30:49,810 You have to get the network to inject that value into the 683 00:30:49,810 --> 00:30:53,005 processor and then you have to depackage it, demultiplex it 684 00:30:53,005 --> 00:30:55,960 and put it into some form that you can actually use to 685 00:30:55,960 --> 00:30:57,670 operate on it. 686 00:30:57,670 --> 00:31:01,700 So this 5-tuple gives us a way of characterizing 687 00:31:01,700 --> 00:31:05,570 communication patterns on different architectures. 688 00:31:05,570 --> 00:31:09,530 So I can contrast, for example, raw versus a 689 00:31:09,530 --> 00:31:12,520 traditional microprocessor. 690 00:31:12,520 --> 00:31:15,460 So this is a traditional superscalar. 691 00:31:15,460 --> 00:31:18,800 A traditional superscalar essentially has all the 692 00:31:18,800 --> 00:31:22,200 sophisticated circuitry that gives you a 693 00:31:22,200 --> 00:31:23,660 bypass network. 694 00:31:23,660 --> 00:31:26,020 You can have an operand directly flowing to another 695 00:31:26,020 --> 00:31:29,950 ALU through all the n squared wires in the crossbar. 696 00:31:29,950 --> 00:31:33,320 And a lot of dynamic scheduling is going on. 697 00:31:33,320 --> 00:31:37,110 So it really has no occupancy, no latency; you're not really 698 00:31:37,110 --> 00:31:39,470 doing any packaging of the operands. 699 00:31:39,470 --> 00:31:43,460 Your transport cost is essentially completely hidden. 700 00:31:43,460 --> 00:31:46,000 You have no complexity on the receive side. 701 00:31:46,000 --> 00:31:47,540 So it's really efficient. 702 00:31:47,540 --> 00:31:50,140 So this is essentially what you want to get to: this 703 00:31:50,140 --> 00:31:51,250 kind of 5-tuple.
704 00:31:51,250 --> 00:31:54,170 But as we saw before, it's really not scalable because of 705 00:31:54,170 --> 00:31:57,460 the wire complexity woes-- whether it's n squared or n 706 00:31:57,460 --> 00:31:59,480 cubed, that's not good from an energy 707 00:31:59,480 --> 00:32:01,150 efficiency point of view. 708 00:32:01,150 --> 00:32:02,340 Scalable multiprocessors-- 709 00:32:02,340 --> 00:32:05,580 these are on-chip multiprocessors more 710 00:32:05,580 --> 00:32:08,770 indicative of things that you have today-- have this kind of 711 00:32:08,770 --> 00:32:12,210 5-tuple, where you have about 16 cycles just to get a 712 00:32:12,210 --> 00:32:15,355 message out, and roughly 3 cycles or so 713 00:32:15,355 --> 00:32:16,890 to transport the message. 714 00:32:16,890 --> 00:32:19,370 So maybe this is being done through a shared cache, 715 00:32:19,370 --> 00:32:22,120 which is how a lot of architectures communicate 716 00:32:22,120 --> 00:32:23,300 between processors today. 717 00:32:23,300 --> 00:32:26,970 And you have to sort of demultiplex the message on the 718 00:32:26,970 --> 00:32:28,130 receive side. 719 00:32:28,130 --> 00:32:30,280 So that adds some latency. 720 00:32:30,280 --> 00:32:34,580 In raw, because you have these network-mapped registers on 721 00:32:34,580 --> 00:32:37,210 the input side and the output side, you really can knock 722 00:32:37,210 --> 00:32:44,790 down the complexity on the send side in terms of the 723 00:32:44,790 --> 00:32:46,770 occupancy and latency to zero. 724 00:32:46,770 --> 00:32:48,610 You just write the values to the register. 725 00:32:48,610 --> 00:32:50,490 And it looks like a normal register, right? 726 00:32:50,490 --> 00:32:53,500 But it just magically appears on the network.
727 00:32:53,500 --> 00:32:56,380 And then from one tile to another, it's one cycle to 728 00:32:56,380 --> 00:32:59,380 ship the value across that one link from one switch processor 729 00:32:59,380 --> 00:33:02,020 to the other, as long as it's a near neighbor. 730 00:33:02,020 --> 00:33:04,080 And then two cycles to inject the value from the network 731 00:33:04,080 --> 00:33:05,820 into the tile processor. 732 00:33:05,820 --> 00:33:08,270 And then you're ready to use it. 733 00:33:08,270 --> 00:33:12,790 So in this space, where would you put cell is the question? 734 00:33:12,790 --> 00:33:14,310 Anybody have any ideas? 735 00:33:19,670 --> 00:33:21,790 What would the communication tuple look like on cell? 736 00:33:27,960 --> 00:33:30,930 So you have to do explicit sends and receives. 737 00:33:30,930 --> 00:33:35,450 So let's look at this. 738 00:33:35,450 --> 00:33:38,000 So can we get rid of this stage on cell, which is 739 00:33:38,000 --> 00:33:40,160 essentially packaging up my 740 00:33:40,160 --> 00:33:42,190 message? The answer is no, right? 741 00:33:42,190 --> 00:33:44,500 Because you have to essentially say where that DMA 742 00:33:44,500 --> 00:33:46,680 transfer is going to go-- which region of memory? 743 00:33:46,680 --> 00:33:49,670 So you're building these control blocks. 744 00:33:49,670 --> 00:33:54,230 And then the send latency here is roughly zero, because you 745 00:33:54,230 --> 00:33:56,090 have the DMA processor which allows that kind of 746 00:33:56,090 --> 00:33:58,830 concurrency between communication and computation, 747 00:33:58,830 --> 00:34:03,560 so you can essentially hide that part of the transport-- 748 00:34:03,560 --> 00:34:05,760 that part of the communication cost. 749 00:34:05,760 --> 00:34:09,210 For your transport cost here, you have this really massive 750 00:34:09,210 --> 00:34:10,860 bandwidth, this really high bandwidth 751 00:34:10,860 --> 00:34:11,750 interconnect on the chip.
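[The 5-tuple model above is just a sum of five per-message costs, which can be written down directly. The superscalar tuple of all zeros and raw's near-neighbor numbers (0 send occupancy, 0 send latency, 1 cycle per link, 2 cycles to inject, 0 receive occupancy) come straight from the lecture; the function name and the tuple layout are my own sketch.]

```python
def end_to_end_cycles(tup):
    """Sum the five components a message pays before you can
    compute on it: send occupancy, send latency, transport,
    receive latency, receive occupancy."""
    send_occ, send_lat, transport, recv_lat, recv_occ = tup
    return send_occ + send_lat + transport + recv_lat + recv_occ

superscalar = (0, 0, 0, 0, 0)  # bypass network hides everything
raw_tile    = (0, 0, 1, 2, 0)  # 1 cycle on the link + 2 to inject

print(end_to_end_cycles(raw_tile))  # 3 cycles near neighbor, as stated
```

[For tiles that are not adjacent, the transport component would grow by roughly one cycle per hop, while the other four components stay fixed -- which is what makes the scalar operand network scale.]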
752 00:34:11,750 --> 00:34:14,520 So this makes it reasonably fast, but 753 00:34:14,520 --> 00:34:16,430 it's still a few cycles. 754 00:34:16,430 --> 00:34:18,970 There's no near neighbor? 755 00:34:18,970 --> 00:34:22,420 Yeah, a hundred cycles to do near-neighbor communication. 756 00:34:22,420 --> 00:34:24,160 Because you're still-- 757 00:34:24,160 --> 00:34:26,210 you don't have that fast mechanism of being able to 758 00:34:26,210 --> 00:34:27,910 send things point to point. 759 00:34:27,910 --> 00:34:32,020 You're putting things on the bus and there's some 760 00:34:32,020 --> 00:34:33,690 complexity there. 761 00:34:33,690 --> 00:34:36,345 On the receive side, you have the same kind of complexity that 762 00:34:36,345 --> 00:34:37,820 you had on the send side. 763 00:34:37,820 --> 00:34:39,770 You have to know that the message is coming; that can be 764 00:34:39,770 --> 00:34:41,150 done in different ways. 765 00:34:41,150 --> 00:34:43,790 And then you have to take that message and write it into your 766 00:34:43,790 --> 00:34:45,380 local store. 767 00:34:45,380 --> 00:34:50,610 Which also adds some overhead in terms of the communication 768 00:34:50,610 --> 00:34:57,970 cost. So the cell would probably be somewhere up here, 769 00:34:57,970 --> 00:34:58,530 I would imagine. 770 00:34:58,530 --> 00:35:00,300 I didn't have a chance to get the numbers. 771 00:35:00,300 --> 00:35:04,490 If I do, I'll update the slide later on. 772 00:35:04,490 --> 00:35:08,770 OK, so that's essentially a brief insight into the raw-- 773 00:35:08,770 --> 00:35:09,100 yeah? 774 00:35:09,100 --> 00:35:13,550 AUDIENCE: Where did you get the scalable processor numbers? 775 00:35:13,550 --> 00:35:17,500 PROFESSOR RABBAH: So these are from Michael Taylor's thesis.
776 00:35:17,500 --> 00:35:21,790 So I believe what he's done here is just looked at some 777 00:35:21,790 --> 00:35:24,520 existing microprocessor and essentially benchmarked 778 00:35:24,520 --> 00:35:27,050 communication latency from one processor to another. 779 00:35:27,050 --> 00:35:30,696 AUDIENCE: So this is like going through the cache on the 780 00:35:30,696 --> 00:35:30,830 [OBSCURED]? 781 00:35:30,830 --> 00:35:32,010 PROFESSOR RABBAH: That's in fact how you-- 782 00:35:32,010 --> 00:35:34,310 a lot of these multiprocessors today have shared caches, 783 00:35:34,310 --> 00:35:37,770 either L1, and more so now it's L2. 784 00:35:37,770 --> 00:35:38,300 So if you have-- 785 00:35:38,300 --> 00:35:40,640 L1s are dedicated to different processors. 786 00:35:40,640 --> 00:35:41,890 But you still have to go through memory to communicate. 787 00:35:45,750 --> 00:35:48,360 So the raw parallelizing compiler-- yeah? 788 00:35:48,360 --> 00:35:50,310 Another question? 789 00:35:50,310 --> 00:35:52,540 AUDIENCE: You might want to postpone this question. 790 00:35:52,540 --> 00:35:57,950 Two related questions: so raw has-- 791 00:35:57,950 --> 00:36:00,500 I guess raw has pretty well optimized nearest neighbor 792 00:36:00,500 --> 00:36:02,170 communication. 793 00:36:02,170 --> 00:36:08,074 But we know from, for example, Rent's Rule, a heuristic in 794 00:36:08,074 --> 00:36:11,910 electrical engineering, about the number of wires needed for 795 00:36:11,910 --> 00:36:12,652 a given area. 796 00:36:12,652 --> 00:36:14,190 Is that in between-- 797 00:36:14,190 --> 00:36:21,050 as I recall, the minimum for a good sized circuit is 798 00:36:21,050 --> 00:36:24,592 proportional to the perimeter, or roughly the 799 00:36:24,592 --> 00:36:27,480 square root of the area. 800 00:36:27,480 --> 00:36:33,070 And it ranges from there to-- not proportional to the area. 801 00:36:33,070 --> 00:36:34,955 There's something in between.
802 00:36:34,955 --> 00:36:36,790 Something with 3 in it. 803 00:36:36,790 --> 00:36:40,180 Like to the 3/2 power I think, perhaps. 804 00:36:40,180 --> 00:36:41,710 No, something like 2/3rds, something like-- 805 00:36:41,710 --> 00:36:42,910 yeah, 2/3rds power. 806 00:36:42,910 --> 00:36:47,070 So the area to the 1/2 power or the area to the 2/3rds power. 807 00:36:47,070 --> 00:36:51,380 So Rent's Rule says the number of wires you need is roughly 808 00:36:51,380 --> 00:36:52,770 in that range. 809 00:36:52,770 --> 00:36:55,860 And so that sort of pushes that-- 810 00:36:55,860 --> 00:36:58,650 so the minimum you need is nearest-neighbor communication. 811 00:36:58,650 --> 00:37:01,990 And often you need more than that. 812 00:37:01,990 --> 00:37:06,470 We know from the FPGA experience that nearest 813 00:37:06,470 --> 00:37:09,470 neighbor communication is not-- 814 00:37:09,470 --> 00:37:11,010 or, at least, it's good to have more than nearest 815 00:37:11,010 --> 00:37:13,930 neighbor, and that often long wires that run across the 816 00:37:13,930 --> 00:37:15,610 chip are extremely-- 817 00:37:15,610 --> 00:37:16,600 PROFESSOR RABBAH: So I'm going to actually show you an 818 00:37:16,600 --> 00:37:20,280 example where nearest neighbor is good but you might also 819 00:37:20,280 --> 00:37:23,130 want some global mechanism for control 820 00:37:23,130 --> 00:37:25,490 orchestration, for example. 821 00:37:25,490 --> 00:37:28,470 AUDIENCE: Not just for con-- not just for control 822 00:37:28,470 --> 00:37:31,810 but for broadcast, for arbitrary data for the computation 823 00:37:31,810 --> 00:37:35,030 to use, not just for the chip to use.
824 00:37:35,030 --> 00:37:38,970 Like why are you scaling out two hops, four hops, fewer and 825 00:37:38,970 --> 00:37:39,450 fewer wires-- 826 00:37:39,450 --> 00:37:42,110 PROFESSOR RABBAH: Yes, in fact what I think is going to 827 00:37:42,110 --> 00:37:44,280 happen is a lot of these chip designs are going to be 828 00:37:44,280 --> 00:37:45,090 hierarchical. 829 00:37:45,090 --> 00:37:49,570 You have some really global type communication at the 830 00:37:49,570 --> 00:37:50,300 highest level. 831 00:37:50,300 --> 00:37:53,140 And then as you get within each one of the processors, 832 00:37:53,140 --> 00:37:55,610 then you see things at the lowest level, something that 833 00:37:55,610 --> 00:37:56,070 looks like raw. 834 00:37:56,070 --> 00:37:58,690 So you can build sort of a hierarchy of communication 835 00:37:58,690 --> 00:38:02,590 stages that allows you to sort of solve that problem. 836 00:38:02,590 --> 00:38:04,110 But all of that adds complexity, right? 837 00:38:04,110 --> 00:38:05,540 First you have to solve the problem of how do you 838 00:38:05,540 --> 00:38:09,120 parallelize for just a fixed number of cores and then 839 00:38:09,120 --> 00:38:10,470 figure out the communications. 840 00:38:10,470 --> 00:38:13,050 Once we understand how to do that well with a nice 841 00:38:13,050 --> 00:38:15,745 programming model then you can build hierarchically on that. 842 00:38:15,745 --> 00:38:17,975 AUDIENCE: On the other hand, it might make the compiler's 843 00:38:17,975 --> 00:38:20,250 job easier because it's not as constrained. 844 00:38:20,250 --> 00:38:21,090 PROFESSOR RABBAH: It might give you a 845 00:38:21,090 --> 00:38:21,570 nice fallback, right? 846 00:38:21,570 --> 00:38:24,915 It might save you in cases where there are things that 847 00:38:24,915 --> 00:38:26,360 are hard to do. 848 00:38:26,360 --> 00:38:29,720 There are some issues in the last two-- 849 00:38:29,720 --> 00:38:33,120 the second to the last three slides.
850 00:38:33,120 --> 00:38:36,862 We'll talk about an example of where that might be the case. 851 00:38:36,862 --> 00:38:40,970 AUDIENCE: Another question which [OBSCURED] 852 00:38:40,970 --> 00:38:45,770 so raw, I guess, being simple and tiled, I guess one of the 853 00:38:45,770 --> 00:38:47,436 selling points I think was that it really cuts down on 854 00:38:47,436 --> 00:38:48,850 the engineering effort. 855 00:38:48,850 --> 00:38:49,580 PROFESSOR RABBAH: Oh, absolutely. 856 00:38:49,580 --> 00:38:54,660 This was done with a million gates in-house for [OBSCURED] 857 00:38:54,660 --> 00:38:58,040 AUDIENCE: So a company like Intel has a ridiculous number 858 00:38:58,040 --> 00:38:58,860 of engineers. 859 00:38:58,860 --> 00:39:01,485 And to get a competitive edge, there's something they want to 860 00:39:01,485 --> 00:39:02,431 apply more engineering to. 861 00:39:02,431 --> 00:39:05,902 And so the question is, where might you apply more 862 00:39:05,902 --> 00:39:07,760 engineering to try to squeeze more-- 863 00:39:07,760 --> 00:39:09,416 PROFESSOR AMARASINGHE: That's the million dollar question 864 00:39:09,416 --> 00:39:11,220 that everybody's looking at. 865 00:39:11,220 --> 00:39:14,188 Because if somehow Intel thought they could add more 866 00:39:14,188 --> 00:39:15,570 and more engineering 867 00:39:15,570 --> 00:39:19,520 and then build this very complex full-scale [OBSCURED] 868 00:39:19,520 --> 00:39:22,200 But separate vessels. 869 00:39:22,200 --> 00:39:26,500 And so I think there's still a lot of things that are wrong. 870 00:39:26,500 --> 00:39:33,090 Meaning it's [OBSCURED] 871 00:39:33,090 --> 00:39:35,100 so at Intel basically they will let you do 872 00:39:35,100 --> 00:39:36,450 something like that.
873 00:39:36,450 --> 00:39:39,650 They will put a lot of engineers doing each of these 874 00:39:39,650 --> 00:39:43,640 components, fine-tuning everything, and they can get a lot more 875 00:39:43,640 --> 00:39:47,140 performance, a lot less power and stuff like that. 876 00:39:47,140 --> 00:39:53,150 So depending on what you want, science is not everything. 877 00:39:53,150 --> 00:39:58,420 There are a lot of other things [OBSCURED] 878 00:39:58,420 --> 00:40:00,820 So while it makes it easier? 879 00:40:00,820 --> 00:40:08,260 [OBSCURED] 880 00:40:08,260 --> 00:40:11,435 And the key thing is, you start with something simple and as 881 00:40:11,435 --> 00:40:14,220 you go on, you can add more and more complexity, 882 00:40:14,220 --> 00:40:18,510 just as there are more things to do. 883 00:40:18,510 --> 00:40:20,680 PROFESSOR RABBAH: Part of the complexity might be going to-- 884 00:40:20,680 --> 00:40:25,860 not making all those [OBSCURED]. 885 00:40:25,860 --> 00:40:30,030 OK, so raw pushes a lot of the complexity into the compiler 886 00:40:30,030 --> 00:40:33,240 in that the compiler now has to do at least two things. 887 00:40:33,240 --> 00:40:35,250 It has to distribute the instructions. 888 00:40:35,250 --> 00:40:37,450 You have a single program and you have to figure out how to 889 00:40:37,450 --> 00:40:39,140 parallelize it across multiple cores. 890 00:40:39,140 --> 00:40:41,900 But not only that, because you have the scalar operand 891 00:40:41,900 --> 00:40:44,480 network, you have to figure out how the different cores 892 00:40:44,480 --> 00:40:45,410 have to talk to each other. 893 00:40:45,410 --> 00:40:47,790 So you have to essentially generate a schedule for the 894 00:40:47,790 --> 00:40:50,400 switch processors as well. 895 00:40:50,400 --> 00:40:52,055 So I'm going to talk a little bit about the 896 00:40:52,055 --> 00:40:53,480 raw parallelizing compiler.
897 00:40:53,480 --> 00:40:55,470 And this is different from the StreamIt parallelizing 898 00:40:55,470 --> 00:40:58,890 compiler, which really takes a different program as 899 00:40:58,890 --> 00:41:01,450 an input, using a different language. 900 00:41:01,450 --> 00:41:04,830 This is work again done here at MIT by Walter Lee, who 901 00:41:04,830 --> 00:41:07,570 graduated two years ago. 902 00:41:07,570 --> 00:41:09,050 We have a sequential program. 903 00:41:09,050 --> 00:41:14,030 You inject it into RawCC, the Raw C compiler, and you get 904 00:41:14,030 --> 00:41:17,070 fine-grained orchestrated parallel execution. 905 00:41:17,070 --> 00:41:20,700 And what the compiler does is worry about data distribution, 906 00:41:20,700 --> 00:41:23,290 just like you have to do on cell in terms of which memory 907 00:41:23,290 --> 00:41:25,270 goes into which local store, 908 00:41:25,270 --> 00:41:27,560 which computation operates on-- 909 00:41:27,560 --> 00:41:29,540 the raw compiler has to worry about which computation 910 00:41:29,540 --> 00:41:32,460 operates on which data element and how you put that data in 911 00:41:32,460 --> 00:41:36,370 the right caches for each of the different tiles. 912 00:41:36,370 --> 00:41:39,400 Instruction distribution: so the way this compiler 913 00:41:39,400 --> 00:41:41,060 essentially gets parallelism is it's going to look at 914 00:41:41,060 --> 00:41:43,270 instruction level parallelism in your application. 915 00:41:43,270 --> 00:41:45,780 And it's going to divide that up among the different cores. 916 00:41:45,780 --> 00:41:48,810 And then the last step is the coordination of communication 917 00:41:48,810 --> 00:41:50,000 and control flow. 918 00:41:50,000 --> 00:41:51,330 So I'm just going to briefly step 919 00:41:51,330 --> 00:41:53,570 through each one of those. 920 00:41:53,570 --> 00:41:56,890 So the data distribution is really essentially trying 921 00:41:56,890 --> 00:41:58,410 to solve the problem of locality.
922 00:41:58,410 --> 00:42:01,350 You have two instructions. 923 00:42:01,350 --> 00:42:04,030 A load into r1 from some address and then 924 00:42:04,030 --> 00:42:05,410 you're adding to r1. 925 00:42:05,410 --> 00:42:06,930 You're incrementing that value. 926 00:42:06,930 --> 00:42:08,970 And you might write it back for later on. 927 00:42:08,970 --> 00:42:11,110 So where would you put these two instructions? 928 00:42:11,110 --> 00:42:15,060 So to exploit the locality, then you want the data-- if 929 00:42:15,060 --> 00:42:18,020 the data is here, then you want these two instructions to 930 00:42:18,020 --> 00:42:19,310 be on this tile. 931 00:42:19,310 --> 00:42:21,755 If the data is here, then you want these two instructions to 932 00:42:21,755 --> 00:42:23,420 be on this tile. 933 00:42:23,420 --> 00:42:25,700 Because it doesn't help you to have the data here and the 934 00:42:25,700 --> 00:42:27,130 instructions here. 935 00:42:27,130 --> 00:42:29,120 Because what do you have to do in that case? 936 00:42:29,120 --> 00:42:31,390 You have to send a message that says, send me this data. 937 00:42:31,390 --> 00:42:34,050 And then you have to wait for it to come in and then you 938 00:42:34,050 --> 00:42:35,020 have to operate on it. 939 00:42:35,020 --> 00:42:37,300 And then maybe you have to write it back. 940 00:42:37,300 --> 00:42:39,220 So the compiler sort of worries about the data 941 00:42:39,220 --> 00:42:40,030 distribution. 942 00:42:40,030 --> 00:42:42,190 It applies some data analysis. 943 00:42:42,190 --> 00:42:45,530 A lot of the things that you saw in Saman's lecture on classic 944 00:42:45,530 --> 00:42:47,020 parallelization technology. 945 00:42:47,020 --> 00:42:49,280 Sort of figure out the interdependencies and then 946 00:42:49,280 --> 00:42:51,770 it can figure out how to split up the data across the 947 00:42:51,770 --> 00:42:52,840 different cores.
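The locality argument above can be made concrete with a toy cost model: run the load/add pair on the tile that holds the data, versus on a remote tile that has to request it over the network. All the latencies here are invented for illustration; they are not Raw's actual cycle counts.

```python
# Toy cost model for instruction/data co-location.
# All cycle counts below are assumed values for illustration only.

ALU_COST = 1     # cycles for the add
CACHE_HIT = 2    # cycles for a local cache access
HOP_COST = 3     # cycles per network hop, one way (assumed)

def local_cost():
    # load (local cache hit) + add, both on the data's home tile
    return CACHE_HIT + ALU_COST

def remote_cost(hops):
    # "send me this data" request plus the reply carrying the value
    # (a round trip over `hops` hops), then the local add;
    # a write-back would add another round trip, omitted here
    return 2 * hops * HOP_COST + CACHE_HIT + ALU_COST

print(local_cost())         # cost with good placement
print(remote_cost(hops=2))  # cost when data lives two hops away
```

Even with these small made-up numbers, the remote case is several times the local one, which is why the compiler tries to place instructions on the tile that caches their operands.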
948 00:42:52,840 --> 00:42:55,683 And there's some other work done by other students in the 949 00:42:55,683 --> 00:42:58,470 group that tried to address this problem. 950 00:42:58,470 --> 00:43:05,020 The instruction distribution is perhaps just as complicated and 951 00:43:05,020 --> 00:43:06,040 interesting. 952 00:43:06,040 --> 00:43:07,980 Here, what's going on is-- let's say 953 00:43:07,980 --> 00:43:09,250 you have a basic block. 954 00:43:09,250 --> 00:43:10,950 So you take your sequential program. 955 00:43:10,950 --> 00:43:14,010 You figure out what are the different basic blocks of 956 00:43:14,010 --> 00:43:17,200 computation that you have, and within the basic block you 957 00:43:17,200 --> 00:43:18,510 have lots of instructions. 958 00:43:18,510 --> 00:43:21,650 So each one of these green boxes is a particular 959 00:43:21,650 --> 00:43:22,560 instruction. 960 00:43:22,560 --> 00:43:25,230 And what you're seeing-- these arrows here that connect the 961 00:43:25,230 --> 00:43:28,090 edges-- are operands that you have to exchange. 962 00:43:28,090 --> 00:43:30,320 So you might have-- 963 00:43:33,190 --> 00:43:33,835 this is an add instruction. 964 00:43:33,835 --> 00:43:35,880 It requires a value coming from here. 965 00:43:35,880 --> 00:43:36,970 Multiply-- 966 00:43:36,970 --> 00:43:39,640 a subtract instruction requires values coming in from 967 00:43:39,640 --> 00:43:40,690 different areas. 968 00:43:40,690 --> 00:43:42,490 So how would you distribute this 969 00:43:42,490 --> 00:43:44,630 across a number of cores-- 970 00:43:44,630 --> 00:43:46,720 or across a number of tiles? 971 00:43:46,720 --> 00:43:50,150 Any ideas here? 972 00:43:50,150 --> 00:43:53,350 So you can look for, for example, some chains that are 973 00:43:53,350 --> 00:43:55,330 not interconnected. 974 00:43:55,330 --> 00:43:57,540 So you can look for clusters that you can use.
975 00:43:57,540 --> 00:44:00,940 And say, OK, well I see no edges here so maybe I can put 976 00:44:00,940 --> 00:44:02,870 this on one tile. 977 00:44:02,870 --> 00:44:05,270 And then maybe I can put some of these instructions on 978 00:44:05,270 --> 00:44:06,440 another tile. 979 00:44:06,440 --> 00:44:09,010 Because sort of the communication flow is local. 980 00:44:09,010 --> 00:44:12,630 So maybe one strategy might be, look for the longest 981 00:44:12,630 --> 00:44:15,000 single chains so you can keep the communication flow. 982 00:44:15,000 --> 00:44:18,630 And then you apply a clustering algorithm and come up with a 983 00:44:18,630 --> 00:44:20,550 number of clusters. 984 00:44:20,550 --> 00:44:22,530 Something like that does happen. 985 00:44:22,530 --> 00:44:26,070 And keep in mind from the lectures where we talked about the 986 00:44:26,070 --> 00:44:27,800 parallelizing compiler, you have to worry about 987 00:44:27,800 --> 00:44:29,550 parallelism versus communication. 988 00:44:29,550 --> 00:44:31,800 So the more you distribute things, the more communication 989 00:44:31,800 --> 00:44:33,240 you have to get right. 990 00:44:33,240 --> 00:44:34,640 So here we're showing-- 991 00:44:34,640 --> 00:44:38,400 what I'm showing is a color mapping from the original 992 00:44:38,400 --> 00:44:41,520 instructions in the basic block to the same instructions, but 993 00:44:41,520 --> 00:44:44,290 now each color essentially represents a different cluster 994 00:44:44,290 --> 00:44:48,900 or essentially code that would map to a different thread. 995 00:44:48,900 --> 00:44:52,270 So blue is one thread, yellow is another, green is another, 996 00:44:52,270 --> 00:44:54,260 red, purple, and so on. 997 00:44:54,260 --> 00:44:56,680 But I have to worry about communication between the 998 00:44:56,680 --> 00:44:58,770 different colors because they're essentially two 999 00:44:58,770 --> 00:44:59,960 different threads.
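One simple way to picture the clustering-then-fusing step is a greedy merge over the dataflow graph: start with one cluster per instruction, then fuse the pairs joined by the heaviest communication edges until only as many clusters remain as there are target tiles. This is a hedged sketch of the general idea, not RawCC's actual algorithm, and the costs are invented:

```python
# Greedy clustering sketch: instructions are nodes, operand transfers are
# weighted edges, and we fuse across the most expensive edges first so
# that heavy communication stays inside one tile.

def cluster(num_instrs, edges, target):
    """edges: list of (a, b, comm_cost). Returns a cluster id per instruction."""
    parent = list(range(num_instrs))  # union-find forest, one set per instruction

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    clusters = num_instrs
    # Visit the heaviest communication edges first.
    for a, b, _cost in sorted(edges, key=lambda e: -e[2]):
        ra, rb = find(a), find(b)
        if ra != rb and clusters > target:
            parent[rb] = ra        # fuse the two clusters
            clusters -= 1
    return [find(i) for i in range(num_instrs)]

# Five instructions; the 1-2 and 3-4 pairs exchange operands heavily,
# so they end up together when we ask for two clusters (tiles).
assignment = cluster(5, [(0, 1, 1), (1, 2, 9), (3, 4, 8), (2, 3, 2)], target=2)
print(assignment)
```

A real compiler would also weigh computation cost and load balance, as the lecture notes next; this sketch only captures the "keep heavy edges local" half of the trade-off.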
1000 00:44:59,960 --> 00:45:02,320 They're going to run on two different processors or two 1001 00:45:02,320 --> 00:45:03,400 different tiles. 1002 00:45:03,400 --> 00:45:08,800 So those arrows that are highlighted in dark black are 1003 00:45:08,800 --> 00:45:09,320 communication edges. 1004 00:45:09,320 --> 00:45:11,860 They have to explicitly send the operands around. 1005 00:45:11,860 --> 00:45:14,310 Right? 1006 00:45:14,310 --> 00:45:16,470 So then I might look at the granularity. 1007 00:45:16,470 --> 00:45:18,260 What is my communication cost? 1008 00:45:18,260 --> 00:45:19,770 What is my computation cost? 1009 00:45:19,770 --> 00:45:21,350 And I want to worry about load balancing. 1010 00:45:21,350 --> 00:45:26,870 As we saw, load balancing can help you better 1011 00:45:26,870 --> 00:45:28,490 make use of your architecture and give you better 1012 00:45:28,490 --> 00:45:30,770 utilization, better throughput. 1013 00:45:30,770 --> 00:45:33,250 So you might essentially say, it doesn't-- it's not 1014 00:45:33,250 --> 00:45:36,650 worthwhile to have these running on a different tile 1015 00:45:36,650 --> 00:45:38,660 because there's a lot of communication going on. 1016 00:45:38,660 --> 00:45:40,290 So maybe I'd want to fuse those together. 1017 00:45:40,290 --> 00:45:43,870 Keep the communication local. 1018 00:45:43,870 --> 00:45:46,940 And essentially eliminate costly communication. 1019 00:45:46,940 --> 00:45:48,680 So there are different heuristics that you can apply. 1020 00:45:48,680 --> 00:45:51,630 You can use that 5-tuple. 1021 00:45:51,630 --> 00:45:54,310 You can use heuristics based on the 5-tuple to determine when 1022 00:45:54,310 --> 00:45:58,510 it's profitable to break things up and when it's not. 1023 00:45:58,510 --> 00:46:01,050 And then you have to worry about placement.
1024 00:46:01,050 --> 00:46:04,010 So you don't quite have this on cell, in that you create 1025 00:46:04,010 --> 00:46:06,230 these SPE threads and they can run on any 1026 00:46:06,230 --> 00:46:08,020 SPE. In the raw compiler, 1027 00:46:08,020 --> 00:46:10,410 you can actually exploit the spatial characteristics of the 1028 00:46:10,410 --> 00:46:14,010 chip and the point-to-point communication network to say, 1029 00:46:14,010 --> 00:46:16,950 I want to put these two threads on tile 1 and tile 2, 1030 00:46:16,950 --> 00:46:19,300 where tile 1 and tile 2 are adjacent to each other. 1031 00:46:19,300 --> 00:46:21,770 Because I have a well-defined communication pattern that I'm 1032 00:46:21,770 --> 00:46:22,640 going to use. 1033 00:46:22,640 --> 00:46:26,350 And map that to the communication network on the chip to get 1034 00:46:26,350 --> 00:46:29,710 really fast, really low latency. 1035 00:46:29,710 --> 00:46:32,210 So you can take each one of these colors, place it on a 1036 00:46:32,210 --> 00:46:33,360 different tile. 1037 00:46:33,360 --> 00:46:36,490 And now you have these wires that are going across these 1038 00:46:36,490 --> 00:46:39,040 tiles which essentially represent communication. 1039 00:46:39,040 --> 00:46:41,570 But now the tile has to worry about, oh, I have to 1040 00:46:41,570 --> 00:46:43,960 essentially send these on fixed routes. 1041 00:46:43,960 --> 00:46:46,450 There's no arbitrary communication mechanism. 1042 00:46:46,450 --> 00:46:50,750 So if there's data going from this tile to this tile, it 1043 00:46:50,750 --> 00:46:52,950 actually has to be routed through a network. 1044 00:46:52,950 --> 00:46:54,830 And that might mean getting routed through somebody 1045 00:46:54,830 --> 00:46:57,630 else's tile. 1046 00:46:57,630 --> 00:47:00,950 So the next stage would be communication coordination.
1047 00:47:00,950 --> 00:47:05,510 You have to figure out which switch you need to go to and 1048 00:47:05,510 --> 00:47:08,210 what you do to get that operand to the right switch, 1049 00:47:08,210 --> 00:47:10,100 which then gets it to the right processor. 1050 00:47:10,100 --> 00:47:12,960 So here, I believe the heuristic is to do dimension 1051 00:47:12,960 --> 00:47:17,700 order routing, so you send along the x-dimension and then 1052 00:47:17,700 --> 00:47:18,860 the y-dimension. 1053 00:47:18,860 --> 00:47:19,650 I might have those reversed. 1054 00:47:19,650 --> 00:47:23,210 I don't know. 1055 00:47:23,210 --> 00:47:25,610 And then finally, now that you've figured out your communication 1056 00:47:25,610 --> 00:47:28,190 patterns and you've figured out your instructions, you do some 1057 00:47:28,190 --> 00:47:29,440 instruction scheduling. 1058 00:47:29,440 --> 00:47:31,360 And what you can do here, because the communication 1059 00:47:31,360 --> 00:47:33,965 patterns are static and you've split up the instructions, 1060 00:47:33,965 --> 00:47:38,110 you know when you need to ship data around and how. 1061 00:47:38,110 --> 00:47:41,010 You can guarantee deadlock freedom by carefully ordering 1062 00:47:41,010 --> 00:47:42,690 your send and receive pairs. 1063 00:47:42,690 --> 00:47:46,370 So what you see here, every time you see an instruction 1064 00:47:46,370 --> 00:47:48,800 that needs to ship an operand around, there's the equivalent 1065 00:47:48,800 --> 00:47:51,590 of a route instruction that has route east, 1066 00:47:51,590 --> 00:47:53,330 west, north, south. 1067 00:47:53,330 --> 00:47:56,940 There's an equivalent route instruction on the other 1068 00:47:56,940 --> 00:47:57,800 processors.
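Dimension-order (XY) routing, as described above, is easy to sketch: move the operand all the way along one dimension first, then along the other. The lecture notes the dimension order might be the reverse; the structure is the same either way. A minimal sketch:

```python
# Dimension-ordered (XY) routing on a tile grid: travel along x first,
# then along y. Deterministic routes like this are what let the compiler
# statically schedule the switch processors.

def xy_route(src, dst):
    """Return the list of (x, y) tiles an operand visits from src to dst."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # first resolve the x dimension
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then resolve the y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

# Routing from tile (0, 0) to tile (2, 1) passes through (1, 0) and (2, 0):
print(xy_route((0, 0), (2, 1)))
```

Because every route turns at most once and all routes are known at compile time, the compiler can order sends and receives to avoid the deadlocks discussed next.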
1069 00:47:57,800 --> 00:48:00,590 And that allows you to essentially analyze code and 1070 00:48:00,590 --> 00:48:04,020 say, OK, I've laid these things out carefully, I've 1071 00:48:04,020 --> 00:48:06,330 orchestrated my send and receive pairs so I can 1072 00:48:06,330 --> 00:48:08,800 guarantee, for example, there are no overlapping routes. 1073 00:48:08,800 --> 00:48:12,540 Or that there are no deadlocks because one is trying to ship to 1074 00:48:12,540 --> 00:48:14,540 the other while the other is also trying to ship, and they 1075 00:48:14,540 --> 00:48:19,000 both block on the shared network link. 1076 00:48:19,000 --> 00:48:20,740 And finally, you have the code representation. 1077 00:48:20,740 --> 00:48:24,050 So this is where you package things up into object files, 1078 00:48:24,050 --> 00:48:26,420 into essentially things like threads. 1079 00:48:26,420 --> 00:48:28,940 And then you can compile them and run them. 1080 00:48:28,940 --> 00:48:32,580 Now the question that was posed earlier is, well, there's 1081 00:48:32,580 --> 00:48:35,290 one thing we haven't talked about and that's branching. 1082 00:48:35,290 --> 00:48:38,700 This is a sequential program, it executes branches. 1083 00:48:38,700 --> 00:48:41,605 And now I have this loop that I've split up across a number 1084 00:48:41,605 --> 00:48:44,990 of tiles, how do I know who's going to do the branch? 1085 00:48:44,990 --> 00:48:47,360 And if one tile is doing the branch, how does it 1086 00:48:47,360 --> 00:48:49,190 communicate with everybody else? 1087 00:48:49,190 --> 00:48:51,735 Or if I'm going to repeat the branch on every tile, does 1088 00:48:51,735 --> 00:48:53,730 that mean I'm redoing too much computation 1089 00:48:53,730 --> 00:48:55,090 on every other tile?
1090 00:48:55,090 --> 00:48:57,960 So control coordination is actually quite an interesting 1091 00:48:57,960 --> 00:49:00,030 aspect of-- 1092 00:49:00,030 --> 00:49:01,800 adds another interesting aspect to the 1093 00:49:01,800 --> 00:49:04,600 parallelization for raw. 1094 00:49:04,600 --> 00:49:07,830 So what you have to do is figure out-- 1095 00:49:07,830 --> 00:49:09,650 there are two different ways you can do it. 1096 00:49:09,650 --> 00:49:14,750 Because you have no mechanism for a global message on raw, 1097 00:49:14,750 --> 00:49:16,940 you can't say, I've taken a branch, everybody go to this 1098 00:49:16,940 --> 00:49:17,970 program counter. 1099 00:49:17,970 --> 00:49:21,690 You essentially have to send either the branch result, so 1100 00:49:21,690 --> 00:49:24,200 one tile can do the comparison, it calculates the 1101 00:49:24,200 --> 00:49:29,490 condition, and then it has to communicate it to each of the 1102 00:49:29,490 --> 00:49:32,200 different branches-- to each of the different tiles. 1103 00:49:32,200 --> 00:49:34,900 Or every tile has to essentially just replicate the 1104 00:49:34,900 --> 00:49:37,040 control and redo the computations. 1105 00:49:37,040 --> 00:49:40,450 So every tile figures out what is the condition, what are the 1106 00:49:40,450 --> 00:49:42,700 conditions for the branch. 1107 00:49:42,700 --> 00:49:45,130 They redundantly do that computation and then they can 1108 00:49:45,130 --> 00:49:47,770 all merge at the same time-- 1109 00:49:47,770 --> 00:49:49,530 at different times. 1110 00:49:49,530 --> 00:49:52,180 So that gives you two ways of doing the branching. 1111 00:49:52,180 --> 00:49:56,720 If each tile's doing its own control flow calculation, then 1112 00:49:56,720 --> 00:49:58,560 they can essentially branch at different times.
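The two control-coordination strategies just described can be compared with a toy cost model: one tile computes the branch condition and ships the result to everyone else, or every tile redundantly recomputes it and no messages are needed. The cycle counts below are invented for illustration; which strategy wins in practice depends on how expensive the condition is and how far its operands would have to travel.

```python
# Toy comparison of the two branch-coordination strategies.
# All cycle counts are assumed values for illustration only.

CONDITION_COST = 4   # cycles to compute the branch condition (assumed)
SEND_COST = 3        # cycles to ship the result to one tile (assumed)

def broadcast_cost(tiles):
    # Strategy 1: one tile computes the condition, then sends the
    # result point-to-point to each of the other N-1 tiles.
    return CONDITION_COST + (tiles - 1) * SEND_COST

def replicate_cost(tiles):
    # Strategy 2: every tile redoes the comparison itself.
    # Total redundant work across tiles, but no messages and no
    # waiting on the network.
    return tiles * CONDITION_COST

for n in (2, 4, 16):
    print(n, broadcast_cost(n), replicate_cost(n))
```

Note the model counts total work; per-tile, replication lets each tile branch as soon as its own copy of the condition is ready, while broadcasting makes everyone synchronize on the sender, which is exactly the trade-off the lecture turns to next.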
1113 00:49:58,560 --> 00:50:00,790 But if they're all going to wait for the result to 1114 00:50:00,790 --> 00:50:02,730 compare, then it essentially gives you points where you 1115 00:50:02,730 --> 00:50:04,320 have to synchronize. 1116 00:50:04,320 --> 00:50:06,510 Everybody's going to wait for the result of the branch. 1117 00:50:06,510 --> 00:50:08,320 But the latency could be different. 1118 00:50:08,320 --> 00:50:10,670 Because if I'm sending the branch condition to one tile 1119 00:50:10,670 --> 00:50:13,390 versus another tile, and if one's closer than the other, 1120 00:50:13,390 --> 00:50:16,390 then the branch that's closer to me-- the tile that's closer 1121 00:50:16,390 --> 00:50:18,360 to me will take that branch earlier in time. 1122 00:50:18,360 --> 00:50:20,850 So you get sort of the effect of global 1123 00:50:20,850 --> 00:50:23,500 asynchronous branching in either case. 1124 00:50:23,500 --> 00:50:27,680 Does that make sense? 1125 00:50:27,680 --> 00:50:31,400 So, in summary, the raw architecture is really a tiled 1126 00:50:31,400 --> 00:50:31,510 microprocessor. 1127 00:50:31,510 --> 00:50:36,340 It incorporates the best elements from superscalars in 1128 00:50:36,340 --> 00:50:39,460 terms of a really low latency communication network between 1129 00:50:39,460 --> 00:50:42,320 tiles, which really cuts down on the communication costs. 1130 00:50:42,320 --> 00:50:45,250 And as we saw, and as probably you've been learning, 1131 00:50:45,250 --> 00:50:47,830 communication is really an expensive part of 1132 00:50:47,830 --> 00:50:52,530 parallelization on existing multicore chips. 1133 00:50:52,530 --> 00:50:55,670 And it also gets the scalability of multicores in 1134 00:50:55,670 --> 00:50:58,920 terms of explicit parallelism, but also gives you implicit 1135 00:50:58,920 --> 00:51:02,060 parallelism because the networks are pipelined and 1136 00:51:02,060 --> 00:51:04,040 they can give you flow control.
1137 00:51:04,040 --> 00:51:06,560 So you're trying to get to the point where you have a tiled 1138 00:51:06,560 --> 00:51:09,650 processor with a scalar operand network that allows you to do 1139 00:51:09,650 --> 00:51:13,420 communication with a very low cost. And it might be the case 1140 00:51:13,420 --> 00:51:16,640 in the future that these chips will especially be-- 1141 00:51:16,640 --> 00:51:18,925 more complex architectures will sit on top of these, so 1142 00:51:18,925 --> 00:51:22,220 you'll use these as fundamental building blocks. 1143 00:51:22,220 --> 00:51:29,425 And there was the 80-core multicore chip from Intel: there 1144 00:51:29,425 --> 00:51:31,430 have been rumors that that might actually be something 1145 00:51:31,430 --> 00:51:34,770 like a graphics processor that has something like a scalar 1146 00:51:34,770 --> 00:51:35,770 operand network, because you could 1147 00:51:35,770 --> 00:51:39,030 communicate with a very fast-- 1148 00:51:39,030 --> 00:51:41,010 with very low latency between tiles. 1149 00:51:41,010 --> 00:51:44,020 And that article, which came out a few months ago, was the 1150 00:51:44,020 --> 00:51:47,610 first time I think that I had seen tiled architectures used 1151 00:51:47,610 --> 00:51:49,950 in literature or in publications. 1152 00:51:49,950 --> 00:51:53,360 So I think you'll see more of these kinds of design patterns 1153 00:51:53,360 --> 00:51:57,420 appear as people scale out to more than 2 cores, 4 cores, 8 1154 00:51:57,420 --> 00:51:59,560 cores and so on, where you could still communicate 1155 00:51:59,560 --> 00:52:01,370 reasonably well with caches. 1156 00:52:01,370 --> 00:52:03,840 And that's all I prepared for today. 1157 00:52:03,840 --> 00:52:05,090 Any other questions? 1158 00:52:08,190 --> 00:52:09,650 And this is a list of people who 1159 00:52:09,650 --> 00:52:11,520 contributed to the raw project. 1160 00:52:11,520 --> 00:52:13,750 A lot of students who were led by Anant and Saman.
1161 00:52:13,750 --> 00:52:17,590 PROFESSOR AMARASINGHE: [OBSCURED] 1162 00:52:17,590 --> 00:52:23,050 view of what happened in our groups and then how it relates 1163 00:52:23,050 --> 00:52:25,070 to what you need. 1164 00:52:25,070 --> 00:52:30,270 But this is trying to take it to a much finer grain. 1165 00:52:30,270 --> 00:52:33,760 Whereas in Cell, of course, the messages have to be large, so 1166 00:52:33,760 --> 00:52:36,098 you can do a lot of coarse-grain stuff. 1167 00:52:36,098 --> 00:52:38,800 But in raw, you try to do much more fine-grain stuff. 1168 00:52:38,800 --> 00:52:40,065 But we're going to talk about it in the next 1169 00:52:40,065 --> 00:52:42,135 lecture on the future. 1170 00:52:42,135 --> 00:52:43,170 [OBSCURED] 1171 00:52:43,170 --> 00:52:46,961 AUDIENCE: [OBSCURED] 1172 00:52:46,961 --> 00:52:49,640 Don't you need long wires for the clock? 1173 00:52:49,640 --> 00:52:51,230 PROFESSOR RABBAH: There's no global clock. 1174 00:52:51,230 --> 00:52:56,617 AUDIENCE: So you have this network that seems to-- 1175 00:52:56,617 --> 00:53:00,326 So the network actually requires handshaking? 1176 00:53:00,326 --> 00:53:00,500 Or-- 1177 00:53:00,500 --> 00:53:04,130 PROFESSOR AMARASINGHE: The way you can do it is, in 1178 00:53:04,130 --> 00:53:09,650 modern processors, [OBSCURED] 1179 00:53:09,650 --> 00:53:12,300 so since there's no long wire, you can actually carry the 1180 00:53:12,300 --> 00:53:14,580 clock with the data. 1181 00:53:14,580 --> 00:53:16,370 So in a globally clocked world, the switching here would happen 1182 00:53:16,370 --> 00:53:20,160 when the switching there happens. 1183 00:53:20,160 --> 00:53:23,490 But since there's no big wire connecting them, that's OK. 1184 00:53:23,490 --> 00:53:27,340 So you can deal with the clock ticking.
1185 00:53:27,340 --> 00:53:29,187 AUDIENCE: So this is not going to 1186 00:53:29,187 --> 00:53:31,128 have clock drift because-- 1187 00:53:31,128 --> 00:53:32,120 PROFESSOR AMARASINGHE: Yeah, that's clock drift. 1188 00:53:32,120 --> 00:53:37,320 One end of the processor's clock isn't happening at the same global 1189 00:53:37,320 --> 00:53:38,570 instant in time as the other end of the processor. 1190 00:53:48,130 --> 00:53:52,950 And since the wires also kind of go in a tree, you can 1191 00:53:52,950 --> 00:53:53,360 deal with that. 1192 00:53:53,360 --> 00:53:55,276 AUDIENCE: Drift meaning ticking at 1193 00:53:55,276 --> 00:53:56,180 different rates, not just-- 1194 00:53:56,180 --> 00:53:58,363 PROFESSOR AMARASINGHE: Yeah, I know. 1195 00:53:58,363 --> 00:53:59,410 Basically I don't think I can go back to it. 1196 00:53:59,410 --> 00:54:00,740 It has a skew. 1197 00:54:00,740 --> 00:54:05,090 There's a clock skew going in between those. 1198 00:54:05,090 --> 00:54:07,486 AUDIENCE: So you don't need synchronizers between the 1199 00:54:07,486 --> 00:54:07,640 different tiles? 1200 00:54:07,640 --> 00:54:08,870 PROFESSOR AMARASINGHE: No, we don't need synchronizers 1201 00:54:08,870 --> 00:54:09,640 because tiles are local. 1202 00:54:09,640 --> 00:54:11,400 The clock tree would bring those tiles-- 1203 00:54:11,400 --> 00:54:14,210 the clock would bring two things that communicate close 1204 00:54:14,210 --> 00:54:17,100 enough together that it fits in the cycle. 1205 00:54:17,100 --> 00:54:20,335 But for example, if you get to two very far away branches of 1206 00:54:20,335 --> 00:54:23,169 the tree and then you try to communicate between them, then you 1207 00:54:23,169 --> 00:54:23,450 have a problem. 1208 00:54:23,450 --> 00:54:27,683 Another thing is, when the tree goes here, if you want to use two 1209 00:54:27,683 --> 00:54:30,170 different branches, it's similar to going down. 1210 00:54:30,170 --> 00:54:31,630 So you can compress the process.
1211 00:54:31,630 --> 00:54:32,810 So there are all these things. 1212 00:54:32,810 --> 00:54:34,060 I mean, modern processors really have to deal with all of this. 1213 00:54:37,310 --> 00:54:40,538 The problem occurs when you try to connect directly from 1214 00:54:40,538 --> 00:54:44,270 the far end of the branch, something that gets clocked 1215 00:54:44,270 --> 00:54:48,265 there, to something that clocks at a very 1216 00:54:48,265 --> 00:54:48,613 early end of the branch. 1217 00:54:48,613 --> 00:54:50,070 If you're trying to connect those two, then the skew might 1218 00:54:50,070 --> 00:54:51,150 be too large. 1219 00:54:51,150 --> 00:54:53,042 Then you can get into clock trouble. 1220 00:54:53,042 --> 00:54:53,770 AUDIENCE: [OBSCURED] 1221 00:54:53,770 --> 00:54:57,283 I was just worried about this local network. 1222 00:54:57,283 --> 00:55:04,224 [OBSCURED] 1223 00:55:04,224 --> 00:55:11,386 AUDIENCE: Another question I had was, in the mesh, obviously 1224 00:55:11,386 --> 00:55:15,897 the processors in the middle have further to go to get to the I/O 1225 00:55:15,897 --> 00:55:18,860 devices or to the main memory. 1226 00:55:18,860 --> 00:55:22,641 What do you see happening as you get to larger and larger 1227 00:55:22,641 --> 00:55:23,144 processors? 1228 00:55:23,144 --> 00:55:25,282 Are they going to just put more and more local memory on 1229 00:55:25,282 --> 00:55:26,858 the tile and [OBSCURED] 1230 00:55:26,858 --> 00:55:30,500 it, or are they going to add extra memory buses on it? 1231 00:55:30,500 --> 00:55:32,225 PROFESSOR RABBAH: It could be a combination of both. 1232 00:55:32,225 --> 00:55:35,950 And it's not just memory, it's I/O devices. 1233 00:55:35,950 --> 00:55:38,370 If you're doing I/O then you might want to be placed at a part 1234 00:55:38,370 --> 00:55:42,165 of the chip that has direct access to an I/O device, or 1235 00:55:42,165 --> 00:55:43,550 very close.
1236 00:55:43,550 --> 00:55:46,600 It also comes up in the case of the communication 1237 00:55:46,600 --> 00:55:47,470 orchestration. 1238 00:55:47,470 --> 00:55:51,680 So if this guy is doing the branch, then you want him 1239 00:55:51,680 --> 00:55:53,270 essentially centrally located. 1240 00:55:53,270 --> 00:55:56,150 So the best pattern for allocating things is 1241 00:55:56,150 --> 00:55:57,020 essentially a cross. 1242 00:55:57,020 --> 00:55:59,670 It's like a plus sign where the branch is in the middle. 1243 00:55:59,670 --> 00:56:02,470 PROFESSOR AMARASINGHE: But that's not [OBSCURED]. 1244 00:56:02,470 --> 00:56:07,420 You can make them uniform by putting everybody equally there. 1245 00:56:07,420 --> 00:56:11,624 And a lot of times people have done that simple model with 1246 00:56:11,624 --> 00:56:16,353 everybody equally there. Or you try to take advantage of 1247 00:56:16,353 --> 00:56:16,670 closeness and stuff like that. 1248 00:56:16,670 --> 00:56:16,950 So you can't have it both ways. 1249 00:56:16,950 --> 00:56:19,730 So anytime you try to make memory [OBSCURED] 1250 00:56:19,730 --> 00:56:24,580 very, very close with fast access, you're doing it by 1251 00:56:24,580 --> 00:56:30,652 basically making the other parts have fewer resources 1252 00:56:30,652 --> 00:56:32,240 and less access. 1253 00:56:32,240 --> 00:56:34,760 On the other hand, there are a lot of people working on 1254 00:56:34,760 --> 00:56:38,690 [INAUDIBLE] 1255 00:56:38,690 --> 00:56:41,655 things that, for example, there's a thing called a free-space 1256 00:56:41,655 --> 00:56:42,990 laser. 1257 00:56:42,990 --> 00:56:45,920 So what that does is you put a mirror on top of the tile, on 1258 00:56:45,920 --> 00:56:48,500 top of the processor. 1259 00:56:48,500 --> 00:56:58,920 And each of these-- you can embed a small LED transmitter 1260 00:56:58,920 --> 00:56:59,490 into the chip.
1261 00:56:59,490 --> 00:57:01,435 So basically if you want to communicate with someone, you 1262 00:57:01,435 --> 00:57:03,740 just bounce that laser on top of that and get it 1263 00:57:03,740 --> 00:57:04,660 to the right guy. 1264 00:57:04,660 --> 00:57:07,100 So there are a lot of exotic things that might be able to 1265 00:57:07,100 --> 00:57:09,150 solve this technological problem. 1266 00:57:09,150 --> 00:57:11,160 But in some cases, the speed of light-- 1267 00:57:11,160 --> 00:57:14,860 I don't think an engineer has figured out how to 1268 00:57:14,860 --> 00:57:15,430 break the speed of light. 1269 00:57:15,430 --> 00:57:17,925 Unless, of course, people go with quantum computing and 1270 00:57:17,925 --> 00:57:18,870 stuff like that. 1271 00:57:18,870 --> 00:57:21,930 So, I mean, the key thing is, you have resources, you have 1272 00:57:21,930 --> 00:57:22,660 certain data and you just have to deal with it. 1273 00:57:22,660 --> 00:57:26,190 Getting nice uniformity has a cost. 1274 00:57:26,190 --> 00:57:27,340 PROFESSOR RABBAH: Yeah, I mean, on the [OBSCURED] 1275 00:57:27,340 --> 00:57:30,650 there are groups here at MIT who are working on optical 1276 00:57:30,650 --> 00:57:32,210 networks in the third dimension. 1277 00:57:32,210 --> 00:57:33,956 So you have a tiled chip plus an optical network in the 1278 00:57:33,956 --> 00:57:35,990 third dimension, which allows you to do things like 1279 00:57:35,990 --> 00:57:38,214 broadcast much more efficiently. 1280 00:57:38,214 --> 00:57:38,752 OK? 1281 00:57:38,752 --> 00:57:40,300 PROFESSOR AMARASINGHE: I guess we'll take a break here and 1282 00:57:40,300 --> 00:57:42,286 take a small, three-minute break and then we can go on to 1283 00:57:42,286 --> 00:57:43,536 the next topic.