1
00:00:00,040 --> 00:00:02,480
The following content is
provided under a Creative

2
00:00:02,480 --> 00:00:04,000
Commons license.

3
00:00:04,000 --> 00:00:06,340
Your support will help
MIT OpenCourseWare

4
00:00:06,340 --> 00:00:10,710
continue to offer high quality
educational resources for free.

5
00:00:10,710 --> 00:00:13,320
To make a donation or
view additional materials

6
00:00:13,320 --> 00:00:17,197
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,197 --> 00:00:17,822
at ocw.mit.edu.

8
00:00:22,429 --> 00:00:23,970
JAMES GERACI: My
name's James Geraci.

9
00:00:23,970 --> 00:00:28,280
I worked with Sudarshan and
John Chu on this project.

10
00:00:28,280 --> 00:00:31,944
And what we did is we took
a numerical simulation,

11
00:00:31,944 --> 00:00:33,610
electrochemical
simulation, of a battery

12
00:00:33,610 --> 00:00:38,620
and ported it to
the Playstation 3.

13
00:00:38,620 --> 00:00:42,070
So the objective of this
project for this situation

14
00:00:42,070 --> 00:00:45,730
was to port an electrochemical
simulation to the PlayStation 3

15
00:00:45,730 --> 00:00:48,480
and try to take advantage of
the unique features of the cell

16
00:00:48,480 --> 00:00:53,660
processor to help speed up the
computation of the simulation.

17
00:00:53,660 --> 00:00:56,335
So when we talk to everyone, we
talk why we'd want to do that.

18
00:00:56,335 --> 00:00:57,710
We're going to
discuss the model.

19
00:00:57,710 --> 00:01:00,704
We're going to discuss how
we solved it numerically,

20
00:01:00,704 --> 00:01:02,370
look at some of our
performance results,

21
00:01:02,370 --> 00:01:04,703
and then discuss what we wish
we could have implemented.

22
00:01:07,420 --> 00:01:10,700
So the justification is
basically the world needs

23
00:01:10,700 --> 00:01:14,070
energy, and oil has
been in the news a lot

24
00:01:14,070 --> 00:01:18,950
and hasn't been as a reliable
source of energy as we'd like.

25
00:01:18,950 --> 00:01:20,780
So in the automotive
industry, hybrids

26
00:01:20,780 --> 00:01:25,160
seem to be part of the
solution to the energy needs.

27
00:01:25,160 --> 00:01:27,180
Now, one the problems
of hybrids is how much

28
00:01:27,180 --> 00:01:28,590
energy is in your battery.

29
00:01:28,590 --> 00:01:29,980
So the more
sophisticated battery

30
00:01:29,980 --> 00:01:32,320
model that we can put
in a hybrid, the better

31
00:01:32,320 --> 00:01:34,280
that we're going to be off.

32
00:01:34,280 --> 00:01:39,180
So a more sophisticated model
requires more computation,

33
00:01:39,180 --> 00:01:41,610
so you're looking at
having a large superscalar

34
00:01:41,610 --> 00:01:44,020
processor in your
car, or you could

35
00:01:44,020 --> 00:01:46,160
see if you could get a
lot of small processors

36
00:01:46,160 --> 00:01:47,830
to do the same amount of work.

37
00:01:47,830 --> 00:01:50,300
And the cell represents
the first opportunity

38
00:01:50,300 --> 00:01:53,150
be able to test that out.

39
00:01:53,150 --> 00:01:56,550
So the simulation
that we ran basically

40
00:01:56,550 --> 00:01:59,460
consisted of a lead
acid battery cell.

41
00:01:59,460 --> 00:02:01,430
So unfortunately, they
used the word "cell,"

42
00:02:01,430 --> 00:02:03,210
and we're also using
the word "cell."

43
00:02:03,210 --> 00:02:07,250
So it's sort of like
the modern day Smurf.

44
00:02:07,250 --> 00:02:09,389
So in a lead-acid
battery cell, you

45
00:02:09,389 --> 00:02:11,580
have a lead dioxide
electrode, which

46
00:02:11,580 --> 00:02:14,470
consists of a number of
parts; a lead electrode,

47
00:02:14,470 --> 00:02:16,220
which also consists
of a number of parts;

48
00:02:16,220 --> 00:02:19,450
and an electrolyte, which is
5 molar sulfuric acid, which

49
00:02:19,450 --> 00:02:20,430
consists of H2SO4.

50
00:02:24,530 --> 00:02:27,830
When you discharge
the battery, you're

51
00:02:27,830 --> 00:02:29,744
going to close the circuit.

52
00:02:29,744 --> 00:02:31,410
And basically, what
ends up happening is

53
00:02:31,410 --> 00:02:34,244
you have electrons
flying on the outer loop.

54
00:02:34,244 --> 00:02:35,660
And between the
electrodes, you're

55
00:02:35,660 --> 00:02:38,180
going to have an ionic current.

56
00:02:38,180 --> 00:02:41,900
So you have a complete
circuit around your loop.

57
00:02:41,900 --> 00:02:43,970
And this is the effect
that the simulation

58
00:02:43,970 --> 00:02:46,910
is trying to simulate.

59
00:02:46,910 --> 00:02:53,832
The simulation works on a
very low physical level,

60
00:02:53,832 --> 00:02:55,790
and it stimulates the
chemical reactions, which

61
00:02:55,790 --> 00:02:57,180
at the lead dioxide
electrode, it

62
00:02:57,180 --> 00:03:00,700
converts lead dioxide into
lead sulfate during discharge,

63
00:03:00,700 --> 00:03:02,420
reverses it during charge.

64
00:03:02,420 --> 00:03:05,280
During discharge,
the lead electrode

65
00:03:05,280 --> 00:03:06,930
turns lead into lead sulfate.

66
00:03:06,930 --> 00:03:09,475
And during recharge,
it converts that back.

67
00:03:13,810 --> 00:03:16,760
To implement the model, we
took basically and discretized

68
00:03:16,760 --> 00:03:19,330
the system and just
kind of cross-sectioned

69
00:03:19,330 --> 00:03:23,864
it-- took small sections
and cross-sectioned it.

70
00:03:23,864 --> 00:03:25,155
We're going to skip this slide.

71
00:03:25,155 --> 00:03:26,238
I don't know what that is.

72
00:03:26,238 --> 00:03:28,730
But basically, to
simulate the physics,

73
00:03:28,730 --> 00:03:30,820
we have a bunch of
nasty, nonlinear coupled

74
00:03:30,820 --> 00:03:32,560
partial differential equations.

75
00:03:32,560 --> 00:03:34,580
This is an example
of two of them.

76
00:03:34,580 --> 00:03:36,850
And there are a
number of other ones

77
00:03:36,850 --> 00:03:41,270
that are included in the system
but not really worth seeing.

78
00:03:41,270 --> 00:03:45,710
Just kind of a feel, you
see a diffusion term.

79
00:03:45,710 --> 00:03:47,710
In the lower equation,
you see a diffusion term.

80
00:03:47,710 --> 00:03:49,280
You see a reaction term.

81
00:03:49,280 --> 00:03:53,760
And the upper equation, you
see changing of the porosity,

82
00:03:53,760 --> 00:03:56,510
changing the geometry
of the system.

83
00:03:56,510 --> 00:03:58,000
This is kind of
the output that you

84
00:03:58,000 --> 00:04:01,604
could get from this type
of battery simulation.

85
00:04:01,604 --> 00:04:03,020
And what you're
looking at here is

86
00:04:03,020 --> 00:04:05,780
you're looking at-- if
the left-hand side is

87
00:04:05,780 --> 00:04:10,710
the electrodes, is the
top of the battery,

88
00:04:10,710 --> 00:04:13,670
and the height of the
battery goes along that way,

89
00:04:13,670 --> 00:04:17,600
you see that as you
discharge, the concentration

90
00:04:17,600 --> 00:04:20,360
at the top of the electrodes
is much-- the acid is

91
00:04:20,360 --> 00:04:23,410
consumed where you're drawing
current out most readily.

92
00:04:23,410 --> 00:04:26,830
So physically, the simulation
kind of makes sense,

93
00:04:26,830 --> 00:04:29,860
even if my speech didn't.

94
00:04:29,860 --> 00:04:33,350
The battery model that we used
is a two-dimensional model.

95
00:04:33,350 --> 00:04:37,090
It has both a lead dioxide
electrode and a lead electrode

96
00:04:37,090 --> 00:04:38,500
and an electrolyte area.

97
00:04:38,500 --> 00:04:41,787
This is an example of just
the lead dioxide electrode.

98
00:04:41,787 --> 00:04:43,620
And if we discretize
that in two dimensions,

99
00:04:43,620 --> 00:04:46,500
we would get a
discretization that

100
00:04:46,500 --> 00:04:48,220
looks like that
if we were to use

101
00:04:48,220 --> 00:04:51,469
a single-centered
finite-volume method.

102
00:04:51,469 --> 00:04:54,010
But it turns out that we have
so many discontinuities in here

103
00:04:54,010 --> 00:04:56,340
that it's much better
to use a staggered grid.

104
00:04:56,340 --> 00:05:02,230
And so we're using actually
two different meshes

105
00:05:02,230 --> 00:05:05,270
to simulate this system.

106
00:05:05,270 --> 00:05:07,050
Well, on a single
mesh, you would end up

107
00:05:07,050 --> 00:05:10,370
having four corners,
four edges and a center.

108
00:05:10,370 --> 00:05:13,070
So you would get nine
different types of objects

109
00:05:13,070 --> 00:05:14,650
and one electrode alone.

110
00:05:14,650 --> 00:05:17,890
And you're talking about
a rather large and nasty

111
00:05:17,890 --> 00:05:19,337
simulation.

112
00:05:19,337 --> 00:05:20,920
On a small scale,
the simulation would

113
00:05:20,920 --> 00:05:24,480
produce a very sparse matrix
that looks something like this.

114
00:05:24,480 --> 00:05:26,980
And Sudarshan will
now come up and talk

115
00:05:26,980 --> 00:05:31,157
to you about the matrices and
how we solved this system.

116
00:05:37,882 --> 00:05:39,923
SUDARSHAN RAGHUNATHAN:
[INAUDIBLE] slight context

117
00:05:39,923 --> 00:05:43,332
[INAUDIBLE] speakers.

118
00:05:43,332 --> 00:05:47,715
So basically, like John said,
the problem, we are essentially

119
00:05:47,715 --> 00:05:50,340
trying to parallelize the most
computationally expensive part

120
00:05:50,340 --> 00:05:52,470
of this whole
simulation, this solving

121
00:05:52,470 --> 00:05:57,510
for the Newton iterations
during each [? lower ?] step.

122
00:05:57,510 --> 00:06:00,397
And so basically, in an
abstract-- I'm sorry.

123
00:06:00,397 --> 00:06:00,980
Any questions?

124
00:06:00,980 --> 00:06:01,560
AUDIENCE: Speak louder.

125
00:06:01,560 --> 00:06:02,298
SUDARSHAN RAGHUNATHAN: Sorry?

126
00:06:02,298 --> 00:06:03,254
AUDIENCE: Louder.

127
00:06:03,254 --> 00:06:03,492
AUDIENCE: [INAUDIBLE].

128
00:06:03,492 --> 00:06:04,210
The microphone [INAUDIBLE].

129
00:06:04,210 --> 00:06:05,260
SUDARSHAN RAGHUNATHAN:
I'll try, but you've got

130
00:06:05,260 --> 00:06:06,469
to try and listen harder too.

131
00:06:06,469 --> 00:06:08,051
PROFESSOR: That only
works for the TV.

132
00:06:08,051 --> 00:06:09,100
You have to yell at them.

133
00:06:09,100 --> 00:06:10,076
SUDARSHAN RAGHUNATHAN:
I got it, yeah.

134
00:06:10,076 --> 00:06:11,254
Is it any better now?

135
00:06:11,254 --> 00:06:13,004
PROFESSOR: No, you've
got to yell at them.

136
00:06:13,004 --> 00:06:13,980
SUDARSHAN RAGHUNATHAN:
Is it any better now?

137
00:06:13,980 --> 00:06:14,468
AUDIENCE: Yeah.

138
00:06:14,468 --> 00:06:15,160
SUDARSHAN RAGHUNATHAN: OK, cool.

139
00:06:15,160 --> 00:06:16,730
PROFESSOR: Yeah, I can hear.

140
00:06:16,730 --> 00:06:17,855
SUDARSHAN RAGHUNATHAN: So--

141
00:06:17,855 --> 00:06:18,728
[LAUGHTER]

142
00:06:19,228 --> 00:06:22,070
Yeah, I always like to say
I'll try to speak harder.

143
00:06:22,070 --> 00:06:23,611
You should try to
listen harder, too.

144
00:06:23,611 --> 00:06:27,510
So we meet somewhere in between.

145
00:06:27,510 --> 00:06:29,360
So basically in
an abstract sense,

146
00:06:29,360 --> 00:06:32,680
we are trying to solve a system
of sparse linear equations.

147
00:06:32,680 --> 00:06:35,300
And for simplicity,
for the first pass,

148
00:06:35,300 --> 00:06:37,510
we treated the matrix
as being dense.

149
00:06:37,510 --> 00:06:41,900
So we did the whole Gauss
elimination ignoring

150
00:06:41,900 --> 00:06:44,155
all the zeroes of the matrix.

151
00:06:44,155 --> 00:06:48,240
And so the basic idea is
that we do Gauss elimination

152
00:06:48,240 --> 00:06:49,400
with partial pivoting.

153
00:06:49,400 --> 00:06:52,240
And it's the forward elimination
stage of the Gauss elimination

154
00:06:52,240 --> 00:06:54,390
that [? Rn cubed ?]
that is most expensive.

155
00:06:54,390 --> 00:06:56,390
And that is the thing
we've tried to parallelize

156
00:06:56,390 --> 00:06:59,020
between among the SPUs.

157
00:06:59,020 --> 00:07:01,940
And the partial pivoting and the
back subs are done on the PPUs

158
00:07:01,940 --> 00:07:04,090
because that turned
out to be much faster.

159
00:07:08,660 --> 00:07:12,530
So this is the matrix
that we are considering

160
00:07:12,530 --> 00:07:14,500
during each step for
the random entries

161
00:07:14,500 --> 00:07:18,922
and for the random
right-hand side.

162
00:07:18,922 --> 00:07:22,040
And then so the way
Gauss elimination works

163
00:07:22,040 --> 00:07:25,360
is that we start off
with one of the rows

164
00:07:25,360 --> 00:07:26,820
that you call the
base row, which

165
00:07:26,820 --> 00:07:28,069
is where the pivots come from.

166
00:07:31,260 --> 00:07:33,630
And then we eliminate the
variable corresponding

167
00:07:33,630 --> 00:07:36,610
to base row from all the
subsequent elimination rows.

168
00:07:36,610 --> 00:07:38,770
And all these eliminations
can be done in parallel.

169
00:07:38,770 --> 00:07:40,645
And that's what we have
tried to parallelize.

170
00:07:40,645 --> 00:07:45,290
So each SPU grabs hold of the
next available elimination row,

171
00:07:45,290 --> 00:07:47,860
eliminates it, and
writes a result back

172
00:07:47,860 --> 00:07:49,470
on to the main memory.

173
00:07:49,470 --> 00:07:51,524
So if, for example,
if we consider

174
00:07:51,524 --> 00:07:54,810
the situation with the
two SPUs, what we do

175
00:07:54,810 --> 00:08:00,000
is that we stream
the elimination rows

176
00:08:00,000 --> 00:08:01,140
across all the SPUs.

177
00:08:01,140 --> 00:08:02,834
Each SPU performs
the elimination

178
00:08:02,834 --> 00:08:07,810
and writes the result back to
main memory, as you can see.

179
00:08:07,810 --> 00:08:12,010
And this happens for every row.

180
00:08:12,010 --> 00:08:15,577
So in this simulation, the first
variable has been eliminated.

181
00:08:15,577 --> 00:08:17,910
As you can see, there are
zeroes along the first column.

182
00:08:17,910 --> 00:08:20,680
And the process is recursive.

183
00:08:20,680 --> 00:08:23,050
After the first
variable is completed,

184
00:08:23,050 --> 00:08:26,540
it moves on to
the next variable.

185
00:08:26,540 --> 00:08:29,729
And the way the elimination
rows are returned,

186
00:08:29,729 --> 00:08:31,395
they're delivered in
a cyclical fashion.

187
00:08:31,395 --> 00:08:34,327
So there is no
serial bottleneck.

188
00:08:34,327 --> 00:08:35,860
It's all perfectly
load balanced.

189
00:08:35,860 --> 00:08:37,000
PROFESSOR: Have you--

190
00:08:37,000 --> 00:08:37,330
SUDARSHAN RAGHUNATHAN:
I'm sorry?

191
00:08:37,330 --> 00:08:38,460
PROFESSOR: --change?

192
00:08:38,460 --> 00:08:38,809
SUDARSHAN RAGHUNATHAN:
I'm sorry?

193
00:08:38,809 --> 00:08:40,243
PROFESSOR: If you
eliminate something lower,

194
00:08:40,243 --> 00:08:41,677
doesn't the pivot value change?

195
00:08:41,677 --> 00:08:44,469
I'm completely not getting
what you are saying.

196
00:08:44,469 --> 00:08:45,510
That's my first question.

197
00:08:45,510 --> 00:08:45,920
SUDARSHAN RAGHUNATHAN: Yes.

198
00:08:45,920 --> 00:08:48,334
PROFESSOR: Second question
is, is this sparse or dense?

199
00:08:48,334 --> 00:08:49,750
What's your matrix
representation?

200
00:08:49,750 --> 00:08:51,070
SUDARSHAN RAGHUNATHAN: The
matrix representation currently

201
00:08:51,070 --> 00:08:52,460
is dense.

202
00:08:52,460 --> 00:08:53,320
PROFESSOR: OK.

203
00:08:53,320 --> 00:08:54,260
SUDARSHAN RAGHUNATHAN:
Yeah, so we've--

204
00:08:54,260 --> 00:08:55,244
PROFESSOR: Doesn't
the pivot value

205
00:08:55,244 --> 00:08:56,720
keep changing when
you eliminate?

206
00:08:56,720 --> 00:08:59,672
So if it's completely parallel,
and doesn't that [INAUDIBLE]?

207
00:09:03,390 --> 00:09:05,920
SUDARSHAN RAGHUNATHAN: So
all the subsequent variables

208
00:09:05,920 --> 00:09:10,213
that are eliminated
for each variable,

209
00:09:10,213 --> 00:09:12,020
all can be done in parallel.

210
00:09:12,020 --> 00:09:13,940
But after one variable
has been eliminated,

211
00:09:13,940 --> 00:09:17,545
there's a synchronization
step where all the SPUs

212
00:09:17,545 --> 00:09:18,770
have to synchronize.

213
00:09:18,770 --> 00:09:23,114
There's a barrier after each
row has been eliminated.

214
00:09:23,114 --> 00:09:24,214
PROFESSOR: OK.

215
00:09:24,214 --> 00:09:26,130
SUDARSHAN RAGHUNATHAN:
Yeah, so you start it--

216
00:09:26,130 --> 00:09:27,800
so there's a
totally [INAUDIBLE],

217
00:09:27,800 --> 00:09:30,160
if you write the algorithm down.

218
00:09:30,160 --> 00:09:32,808
And then it's the
inner two loops that

219
00:09:32,808 --> 00:09:34,058
are totally parallelized with.

220
00:09:34,058 --> 00:09:37,443
And the second loop is what
we're trying to parallelize.

221
00:09:37,443 --> 00:09:38,026
PROFESSOR: OK.

222
00:09:38,026 --> 00:09:39,514
SUDARSHAN RAGHUNATHAN:
Is that a little clearer?

223
00:09:39,514 --> 00:09:40,014
Yeah.

224
00:09:40,014 --> 00:09:43,482
So this is the basic algorithm.

225
00:09:43,482 --> 00:09:47,946
And I'll let John talk about
some of the performance numbers

226
00:09:47,946 --> 00:09:48,938
we've got.

227
00:09:48,938 --> 00:09:53,820
But some of the optimizations
that we haven't done

228
00:09:53,820 --> 00:09:56,070
are the use of the
[? last three ?] operations,

229
00:09:56,070 --> 00:09:59,855
like extreme multiple
rows at a time and use

230
00:09:59,855 --> 00:10:01,316
of [INAUDIBLE] operations.

231
00:10:01,316 --> 00:10:03,751
But I'll let John talk about
the performance numbers.

232
00:10:16,340 --> 00:10:18,340
JOHN CHU: Computation of
the current elimination

233
00:10:18,340 --> 00:10:22,190
row with the fetching of
the next elimination row,

234
00:10:22,190 --> 00:10:24,550
we get roughly a 30% speedup.

235
00:10:24,550 --> 00:10:28,046
So here's a graph
to see how both

236
00:10:28,046 --> 00:10:33,363
the unbuffered and buffered
version perform scattered

237
00:10:33,363 --> 00:10:35,040
with a number of SPUs.

238
00:10:35,040 --> 00:10:39,570
So as you can see, both
look roughly pretty linear.

239
00:10:39,570 --> 00:10:44,478
And that makes
sense because the--

240
00:10:54,740 --> 00:11:00,210
So here's another
graph of how we

241
00:11:00,210 --> 00:11:03,640
scaled with the size of the
matrix that we're working with.

242
00:11:03,640 --> 00:11:05,750
So the smaller the
matrix, the higher

243
00:11:05,750 --> 00:11:08,150
the communication/computation
ratio.

244
00:11:08,150 --> 00:11:10,570
So the communication latency
should be more apparent.

245
00:11:10,570 --> 00:11:18,322
But as you can see,
even as you get smaller,

246
00:11:18,322 --> 00:11:22,620
the computations is
still pretty linear.

247
00:11:22,620 --> 00:11:23,120
So--

248
00:11:23,120 --> 00:11:24,614
PROFESSOR: Yeah, it's linear.

249
00:11:24,614 --> 00:11:31,586
But it's not just [INAUDIBLE].

250
00:11:31,586 --> 00:11:35,072
It's not running at 2x
performance improvement.

251
00:11:35,072 --> 00:11:37,562
But [INAUDIBLE] now.

252
00:11:37,562 --> 00:11:41,546
So when you put 16 there,
why is it only 30%?

253
00:11:41,546 --> 00:11:44,534
Why is it not leading to
2x performance improvement?

254
00:11:44,534 --> 00:11:46,526
So is it because of underclock?

255
00:11:46,526 --> 00:11:48,518
Because if it's
underclocked, your graph

256
00:11:48,518 --> 00:11:50,510
shouldn't have this
kind of a pattern.

257
00:11:50,510 --> 00:11:52,998
It should have a more
exponential type pattern.

258
00:11:52,998 --> 00:11:53,498
[INAUDIBLE].

259
00:11:56,189 --> 00:11:57,980
PROFESSOR: Did you
understand the question?

260
00:11:57,980 --> 00:11:59,720
JOHN CHU: Yep.

261
00:11:59,720 --> 00:12:03,740
PROFESSOR: So why aren't
you getting a 2x speedup?

262
00:12:03,740 --> 00:12:05,988
I mean, your line is
straight, but it's not

263
00:12:05,988 --> 00:12:07,944
increasing proportional
to the number of SPUs

264
00:12:07,944 --> 00:12:08,694
you're increasing.

265
00:12:08,694 --> 00:12:10,662
So your efficiency
might not be at 100%.

266
00:12:10,662 --> 00:12:12,650
Any idea why?

267
00:12:12,650 --> 00:12:15,904
JAMES GERACI: Yeah, that
[? looks ?] horrible.

268
00:12:15,904 --> 00:12:17,320
JOHN CHU: The 2x
is only-- this is

269
00:12:17,320 --> 00:12:20,073
[INAUDIBLE] [? laptop ?],
so two would only

270
00:12:20,073 --> 00:12:22,006
shift it up and down, right?

271
00:12:22,006 --> 00:12:24,006
PROFESSOR: So it looks
like you're not operating

272
00:12:24,006 --> 00:12:26,466
at 100% efficiency, right?

273
00:12:26,466 --> 00:12:27,841
JAMES GERACI: Oh,
definitely not.

274
00:12:27,841 --> 00:12:28,280
JOHN CHU: Yeah.

275
00:12:28,280 --> 00:12:29,199
PROFESSOR: So I think to--

276
00:12:29,199 --> 00:12:31,324
PROFESSOR: [INAUDIBLE]
conversation going on there.

277
00:12:31,324 --> 00:12:33,690
But this a strange
type of a graph

278
00:12:33,690 --> 00:12:40,675
because if [INAUDIBLE] as
you get a very nice linear

279
00:12:40,675 --> 00:12:41,175
[INAUDIBLE].

280
00:12:41,175 --> 00:12:44,418
If you have [? undersize ?]
coordinate, what you get

281
00:12:44,418 --> 00:12:46,704
is you have a good
speed at very beginning,

282
00:12:46,704 --> 00:12:47,662
and then you slow down.

283
00:12:47,662 --> 00:12:50,157
You get a [INAUDIBLE]
type of a graph.

284
00:12:50,157 --> 00:12:52,402
So it's very much--
I haven't seen things

285
00:12:52,402 --> 00:12:58,803
like a strange line, but not
as [INAUDIBLE] on your graph.

286
00:12:58,803 --> 00:13:00,636
This is kind of a strange
thing. [INAUDIBLE]

287
00:13:00,636 --> 00:13:03,630
to see where and
why does it happen.

288
00:13:03,630 --> 00:13:05,130
JOHN CHU: OK.

289
00:13:05,130 --> 00:13:08,720
Well, yeah, so
I'll just move on.

290
00:13:08,720 --> 00:13:10,780
So I'll talk about some
of the further work

291
00:13:10,780 --> 00:13:12,150
we're trying to do.

292
00:13:12,150 --> 00:13:12,816
PROFESSOR: John?

293
00:13:12,816 --> 00:13:13,441
JOHN CHU: Yeah?

294
00:13:13,441 --> 00:13:19,440
PROFESSOR: Can you clip the mic
to your-- yeah, that'll work.

295
00:13:19,440 --> 00:13:23,030
JOHN CHU: So we've also
tried to triple buffer

296
00:13:23,030 --> 00:13:27,900
by pipelining the
fetching of the next row

297
00:13:27,900 --> 00:13:29,650
with the computation
of the current row

298
00:13:29,650 --> 00:13:33,332
with the sending of
the previous row.

299
00:13:33,332 --> 00:13:34,790
We did this, but
it didn't actually

300
00:13:34,790 --> 00:13:39,136
give us the speedup that
we thought we would get.

301
00:13:39,136 --> 00:13:42,850
But what we're working on how
is SIMDizing the computation

302
00:13:42,850 --> 00:13:43,660
stage.

303
00:13:43,660 --> 00:13:49,010
So right now, we treat all the
doubles in the rows as scalars.

304
00:13:49,010 --> 00:13:53,790
But by SIMDizing, hopefully
we can get a speedup

305
00:13:53,790 --> 00:13:55,690
by a factor of two.

306
00:13:55,690 --> 00:13:58,390
And also right now,
our LU algorithm

307
00:13:58,390 --> 00:14:01,880
isn't very smart in that
when we fetch the elimination

308
00:14:01,880 --> 00:14:04,990
row, as we go on
through the algorithm,

309
00:14:04,990 --> 00:14:08,750
part of the elimination row
becomes zeroes as we move down.

310
00:14:08,750 --> 00:14:13,860
So we don't really need to fetch
the entire elimination row.

311
00:14:13,860 --> 00:14:18,210
And so by taking
that into account,

312
00:14:18,210 --> 00:14:20,110
our DMA transfers are smaller.

313
00:14:20,110 --> 00:14:22,873
And maybe that will help
speed up things a bit.

314
00:14:22,873 --> 00:14:25,206
PROFESSOR: So is the whole
thing using double precision?

315
00:14:25,206 --> 00:14:25,789
JOHN CHU: Yes.

316
00:14:25,789 --> 00:14:28,182
It's in double precision.

317
00:14:28,182 --> 00:14:30,662
AUDIENCE: Do you have any
megaflops you're getting?

318
00:14:30,662 --> 00:14:32,150
JAMES GERACI: We have gigaflops.

319
00:14:32,150 --> 00:14:34,630
I think we're getting
around 2 gigaflops.

320
00:14:34,630 --> 00:14:36,614
JOHN CHU: So here's
the gigaflops graph.

321
00:14:39,590 --> 00:14:43,558
AUDIENCE: Do you know what
the expected for [INAUDIBLE]?

322
00:14:43,558 --> 00:14:46,034
JAMES GERACI: [INAUDIBLE]
algorithm bias, just

323
00:14:46,034 --> 00:14:46,534
[INAUDIBLE].

324
00:14:50,530 --> 00:14:52,647
AUDIENCE: This is a linear
scalar on the y-axis?

325
00:14:52,647 --> 00:14:54,188
JAMES GERACI: Yeah,
it's [INAUDIBLE].

326
00:14:57,632 --> 00:15:01,076
AUDIENCE: Any idea why you have
this [INAUDIBLE] at 3 SGUs?

327
00:15:01,076 --> 00:15:05,504
Your double buffering
result just drops.

328
00:15:08,948 --> 00:15:10,916
JAMES GERACI: Actually,
we have no idea.

329
00:15:10,916 --> 00:15:12,392
[INAUDIBLE]

330
00:15:12,392 --> 00:15:15,836
I don't know if we consistently
checked [INAUDIBLE].

331
00:15:15,836 --> 00:15:19,100
AUDIENCE: You might want
to run it a few more times.

332
00:15:19,100 --> 00:15:22,340
There is a recent
series of papers

333
00:15:22,340 --> 00:15:25,620
by Jack Dongarra's
group aimed at

334
00:15:25,620 --> 00:15:28,392
double-precision computations.

335
00:15:28,392 --> 00:15:32,874
They used single precision,
did an approximate guess,

336
00:15:32,874 --> 00:15:36,360
and then did a few iterations
of an iterative approach

337
00:15:36,360 --> 00:15:38,352
to get a
double-precision result.

338
00:15:38,352 --> 00:15:42,834
JAMES GERACI: Yeah, those don't
work very well for [INAUDIBLE].

339
00:15:42,834 --> 00:15:45,324
And this is conditioned
around 10 to the 8.

340
00:15:45,324 --> 00:15:50,304
So that's kind of out of the
ballpark of the [INAUDIBLE].

341
00:15:50,304 --> 00:15:53,790
SUDARSHAN RAGHUNATHAN: And
the particle [INAUDIBLE]

342
00:15:53,790 --> 00:15:56,778
that they do, they tried to
do [INAUDIBLE] computation.

343
00:15:56,778 --> 00:15:59,266
And the main reason they do
[INAUDIBLE] because they say

344
00:15:59,266 --> 00:15:59,766
[INAUDIBLE].

345
00:16:06,240 --> 00:16:09,380
And then in doing
this, the hope is

346
00:16:09,380 --> 00:16:12,430
to generalize this to
sparse matrices later on.

347
00:16:12,430 --> 00:16:16,690
And I think they
are trying to do

348
00:16:16,690 --> 00:16:19,611
some work with sparse matrices,
but they haven't gotten there.

349
00:16:22,374 --> 00:16:23,790
AUDIENCE: One
additional question.

350
00:16:23,790 --> 00:16:26,890
Just at a high level,
is this the right way

351
00:16:26,890 --> 00:16:28,460
to tackle this
particular problem?

352
00:16:28,460 --> 00:16:32,660
It strikes me that there must
be a closed-form high-level

353
00:16:32,660 --> 00:16:37,480
microscopic model for
battery self-discharge.

354
00:16:37,480 --> 00:16:39,830
And I don't know anything
about the chemistry lead

355
00:16:39,830 --> 00:16:42,520
acid batteries, but the
little write-up on the website

356
00:16:42,520 --> 00:16:45,080
mentions Black-Scholes, which
I do know something about.

357
00:16:45,080 --> 00:16:48,346
And of course, there are
efficient ways of doing that.

358
00:16:48,346 --> 00:16:50,290
JAMES GERACI: Sure,
there are efficient ways

359
00:16:50,290 --> 00:16:51,262
of doing Black-Scholes.

360
00:16:51,262 --> 00:16:52,887
There are different
ways of doing this.

361
00:16:52,887 --> 00:16:55,249
This is only one of
the equations that's

362
00:16:55,249 --> 00:16:56,290
similar to Black-Scholes.

363
00:16:56,290 --> 00:16:58,100
There are multiple
equations that go into this.

364
00:16:58,100 --> 00:16:58,933
They're all coupled.

365
00:16:58,933 --> 00:17:00,000
They're all nonlinear.

366
00:17:00,000 --> 00:17:03,120
So solving this
system of equations

367
00:17:03,120 --> 00:17:06,839
is, in the literature, if
you were going to choose

368
00:17:06,839 --> 00:17:07,994
this as your model, is--

369
00:17:07,994 --> 00:17:09,410
AUDIENCE: But if
you were actually

370
00:17:09,410 --> 00:17:13,907
in the business of building
and operating batteries--

371
00:17:13,907 --> 00:17:15,240
JAMES GERACI: The problem with--

372
00:17:15,240 --> 00:17:17,710
AUDIENCE: --wouldn't you try
and come up with an efficient

373
00:17:17,710 --> 00:17:19,397
microscopic model rather than--

374
00:17:19,397 --> 00:17:21,980
JAMES GERACI: Well, it depends
on what your compute costs are.

375
00:17:21,980 --> 00:17:24,791
I mean, if you're going to
run a model on a superscalar

376
00:17:24,791 --> 00:17:26,540
processor, that' very
expensive processor.

377
00:17:26,540 --> 00:17:28,081
If, however, now
you have the ability

378
00:17:28,081 --> 00:17:31,070
to use a collection of small,
cheap little processors

379
00:17:31,070 --> 00:17:32,790
like this, you can
maybe have the luxury

380
00:17:32,790 --> 00:17:34,602
of using a much more
sophisticated model.

381
00:17:34,602 --> 00:17:36,560
AUDIENCE: Is it necessarily
more sophisticated?

382
00:17:36,560 --> 00:17:38,380
I mean, is there some
fundamental reason

383
00:17:38,380 --> 00:17:41,774
why you'd get more information
out of this approach than

384
00:17:41,774 --> 00:17:43,864
out of a higher level model?

385
00:17:43,864 --> 00:17:46,030
JAMES GERACI: You get a lot
more detail as to what's

386
00:17:46,030 --> 00:17:47,050
going on within your battery.

387
00:17:47,050 --> 00:17:49,300
A higher level model will
give you I/O information.

388
00:17:49,300 --> 00:17:51,424
But if you're trying to
find failure mechanisms,

389
00:17:51,424 --> 00:17:53,090
you may want to know
how much pressure's

390
00:17:53,090 --> 00:17:54,570
going on at certain
points in your battery.

391
00:17:54,570 --> 00:17:56,270
Especially with
lithium ion batteries,

392
00:17:56,270 --> 00:17:58,400
when you-- this differs
from a lithium ion

393
00:17:58,400 --> 00:18:01,035
battery in that we're actually
changing the state, the type

394
00:18:01,035 --> 00:18:02,160
of matter that we're using.

395
00:18:02,160 --> 00:18:03,534
And in lithium
ion battery, we're

396
00:18:03,534 --> 00:18:04,960
intercalating and
deintercalating.

397
00:18:04,960 --> 00:18:06,720
And so what we're doing is
we're creating a lot of stress

398
00:18:06,720 --> 00:18:08,240
on a matrix, a crystal.

399
00:18:08,240 --> 00:18:10,210
And that will be a
failure mechanism.

400
00:18:10,210 --> 00:18:12,790
A simple input/output
relationship

401
00:18:12,790 --> 00:18:14,659
will not give you that
failure mechanism.

402
00:18:14,659 --> 00:18:16,700
So you may want to know
that kind of information.

403
00:18:16,700 --> 00:18:18,820
You may want to know
heat information also.

404
00:18:18,820 --> 00:18:21,490
You could include
thermodynamical information

405
00:18:21,490 --> 00:18:22,430
to this kind of model.

406
00:18:22,430 --> 00:18:24,220
AUDIENCE: OK, that's
a good answer.

407
00:18:24,220 --> 00:18:27,990
I take that to mean that
the reason for doing this

408
00:18:27,990 --> 00:18:32,120
is if your researching
the physics of battery

409
00:18:32,120 --> 00:18:38,305
discharge as opposed to trying
to estimate how much power's

410
00:18:38,305 --> 00:18:39,180
left in your battery.

411
00:18:39,180 --> 00:18:41,485
JAMES GERACI: Well, if you
have the computation power,

412
00:18:41,485 --> 00:18:42,860
you could use the
same model that

413
00:18:42,860 --> 00:18:44,650
does the research as the car.

414
00:18:44,650 --> 00:18:48,114
And so if you-- yeah, if you
can use small processors.

415
00:18:48,114 --> 00:18:50,260
So our demonstration--
if you guys want to see

416
00:18:50,260 --> 00:18:52,330
print def statements.

417
00:18:52,330 --> 00:18:52,830
Otherwise--

418
00:18:52,830 --> 00:18:54,318
PROFESSOR: No, I
think we're over time.

419
00:18:54,318 --> 00:18:55,109
JAMES GERACI: Yeah.

420
00:18:55,109 --> 00:18:56,480
PROFESSOR: Any last questions?

421
00:18:56,480 --> 00:18:57,170
AUDIENCE: Wait,
just so I understand

422
00:18:57,170 --> 00:18:59,040
the sparse versus dense
issue, if you just

423
00:18:59,040 --> 00:19:01,770
do a sparse elimination
on a uniprocessor,

424
00:19:01,770 --> 00:19:03,070
do you get a bigger speedup?

425
00:19:03,070 --> 00:19:04,085
JAMES GERACI: Oh, you get
a much bigger speedup.

426
00:19:04,085 --> 00:19:04,640
AUDIENCE: OK.

427
00:19:04,640 --> 00:19:07,899
JAMES GERACI: The LU is
about the slowest thing

428
00:19:07,899 --> 00:19:09,690
you could possibly
choose, but if it works.

429
00:19:09,690 --> 00:19:10,620
AUDIENCE: OK, but what
you have would still

430
00:19:10,620 --> 00:19:12,210
usable for dense matrices?

431
00:19:12,210 --> 00:19:13,565
JAMES GERACI: It'd still be
useful for dense matrices,

432
00:19:13,565 --> 00:19:13,630
yeah.

433
00:19:13,630 --> 00:19:15,588
AUDIENCE: And do you
compare it all to just PPU

434
00:19:15,588 --> 00:19:17,472
performance on a dense matrix?

435
00:19:17,472 --> 00:19:19,555
JAMES GERACI: No, we didn't
compare it to the PPU.

436
00:19:19,555 --> 00:19:23,776
We had it compared-- the same
model I'm running on a PC.

437
00:19:23,776 --> 00:19:25,174
AUDIENCE: Yeah.

438
00:19:25,174 --> 00:19:27,340
PROFESSOR: OK,
thank you, speakers.

439
00:19:27,340 --> 00:19:30,090
[APPLAUSE]