1 00:00:00,040 --> 00:00:02,480 The following content is provided under a Creative 2 00:00:02,480 --> 00:00:04,000 Commons license. 3 00:00:04,000 --> 00:00:06,340 Your support will help MIT OpenCourseWare 4 00:00:06,340 --> 00:00:10,710 continue to offer high quality educational resources for free. 5 00:00:10,710 --> 00:00:13,320 To make a donation or view additional materials 6 00:00:13,320 --> 00:00:17,197 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,197 --> 00:00:17,822 at ocw.mit.edu. 8 00:00:22,429 --> 00:00:23,970 JAMES GERACI: My name's James Geraci. 9 00:00:23,970 --> 00:00:28,280 I worked with Sudarshan and John Chu on this project. 10 00:00:28,280 --> 00:00:31,944 And what we did is we took a numerical simulation, 11 00:00:31,944 --> 00:00:33,610 an electrochemical simulation, of a battery 12 00:00:33,610 --> 00:00:38,620 and ported it to the PlayStation 3. 13 00:00:38,620 --> 00:00:42,070 So the objective of this project for this situation 14 00:00:42,070 --> 00:00:45,730 was to port an electrochemical simulation to the PlayStation 3 15 00:00:45,730 --> 00:00:48,480 and try to take advantage of the unique features of the Cell 16 00:00:48,480 --> 00:00:53,660 processor to help speed up the computation of the simulation. 17 00:00:53,660 --> 00:00:56,335 So we're going to talk about why we'd want to do that. 18 00:00:56,335 --> 00:00:57,710 We're going to discuss the model. 19 00:00:57,710 --> 00:01:00,704 We're going to discuss how we solved it numerically, 20 00:01:00,704 --> 00:01:02,370 look at some of our performance results, 21 00:01:02,370 --> 00:01:04,703 and then discuss what we wish we could have implemented. 22 00:01:07,420 --> 00:01:10,700 So the justification is basically the world needs 23 00:01:10,700 --> 00:01:14,070 energy, and oil has been in the news a lot 24 00:01:14,070 --> 00:01:18,950 and hasn't been as reliable a source of energy as we'd like. 
25 00:01:18,950 --> 00:01:20,780 So in the automotive industry, hybrids 26 00:01:20,780 --> 00:01:25,160 seem to be part of the solution to the energy needs. 27 00:01:25,160 --> 00:01:27,180 Now, one of the problems of hybrids is how much 28 00:01:27,180 --> 00:01:28,590 energy is in your battery. 29 00:01:28,590 --> 00:01:29,980 So the more sophisticated battery 30 00:01:29,980 --> 00:01:32,320 model that we can put in a hybrid, the better 31 00:01:32,320 --> 00:01:34,280 off we're going to be. 32 00:01:34,280 --> 00:01:39,180 So a more sophisticated model requires more computation, 33 00:01:39,180 --> 00:01:41,610 so you're looking at having a large superscalar 34 00:01:41,610 --> 00:01:44,020 processor in your car, or you could 35 00:01:44,020 --> 00:01:46,160 see if you could get a lot of small processors 36 00:01:46,160 --> 00:01:47,830 to do the same amount of work. 37 00:01:47,830 --> 00:01:50,300 And the Cell represents the first opportunity 38 00:01:50,300 --> 00:01:53,150 to be able to test that out. 39 00:01:53,150 --> 00:01:56,550 So the simulation that we ran basically 40 00:01:56,550 --> 00:01:59,460 consisted of a lead acid battery cell. 41 00:01:59,460 --> 00:02:01,430 So unfortunately, they used the word "cell," 42 00:02:01,430 --> 00:02:03,210 and we're also using the word "cell." 43 00:02:03,210 --> 00:02:07,250 So it's sort of like the modern day Smurf. 44 00:02:07,250 --> 00:02:09,389 So in a lead-acid battery cell, you 45 00:02:09,389 --> 00:02:11,580 have a lead dioxide electrode, which 46 00:02:11,580 --> 00:02:14,470 consists of a number of parts; a lead electrode, 47 00:02:14,470 --> 00:02:16,220 which also consists of a number of parts; 48 00:02:16,220 --> 00:02:19,450 and an electrolyte, which is 5 molar sulfuric acid, which 49 00:02:19,450 --> 00:02:20,430 consists of H2SO4. 50 00:02:24,530 --> 00:02:27,830 When you discharge the battery, you're 51 00:02:27,830 --> 00:02:29,744 going to close the circuit. 
52 00:02:29,744 --> 00:02:31,410 And basically, what ends up happening is 53 00:02:31,410 --> 00:02:34,244 you have electrons flying on the outer loop. 54 00:02:34,244 --> 00:02:35,660 And between the electrodes, you're 55 00:02:35,660 --> 00:02:38,180 going to have an ionic current. 56 00:02:38,180 --> 00:02:41,900 So you have a complete circuit around your loop. 57 00:02:41,900 --> 00:02:43,970 And this is the effect that the simulation 58 00:02:43,970 --> 00:02:46,910 is trying to simulate. 59 00:02:46,910 --> 00:02:53,832 The simulation works on a very low physical level, 60 00:02:53,832 --> 00:02:55,790 and it simulates the chemical reactions, which 61 00:02:55,790 --> 00:02:57,180 at the lead dioxide electrode, it 62 00:02:57,180 --> 00:03:00,700 converts lead dioxide into lead sulfate during discharge, 63 00:03:00,700 --> 00:03:02,420 reverses it during charge. 64 00:03:02,420 --> 00:03:05,280 During discharge, the lead electrode 65 00:03:05,280 --> 00:03:06,930 turns lead into lead sulfate. 66 00:03:06,930 --> 00:03:09,475 And during recharge, it converts that back. 67 00:03:13,810 --> 00:03:16,760 To implement the model, we took basically and discretized 68 00:03:16,760 --> 00:03:19,330 the system and just kind of cross-sectioned 69 00:03:19,330 --> 00:03:23,864 it-- took small sections and cross-sectioned it. 70 00:03:23,864 --> 00:03:25,155 We're going to skip this slide. 71 00:03:25,155 --> 00:03:26,238 I don't know what that is. 72 00:03:26,238 --> 00:03:28,730 But basically, to simulate the physics, 73 00:03:28,730 --> 00:03:30,820 we have a bunch of nasty, nonlinear coupled 74 00:03:30,820 --> 00:03:32,560 partial differential equations. 75 00:03:32,560 --> 00:03:34,580 This is an example of two of them. 76 00:03:34,580 --> 00:03:36,850 And there are a number of other ones 77 00:03:36,850 --> 00:03:41,270 that are included in the system but not really worth seeing. 78 00:03:41,270 --> 00:03:45,710 Just kind of a feel, you see a diffusion term. 
79 00:03:45,710 --> 00:03:47,710 In the lower equation, you see a diffusion term. 80 00:03:47,710 --> 00:03:49,280 You see a reaction term. 81 00:03:49,280 --> 00:03:53,760 And the upper equation, you see changing of the porosity, 82 00:03:53,760 --> 00:03:56,510 changing the geometry of the system. 83 00:03:56,510 --> 00:03:58,000 This is kind of the output that you 84 00:03:58,000 --> 00:04:01,604 could get from this type of battery simulation. 85 00:04:01,604 --> 00:04:03,020 And what you're looking at here is 86 00:04:03,020 --> 00:04:05,780 you're looking at-- if the left-hand side is 87 00:04:05,780 --> 00:04:10,710 the electrodes, is the top of the battery, 88 00:04:10,710 --> 00:04:13,670 and the height of the battery goes along that way, 89 00:04:13,670 --> 00:04:17,600 you see that as you discharge, the concentration 90 00:04:17,600 --> 00:04:20,360 at the top of the electrodes is much-- the acid is 91 00:04:20,360 --> 00:04:23,410 consumed where you're drawing current out most readily. 92 00:04:23,410 --> 00:04:26,830 So physically, the simulation kind of makes sense, 93 00:04:26,830 --> 00:04:29,860 even if my speech didn't. 94 00:04:29,860 --> 00:04:33,350 The battery model that we used is a two-dimensional model. 95 00:04:33,350 --> 00:04:37,090 It has both a lead dioxide electrode and a lead electrode 96 00:04:37,090 --> 00:04:38,500 and an electrolyte area. 97 00:04:38,500 --> 00:04:41,787 This is an example of just the lead dioxide electrode. 98 00:04:41,787 --> 00:04:43,620 And if we discretize that in two dimensions, 99 00:04:43,620 --> 00:04:46,500 we would get a discretization that 100 00:04:46,500 --> 00:04:48,220 looks like that if we were to use 101 00:04:48,220 --> 00:04:51,469 a single-centered finite-volume method. 102 00:04:51,469 --> 00:04:54,010 But it turns out that we have so many discontinuities in here 103 00:04:54,010 --> 00:04:56,340 that it's much better to use a staggered grid. 
104 00:04:56,340 --> 00:05:02,230 And so we're using actually two different meshes 105 00:05:02,230 --> 00:05:05,270 to simulate this system. 106 00:05:05,270 --> 00:05:07,050 Well, on a single mesh, you would end up 107 00:05:07,050 --> 00:05:10,370 having four corners, four edges and a center. 108 00:05:10,370 --> 00:05:13,070 So you would get nine different types of objects 109 00:05:13,070 --> 00:05:14,650 in one electrode alone. 110 00:05:14,650 --> 00:05:17,890 And you're talking about a rather large and nasty 111 00:05:17,890 --> 00:05:19,337 simulation. 112 00:05:19,337 --> 00:05:20,920 On a small scale, the simulation would 113 00:05:20,920 --> 00:05:24,480 produce a very sparse matrix that looks something like this. 114 00:05:24,480 --> 00:05:26,980 And Sudarshan will now come up and talk 115 00:05:26,980 --> 00:05:31,157 to you about the matrices and how we solved this system. 116 00:05:37,882 --> 00:05:39,923 SUDARSHAN RAGHUNATHAN: [INAUDIBLE] slight context 117 00:05:39,923 --> 00:05:43,332 [INAUDIBLE] speakers. 118 00:05:43,332 --> 00:05:47,715 So basically, like John said, the problem, we are essentially 119 00:05:47,715 --> 00:05:50,340 trying to parallelize the most computationally expensive part 120 00:05:50,340 --> 00:05:52,470 of this whole simulation, this solving 121 00:05:52,470 --> 00:05:57,510 for the Newton iterations during each [? lower ?] step. 122 00:05:57,510 --> 00:06:00,397 And so basically, in an abstract-- I'm sorry. 123 00:06:00,397 --> 00:06:00,980 Any questions? 124 00:06:00,980 --> 00:06:01,560 AUDIENCE: Speak louder. 125 00:06:01,560 --> 00:06:02,298 SUDARSHAN RAGHUNATHAN: Sorry? 126 00:06:02,298 --> 00:06:03,254 AUDIENCE: Louder. 127 00:06:03,254 --> 00:06:03,492 AUDIENCE: [INAUDIBLE]. 128 00:06:03,492 --> 00:06:04,210 The microphone [INAUDIBLE]. 129 00:06:04,210 --> 00:06:05,260 SUDARSHAN RAGHUNATHAN: I'll try, but you've got 130 00:06:05,260 --> 00:06:06,469 to try and listen harder too. 
131 00:06:06,469 --> 00:06:08,051 PROFESSOR: That only works for the TV. 132 00:06:08,051 --> 00:06:09,100 You have to yell at them. 133 00:06:09,100 --> 00:06:10,076 SUDARSHAN RAGHUNATHAN: I got it, yeah. 134 00:06:10,076 --> 00:06:11,254 Is it any better now? 135 00:06:11,254 --> 00:06:13,004 PROFESSOR: No, you've got to yell at them. 136 00:06:13,004 --> 00:06:13,980 SUDARSHAN RAGHUNATHAN: Is it any better now? 137 00:06:13,980 --> 00:06:14,468 AUDIENCE: Yeah. 138 00:06:14,468 --> 00:06:15,160 SUDARSHAN RAGHUNATHAN: OK, cool. 139 00:06:15,160 --> 00:06:16,730 PROFESSOR: Yeah, I can hear. 140 00:06:16,730 --> 00:06:17,855 SUDARSHAN RAGHUNATHAN: So-- 141 00:06:17,855 --> 00:06:18,728 [LAUGHTER] 142 00:06:19,228 --> 00:06:22,070 Yeah, I always like to say I'll try to speak harder. 143 00:06:22,070 --> 00:06:23,611 You should try to listen harder, too. 144 00:06:23,611 --> 00:06:27,510 So we meet somewhere in between. 145 00:06:27,510 --> 00:06:29,360 So basically in an abstract sense, 146 00:06:29,360 --> 00:06:32,680 we are trying to solve a system of sparse linear equations. 147 00:06:32,680 --> 00:06:35,300 And for simplicity, for the first pass, 148 00:06:35,300 --> 00:06:37,510 we treated the matrix as being dense. 149 00:06:37,510 --> 00:06:41,900 So we did the whole Gauss elimination ignoring 150 00:06:41,900 --> 00:06:44,155 all the zeroes of the matrix. 151 00:06:44,155 --> 00:06:48,240 And so the basic idea is that we do Gauss elimination 152 00:06:48,240 --> 00:06:49,400 with partial pivoting. 153 00:06:49,400 --> 00:06:52,240 And it's the forward elimination stage of the Gauss elimination 154 00:06:52,240 --> 00:06:54,390 that is order n cubed and is most expensive. 155 00:06:54,390 --> 00:06:56,390 And that is the thing we've tried to parallelize 156 00:06:56,390 --> 00:06:59,020 among the SPUs. 
157 00:06:59,020 --> 00:07:01,940 And the partial pivoting and the back subs are done on the PPUs 158 00:07:01,940 --> 00:07:04,090 because that turned out to be much faster. 159 00:07:08,660 --> 00:07:12,530 So this is the matrix that we are considering 160 00:07:12,530 --> 00:07:14,500 during each step for the random entries 161 00:07:14,500 --> 00:07:18,922 and for the random right-hand side. 162 00:07:18,922 --> 00:07:22,040 And then so the way Gauss elimination works 163 00:07:22,040 --> 00:07:25,360 is that we start off with one of the rows 164 00:07:25,360 --> 00:07:26,820 that you call the base row, which 165 00:07:26,820 --> 00:07:28,069 is where the pivots come from. 166 00:07:31,260 --> 00:07:33,630 And then we eliminate the variable corresponding 167 00:07:33,630 --> 00:07:36,610 to base row from all the subsequent elimination rows. 168 00:07:36,610 --> 00:07:38,770 And all these eliminations can be done in parallel. 169 00:07:38,770 --> 00:07:40,645 And that's what we have tried to parallelize. 170 00:07:40,645 --> 00:07:45,290 So each SPU grabs hold of the next available elimination row, 171 00:07:45,290 --> 00:07:47,860 eliminates it, and writes a result back 172 00:07:47,860 --> 00:07:49,470 on to the main memory. 173 00:07:49,470 --> 00:07:51,524 So if, for example, if we consider 174 00:07:51,524 --> 00:07:54,810 the situation with the two SPUs, what we do 175 00:07:54,810 --> 00:08:00,000 is that we stream the elimination rows 176 00:08:00,000 --> 00:08:01,140 across all the SPUs. 177 00:08:01,140 --> 00:08:02,834 Each SPU performs the elimination 178 00:08:02,834 --> 00:08:07,810 and writes the result back to main memory, as you can see. 179 00:08:07,810 --> 00:08:12,010 And this happens for every row. 180 00:08:12,010 --> 00:08:15,577 So in this simulation, the first variable has been eliminated. 181 00:08:15,577 --> 00:08:17,910 As you can see, there are zeroes along the first column. 
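[Editor's sketch] The row-by-row elimination described above can be sketched in plain Python. This is a minimal serial illustration, not the team's Cell code: the function name solve_dense and the list-of-lists layout are assumptions made here. The loop over i is the part the talk distributes across the SPUs, since every elimination row is updated independently against the same base row.

```python
def solve_dense(a, b):
    """Solve a x = b by dense Gaussian elimination with partial pivoting.

    Hypothetical helper for illustration; `a` is a list of rows, `b` a list.
    """
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: choose the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda r: abs(m[r][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        base = m[k]  # the base row that supplies the pivot
        # Each elimination row below the base row is independent of the others,
        # so in a parallel version this loop is what gets handed out to workers.
        for i in range(k + 1, n):
            f = m[i][k] / base[k]
            m[i][k] = 0.0
            # Columns left of k are already zero, so only the tail is touched.
            for j in range(k + 1, n + 1):
                m[i][j] -= f * base[j]
        # A parallel version would synchronize (barrier) here before the next k.
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x
```

For example, solve_dense([[2, 1, -1], [-3, -1, 2], [-2, 1, 2]], [8, -11, -3]) solves a small 3-by-3 system whose exact solution is x = 2, y = 3, z = -1.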
182 00:08:17,910 --> 00:08:20,680 And the process is recursive. 183 00:08:20,680 --> 00:08:23,050 After the first variable is completed, 184 00:08:23,050 --> 00:08:26,540 it moves on to the next variable. 185 00:08:26,540 --> 00:08:29,729 And the way the elimination rows are returned, 186 00:08:29,729 --> 00:08:31,395 they're delivered in a cyclical fashion. 187 00:08:31,395 --> 00:08:34,327 So there is no serial bottleneck. 188 00:08:34,327 --> 00:08:35,860 It's all perfectly load balanced. 189 00:08:35,860 --> 00:08:37,000 PROFESSOR: Have you-- 190 00:08:37,000 --> 00:08:37,330 SUDARSHAN RAGHUNATHAN: I'm sorry? 191 00:08:37,330 --> 00:08:38,460 PROFESSOR: --change? 192 00:08:38,460 --> 00:08:38,809 SUDARSHAN RAGHUNATHAN: I'm sorry? 193 00:08:38,809 --> 00:08:40,243 PROFESSOR: If you eliminate something lower, 194 00:08:40,243 --> 00:08:41,677 doesn't the pivot value change? 195 00:08:41,677 --> 00:08:44,469 I'm completely not getting what you are saying. 196 00:08:44,469 --> 00:08:45,510 That's my first question. 197 00:08:45,510 --> 00:08:45,920 SUDARSHAN RAGHUNATHAN: Yes. 198 00:08:45,920 --> 00:08:48,334 PROFESSOR: Second question is, is this sparse or dense? 199 00:08:48,334 --> 00:08:49,750 What's your matrix representation? 200 00:08:49,750 --> 00:08:51,070 SUDARSHAN RAGHUNATHAN: The matrix representation currently 201 00:08:51,070 --> 00:08:52,460 is dense. 202 00:08:52,460 --> 00:08:53,320 PROFESSOR: OK. 203 00:08:53,320 --> 00:08:54,260 SUDARSHAN RAGHUNATHAN: Yeah, so we've-- 204 00:08:54,260 --> 00:08:55,244 PROFESSOR: Doesn't the pivot value 205 00:08:55,244 --> 00:08:56,720 keep changing when you eliminate? 206 00:08:56,720 --> 00:08:59,672 So if it's completely parallel, and doesn't that [INAUDIBLE]? 207 00:09:03,390 --> 00:09:05,920 SUDARSHAN RAGHUNATHAN: So all the subsequent variables 208 00:09:05,920 --> 00:09:10,213 that are eliminated for each variable, 209 00:09:10,213 --> 00:09:12,020 all can be done in parallel. 
210 00:09:12,020 --> 00:09:13,940 But after one variable has been eliminated, 211 00:09:13,940 --> 00:09:17,545 there's a synchronization step where all the SPUs 212 00:09:17,545 --> 00:09:18,770 have to synchronize. 213 00:09:18,770 --> 00:09:23,114 There's a barrier after each row has been eliminated. 214 00:09:23,114 --> 00:09:24,214 PROFESSOR: OK. 215 00:09:24,214 --> 00:09:26,130 SUDARSHAN RAGHUNATHAN: Yeah, so you start it-- 216 00:09:26,130 --> 00:09:27,800 so there's a totally [INAUDIBLE], 217 00:09:27,800 --> 00:09:30,160 if you write the algorithm down. 218 00:09:30,160 --> 00:09:32,808 And then it's the inner two loops that 219 00:09:32,808 --> 00:09:34,058 are totally parallelized with. 220 00:09:34,058 --> 00:09:37,443 And the second loop is what we're trying to parallelize. 221 00:09:37,443 --> 00:09:38,026 PROFESSOR: OK. 222 00:09:38,026 --> 00:09:39,514 SUDARSHAN RAGHUNATHAN: Is that a little clearer? 223 00:09:39,514 --> 00:09:40,014 Yeah. 224 00:09:40,014 --> 00:09:43,482 So this is the basic algorithm. 225 00:09:43,482 --> 00:09:47,946 And I'll let John talk about some of the performance numbers 226 00:09:47,946 --> 00:09:48,938 we've got. 227 00:09:48,938 --> 00:09:53,820 But some of the optimizations that we haven't done 228 00:09:53,820 --> 00:09:56,070 are the use of the [? last three ?] operations, 229 00:09:56,070 --> 00:09:59,855 like streaming multiple rows at a time and use 230 00:09:59,855 --> 00:10:01,316 of [INAUDIBLE] operations. 231 00:10:01,316 --> 00:10:03,751 But I'll let John talk about the performance numbers. 232 00:10:16,340 --> 00:10:18,340 JOHN CHU: Computation of the current elimination 233 00:10:18,340 --> 00:10:22,190 row with the fetching of the next elimination row, 234 00:10:22,190 --> 00:10:24,550 we get roughly a 30% speedup. 
235 00:10:24,550 --> 00:10:28,046 So here's a graph to see how both 236 00:10:28,046 --> 00:10:33,363 the unbuffered and buffered versions scale 237 00:10:33,363 --> 00:10:35,040 with the number of SPUs. 238 00:10:35,040 --> 00:10:39,570 So as you can see, both look roughly pretty linear. 239 00:10:39,570 --> 00:10:44,478 And that makes sense because the-- 240 00:10:54,740 --> 00:11:00,210 So here's another graph of how we 241 00:11:00,210 --> 00:11:03,640 scaled with the size of the matrix that we're working with. 242 00:11:03,640 --> 00:11:05,750 So the smaller the matrix, the higher 243 00:11:05,750 --> 00:11:08,150 the communication/computation ratio. 244 00:11:08,150 --> 00:11:10,570 So the communication latency should be more apparent. 245 00:11:10,570 --> 00:11:18,322 But as you can see, even as you get smaller, 246 00:11:18,322 --> 00:11:22,620 the computation is still pretty linear. 247 00:11:22,620 --> 00:11:23,120 So-- 248 00:11:23,120 --> 00:11:24,614 PROFESSOR: Yeah, it's linear. 249 00:11:24,614 --> 00:11:31,586 But it's not just [INAUDIBLE]. 250 00:11:31,586 --> 00:11:35,072 It's not running at 2x performance improvement. 251 00:11:35,072 --> 00:11:37,562 But [INAUDIBLE] now. 252 00:11:37,562 --> 00:11:41,546 So when you put 16 there, why is it only 30%? 253 00:11:41,546 --> 00:11:44,534 Why is it not leading to 2x performance improvement? 254 00:11:44,534 --> 00:11:46,526 So is it because of underclock? 255 00:11:46,526 --> 00:11:48,518 Because if it's underclocked, your graph 256 00:11:48,518 --> 00:11:50,510 shouldn't have this kind of a pattern. 257 00:11:50,510 --> 00:11:52,998 It should have a more exponential type pattern. 258 00:11:52,998 --> 00:11:53,498 [INAUDIBLE]. 259 00:11:56,189 --> 00:11:57,980 PROFESSOR: Did you understand the question? 260 00:11:57,980 --> 00:11:59,720 JOHN CHU: Yep. 261 00:11:59,720 --> 00:12:03,740 PROFESSOR: So why aren't you getting a 2x speedup? 
262 00:12:03,740 --> 00:12:05,988 I mean, your line is straight, but it's not 263 00:12:05,988 --> 00:12:07,944 increasing proportional to the number of SPUs 264 00:12:07,944 --> 00:12:08,694 you're increasing. 265 00:12:08,694 --> 00:12:10,662 So your efficiency might not be at 100%. 266 00:12:10,662 --> 00:12:12,650 Any idea why? 267 00:12:12,650 --> 00:12:15,904 JAMES GERACI: Yeah, that [? looks ?] horrible. 268 00:12:15,904 --> 00:12:17,320 JOHN CHU: The 2x is only-- this is 269 00:12:17,320 --> 00:12:20,073 [INAUDIBLE] [? laptop ?], so two would only 270 00:12:20,073 --> 00:12:22,006 shift it up and down, right? 271 00:12:22,006 --> 00:12:24,006 PROFESSOR: So it looks like you're not operating 272 00:12:24,006 --> 00:12:26,466 at 100% efficiency, right? 273 00:12:26,466 --> 00:12:27,841 JAMES GERACI: Oh, definitely not. 274 00:12:27,841 --> 00:12:28,280 JOHN CHU: Yeah. 275 00:12:28,280 --> 00:12:29,199 PROFESSOR: So I think to-- 276 00:12:29,199 --> 00:12:31,324 PROFESSOR: [INAUDIBLE] conversation going on there. 277 00:12:31,324 --> 00:12:33,690 But this is a strange type of a graph 278 00:12:33,690 --> 00:12:40,675 because if [INAUDIBLE] as you get a very nice linear 279 00:12:40,675 --> 00:12:41,175 [INAUDIBLE]. 280 00:12:41,175 --> 00:12:44,418 If you have [? undersize ?] coordinate, what you get 281 00:12:44,418 --> 00:12:46,704 is you have a good speed at very beginning, 282 00:12:46,704 --> 00:12:47,662 and then you slow down. 283 00:12:47,662 --> 00:12:50,157 You get a [INAUDIBLE] type of a graph. 284 00:12:50,157 --> 00:12:52,402 So it's very much-- I haven't seen things 285 00:12:52,402 --> 00:12:58,803 like a strange line, but not as [INAUDIBLE] on your graph. 286 00:12:58,803 --> 00:13:00,636 This is kind of a strange thing. [INAUDIBLE] 287 00:13:00,636 --> 00:13:03,630 to see where and why does it happen. 288 00:13:03,630 --> 00:13:05,130 JOHN CHU: OK. 289 00:13:05,130 --> 00:13:08,720 Well, yeah, so I'll just move on. 
290 00:13:08,720 --> 00:13:10,780 So I'll talk about some of the further work 291 00:13:10,780 --> 00:13:12,150 we're trying to do. 292 00:13:12,150 --> 00:13:12,816 PROFESSOR: John? 293 00:13:12,816 --> 00:13:13,441 JOHN CHU: Yeah? 294 00:13:13,441 --> 00:13:19,440 PROFESSOR: Can you clip the mic to your-- yeah, that'll work. 295 00:13:19,440 --> 00:13:23,030 JOHN CHU: So we've also tried to triple buffer 296 00:13:23,030 --> 00:13:27,900 by pipelining the fetching of the next row 297 00:13:27,900 --> 00:13:29,650 with the computation of the current row 298 00:13:29,650 --> 00:13:33,332 with the sending of the previous row. 299 00:13:33,332 --> 00:13:34,790 We did this, but it didn't actually 300 00:13:34,790 --> 00:13:39,136 give us the speedup that we thought we would get. 301 00:13:39,136 --> 00:13:42,850 But what we're working on now is SIMDizing the computation 302 00:13:42,850 --> 00:13:43,660 stage. 303 00:13:43,660 --> 00:13:49,010 So right now, we treat all the doubles in the rows as scalars. 304 00:13:49,010 --> 00:13:53,790 But by SIMDizing, hopefully we can get a speedup 305 00:13:53,790 --> 00:13:55,690 by a factor of two. 306 00:13:55,690 --> 00:13:58,390 And also right now, our LU algorithm 307 00:13:58,390 --> 00:14:01,880 isn't very smart in that when we fetch the elimination 308 00:14:01,880 --> 00:14:04,990 row, as we go on through the algorithm, 309 00:14:04,990 --> 00:14:08,750 part of the elimination row becomes zeroes as we move down. 310 00:14:08,750 --> 00:14:13,860 So we don't really need to fetch the entire elimination row. 311 00:14:13,860 --> 00:14:18,210 And so by taking that into account, 312 00:14:18,210 --> 00:14:20,110 our DMA transfers are smaller. 313 00:14:20,110 --> 00:14:22,873 And maybe that will help speed up things a bit. 314 00:14:22,873 --> 00:14:25,206 PROFESSOR: So is the whole thing using double precision? 315 00:14:25,206 --> 00:14:25,789 JOHN CHU: Yes. 316 00:14:25,789 --> 00:14:28,182 It's in double precision. 
317 00:14:28,182 --> 00:14:30,662 AUDIENCE: Do you have any megaflops you're getting? 318 00:14:30,662 --> 00:14:32,150 JAMES GERACI: We have gigaflops. 319 00:14:32,150 --> 00:14:34,630 I think we're getting around 2 gigaflops. 320 00:14:34,630 --> 00:14:36,614 JOHN CHU: So here's the gigaflops graph. 321 00:14:39,590 --> 00:14:43,558 AUDIENCE: Do you know what the expected for [INAUDIBLE]? 322 00:14:43,558 --> 00:14:46,034 JAMES GERACI: [INAUDIBLE] algorithm bias, just 323 00:14:46,034 --> 00:14:46,534 [INAUDIBLE]. 324 00:14:50,530 --> 00:14:52,647 AUDIENCE: This is a linear scale on the y-axis? 325 00:14:52,647 --> 00:14:54,188 JAMES GERACI: Yeah, it's [INAUDIBLE]. 326 00:14:57,632 --> 00:15:01,076 AUDIENCE: Any idea why you have this [INAUDIBLE] at 3 SPUs? 327 00:15:01,076 --> 00:15:05,504 Your double buffering result just drops. 328 00:15:08,948 --> 00:15:10,916 JAMES GERACI: Actually, we have no idea. 329 00:15:10,916 --> 00:15:12,392 [INAUDIBLE] 330 00:15:12,392 --> 00:15:15,836 I don't know if we consistently checked [INAUDIBLE]. 331 00:15:15,836 --> 00:15:19,100 AUDIENCE: You might want to run it a few more times. 332 00:15:19,100 --> 00:15:22,340 There is a recent series of papers 333 00:15:22,340 --> 00:15:25,620 by Jack Dongarra's group aimed at 334 00:15:25,620 --> 00:15:28,392 double-precision computations. 335 00:15:28,392 --> 00:15:32,874 They used single precision, did an approximate guess, 336 00:15:32,874 --> 00:15:36,360 and then did a few iterations of an iterative approach 337 00:15:36,360 --> 00:15:38,352 to get a double-precision result. 338 00:15:38,352 --> 00:15:42,834 JAMES GERACI: Yeah, those don't work very well for [INAUDIBLE]. 339 00:15:42,834 --> 00:15:45,324 And this is conditioned around 10 to the 8. 340 00:15:45,324 --> 00:15:50,304 So that's kind of out of the ballpark of the [INAUDIBLE]. 
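[Editor's sketch] A rough worked check of the conditioning remark above, offered as an assumption rather than something stated in the talk. Mixed-precision iterative refinement factors the matrix in single precision and corrects the answer with residuals computed in double precision. The symbols $A$, $b$, $x_k$, $r_k$, $d_k$, $\kappa(A)$ and $u_s$ (single-precision unit roundoff) are introduced here for illustration only.

```latex
\begin{aligned}
r_k &= b - A x_k && \text{(residual, double precision)} \\
A\, d_k &= r_k && \text{(solved with the single-precision factors)} \\
x_{k+1} &= x_k + d_k && \text{(update, double precision)}
\end{aligned}
\qquad
\text{usual rule of thumb for convergence: } \kappa(A)\, u_s < 1 .
\qquad
\text{Here: } \kappa(A) \approx 10^{8},\quad u_s = 2^{-24} \approx 6 \times 10^{-8}
\;\Longrightarrow\; \kappa(A)\, u_s \approx 6 > 1 .
```

Under that rule of thumb, a condition number near 10 to the 8 sits outside the range where the single-precision correction step can be relied on to converge, which is consistent with the speaker's answer.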
341 00:15:50,304 --> 00:15:53,790 SUDARSHAN RAGHUNATHAN: And the particle [INAUDIBLE] 342 00:15:53,790 --> 00:15:56,778 that they do, they tried to do [INAUDIBLE] computation. 343 00:15:56,778 --> 00:15:59,266 And the main reason they do [INAUDIBLE] because they say 344 00:15:59,266 --> 00:15:59,766 [INAUDIBLE]. 345 00:16:06,240 --> 00:16:09,380 And then in doing this, the hope is 346 00:16:09,380 --> 00:16:12,430 to generalize this to sparse matrices later on. 347 00:16:12,430 --> 00:16:16,690 And I think they are trying to do 348 00:16:16,690 --> 00:16:19,611 some work with sparse matrices, but they haven't gotten there. 349 00:16:22,374 --> 00:16:23,790 AUDIENCE: One additional question. 350 00:16:23,790 --> 00:16:26,890 Just at a high level, is this the right way 351 00:16:26,890 --> 00:16:28,460 to tackle this particular problem? 352 00:16:28,460 --> 00:16:32,660 It strikes me that there must be a closed-form high-level 353 00:16:32,660 --> 00:16:37,480 macroscopic model for battery self-discharge. 354 00:16:37,480 --> 00:16:39,830 And I don't know anything about the chemistry of lead 355 00:16:39,830 --> 00:16:42,520 acid batteries, but the little write-up on the website 356 00:16:42,520 --> 00:16:45,080 mentions Black-Scholes, which I do know something about. 357 00:16:45,080 --> 00:16:48,346 And of course, there are efficient ways of doing that. 358 00:16:48,346 --> 00:16:50,290 JAMES GERACI: Sure, there are efficient ways 359 00:16:50,290 --> 00:16:51,262 of doing Black-Scholes. 360 00:16:51,262 --> 00:16:52,887 There are different ways of doing this. 361 00:16:52,887 --> 00:16:55,249 This is only one of the equations that's 362 00:16:55,249 --> 00:16:56,290 similar to Black-Scholes. 363 00:16:56,290 --> 00:16:58,100 There are multiple equations that go into this. 364 00:16:58,100 --> 00:16:58,933 They're all coupled. 365 00:16:58,933 --> 00:17:00,000 They're all nonlinear. 
366 00:17:00,000 --> 00:17:03,120 So solving this system of equations 367 00:17:03,120 --> 00:17:06,839 is, in the literature, if you were going to choose 368 00:17:06,839 --> 00:17:07,994 this as your model, is-- 369 00:17:07,994 --> 00:17:09,410 AUDIENCE: But if you were actually 370 00:17:09,410 --> 00:17:13,907 in the business of building and operating batteries-- 371 00:17:13,907 --> 00:17:15,240 JAMES GERACI: The problem with-- 372 00:17:15,240 --> 00:17:17,710 AUDIENCE: --wouldn't you try and come up with an efficient 373 00:17:17,710 --> 00:17:19,397 macroscopic model rather than-- 374 00:17:19,397 --> 00:17:21,980 JAMES GERACI: Well, it depends on what your compute costs are. 375 00:17:21,980 --> 00:17:24,791 I mean, if you're going to run a model on a superscalar 376 00:17:24,791 --> 00:17:26,540 processor, that's a very expensive processor. 377 00:17:26,540 --> 00:17:28,081 If, however, now you have the ability 378 00:17:28,081 --> 00:17:31,070 to use a collection of small, cheap little processors 379 00:17:31,070 --> 00:17:32,790 like this, you can maybe have the luxury 380 00:17:32,790 --> 00:17:34,602 of using a much more sophisticated model. 381 00:17:34,602 --> 00:17:36,560 AUDIENCE: Is it necessarily more sophisticated? 382 00:17:36,560 --> 00:17:38,380 I mean, is there some fundamental reason 383 00:17:38,380 --> 00:17:41,774 why you'd get more information out of this approach than 384 00:17:41,774 --> 00:17:43,864 out of a higher level model? 385 00:17:43,864 --> 00:17:46,030 JAMES GERACI: You get a lot more detail as to what's 386 00:17:46,030 --> 00:17:47,050 going on within your battery. 387 00:17:47,050 --> 00:17:49,300 A higher level model will give you I/O information. 388 00:17:49,300 --> 00:17:51,424 But if you're trying to find failure mechanisms, 389 00:17:51,424 --> 00:17:53,090 you may want to know how much pressure's 390 00:17:53,090 --> 00:17:54,570 going on at certain points in your battery. 
391 00:17:54,570 --> 00:17:56,270 Especially with lithium ion batteries, 392 00:17:56,270 --> 00:17:58,400 when you-- this differs from a lithium ion 393 00:17:58,400 --> 00:18:01,035 battery in that we're actually changing the state, the type 394 00:18:01,035 --> 00:18:02,160 of matter that we're using. 395 00:18:02,160 --> 00:18:03,534 And in a lithium ion battery, we're 396 00:18:03,534 --> 00:18:04,960 intercalating and deintercalating. 397 00:18:04,960 --> 00:18:06,720 And so what we're doing is we're creating a lot of stress 398 00:18:06,720 --> 00:18:08,240 on a matrix, a crystal. 399 00:18:08,240 --> 00:18:10,210 And that will be a failure mechanism. 400 00:18:10,210 --> 00:18:12,790 A simple input/output relationship 401 00:18:12,790 --> 00:18:14,659 will not give you that failure mechanism. 402 00:18:14,659 --> 00:18:16,700 So you may want to know that kind of information. 403 00:18:16,700 --> 00:18:18,820 You may want to know heat information also. 404 00:18:18,820 --> 00:18:21,490 You could include thermodynamical information 405 00:18:21,490 --> 00:18:22,430 to this kind of model. 406 00:18:22,430 --> 00:18:24,220 AUDIENCE: OK, that's a good answer. 407 00:18:24,220 --> 00:18:27,990 I take that to mean that the reason for doing this 408 00:18:27,990 --> 00:18:32,120 is if you're researching the physics of battery 409 00:18:32,120 --> 00:18:38,305 discharge as opposed to trying to estimate how much power's 410 00:18:38,305 --> 00:18:39,180 left in your battery. 411 00:18:39,180 --> 00:18:41,485 JAMES GERACI: Well, if you have the computation power, 412 00:18:41,485 --> 00:18:42,860 you could use the same model that 413 00:18:42,860 --> 00:18:44,650 does the research in the car. 414 00:18:44,650 --> 00:18:48,114 And so if you-- yeah, if you can use small processors. 415 00:18:48,114 --> 00:18:50,260 So our demonstration-- if you guys want to see 416 00:18:50,260 --> 00:18:52,330 printf statements. 
417 00:18:52,330 --> 00:18:52,830 Otherwise-- 418 00:18:52,830 --> 00:18:54,318 PROFESSOR: No, I think we're over time. 419 00:18:54,318 --> 00:18:55,109 JAMES GERACI: Yeah. 420 00:18:55,109 --> 00:18:56,480 PROFESSOR: Any last questions? 421 00:18:56,480 --> 00:18:57,170 AUDIENCE: Wait, just so I understand 422 00:18:57,170 --> 00:18:59,040 the sparse versus dense issue, if you just 423 00:18:59,040 --> 00:19:01,770 do a sparse elimination on a uniprocessor, 424 00:19:01,770 --> 00:19:03,070 do you get a bigger speedup? 425 00:19:03,070 --> 00:19:04,085 JAMES GERACI: Oh, you get a much bigger speedup. 426 00:19:04,085 --> 00:19:04,640 AUDIENCE: OK. 427 00:19:04,640 --> 00:19:07,899 JAMES GERACI: The LU is about the slowest thing 428 00:19:07,899 --> 00:19:09,690 you could possibly choose, but if it works. 429 00:19:09,690 --> 00:19:10,620 AUDIENCE: OK, but what you have would still 430 00:19:10,620 --> 00:19:12,210 be usable for dense matrices? 431 00:19:12,210 --> 00:19:13,565 JAMES GERACI: It'd still be useful for dense matrices, 432 00:19:13,565 --> 00:19:13,630 yeah. 433 00:19:13,630 --> 00:19:15,588 AUDIENCE: And do you compare it at all to just PPU 434 00:19:15,588 --> 00:19:17,472 performance on a dense matrix? 435 00:19:17,472 --> 00:19:19,555 JAMES GERACI: No, we didn't compare it to the PPU. 436 00:19:19,555 --> 00:19:23,776 We had it compared-- the same model I'm running on a PC. 437 00:19:23,776 --> 00:19:25,174 AUDIENCE: Yeah. 438 00:19:25,174 --> 00:19:27,340 PROFESSOR: OK, thank you, speakers. 439 00:19:27,340 --> 00:19:30,090 [APPLAUSE]