1 00:00:02,110 --> 00:00:04,540 The following content is provided under a Creative 2 00:00:04,540 --> 00:00:05,950 Commons license. 3 00:00:05,950 --> 00:00:08,980 Your support will help MIT OpenCourseWare continue to 4 00:00:08,980 --> 00:00:12,640 offer high-quality educational resources for free. 5 00:00:12,640 --> 00:00:15,530 To make a donation or view additional materials from 6 00:00:15,530 --> 00:00:19,460 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:19,460 --> 00:00:20,710 ocw.mit.edu. 8 00:00:32,180 --> 00:00:36,520 SAMAN AMARASINGHE: Today I'm going to talk a little bit 9 00:00:36,520 --> 00:00:37,790 about computer architecture, 10 00:00:37,790 --> 00:00:41,230 and how computer architecture impacts performance 11 00:00:41,230 --> 00:00:42,190 engineering. 12 00:00:42,190 --> 00:00:48,210 So the main part of this is going through a long overview 13 00:00:48,210 --> 00:00:53,230 of the Pentium, the Nehalem architecture that is in the 14 00:00:53,230 --> 00:00:54,620 machines that you guys are using, which 15 00:00:54,620 --> 00:00:57,790 is hot off the press. 16 00:00:57,790 --> 00:01:00,730 Six-core processor that's out there now. 17 00:01:00,730 --> 00:01:05,000 And then I will talk a little bit about profiling a program. 18 00:01:05,000 --> 00:01:08,610 And then the next lecture, the TAs are going to demonstrate 19 00:01:08,610 --> 00:01:11,270 how to use some of those profiling tools that you guys 20 00:01:11,270 --> 00:01:13,990 are supposed to use for project two. 21 00:01:13,990 --> 00:01:17,160 And I guess project two will probably appear, since we 22 00:01:17,160 --> 00:01:19,880 had a delay, so probably be up here 24 delayed. 23 00:01:19,880 --> 00:01:22,020 We'll release project two up there. 24 00:01:22,020 --> 00:01:25,986 And then at the end of the talk, end of the lecture, I'm 25 00:01:25,986 --> 00:01:28,750 going to talk about a little bit of an example. 
26 00:01:28,750 --> 00:01:34,520 Go to some program and show what you can gain by looking 27 00:01:34,520 --> 00:01:38,410 at things like profiling, and what 28 00:01:38,410 --> 00:01:39,360 information you gain. 29 00:01:39,360 --> 00:01:41,980 Just kind of do a high-level view. 30 00:01:41,980 --> 00:01:44,750 So let's start something. 31 00:01:44,750 --> 00:01:52,300 So how many of you have had a chip in your hand? 32 00:01:52,300 --> 00:01:54,300 Microprocessor in your hand? 33 00:01:54,300 --> 00:01:54,590 Some. 34 00:01:54,590 --> 00:01:55,630 There's some people who haven't. 35 00:01:55,630 --> 00:01:58,330 So I brought some show and tell things. 36 00:01:58,330 --> 00:02:02,650 So what I'm going to do is pass them around. 37 00:02:02,650 --> 00:02:04,940 And there are two things that are actually very valuable 38 00:02:04,940 --> 00:02:07,310 that I want to tell you and then make sure that they don't 39 00:02:07,310 --> 00:02:08,110 get damaged. 40 00:02:08,110 --> 00:02:13,270 So this is a Pentium 3 that's already in a big 41 00:02:13,270 --> 00:02:14,770 heat sink in there. 42 00:02:14,770 --> 00:02:18,000 So you can't see that much, but they're really cute and 43 00:02:18,000 --> 00:02:22,490 they created a hologram of the chip and put it there so you 44 00:02:22,490 --> 00:02:28,700 can catch, I guess, people who are trying to use counterfeits 45 00:02:28,700 --> 00:02:29,610 and stuff like that. 46 00:02:29,610 --> 00:02:33,780 So this is an AMD K6 with the packaging. 47 00:02:33,780 --> 00:02:36,740 I'm going to send it around in there. 48 00:02:36,740 --> 00:02:44,300 And this is a Pentium Pro. 49 00:02:44,300 --> 00:02:46,990 Normally what happens is that when the first die comes out, 50 00:02:46,990 --> 00:02:49,350 this is humongous, gigantic-sized. 51 00:02:49,350 --> 00:02:51,670 But after a couple of generations, what you do is 52 00:02:51,670 --> 00:02:54,360 you take the same circuit and you shrink it and shrink it. 
53 00:02:54,360 --> 00:02:56,190 This is probably, what, two shrinks-- 54 00:02:56,190 --> 00:02:59,380 so the actual die right now is very small because this is 55 00:02:59,380 --> 00:03:00,360 copper [UNINTELLIGIBLE PHRASE] 56 00:03:00,360 --> 00:03:03,770 later process that things have been shrunk. 57 00:03:03,770 --> 00:03:09,910 This is a Core 2 Duo. So I think there are 58 00:03:09,910 --> 00:03:11,680 two cores in there. 59 00:03:11,680 --> 00:03:15,650 So I didn't want to bring anything newer because those 60 00:03:15,650 --> 00:03:18,760 chips are darn expensive, and if you touch it, basically 61 00:03:18,760 --> 00:03:21,580 just static will basically destroy the processor. 62 00:03:21,580 --> 00:03:26,160 So these are something that came out of things that 63 00:03:26,160 --> 00:03:27,640 don't work. 64 00:03:27,640 --> 00:03:28,500 The next two things. 65 00:03:28,500 --> 00:03:29,370 These are-- 66 00:03:29,370 --> 00:03:30,860 I really need to get this [UNINTELLIGIBLE] because these 67 00:03:30,860 --> 00:03:34,950 are like the only ones 68 00:03:34,950 --> 00:03:36,350 available in this world. 69 00:03:36,350 --> 00:03:38,020 Basically there's only one of each item. 70 00:03:38,020 --> 00:03:42,370 So at MIT, we built a processor a couple of years ago called the 71 00:03:42,370 --> 00:03:43,540 Raw processor. 72 00:03:43,540 --> 00:03:46,760 So here's a Raw die before it's been mounted. 73 00:03:46,760 --> 00:03:49,920 So this is what comes out after all the fabrication, 74 00:03:49,920 --> 00:03:53,010 before it goes into this mounting with all the pins. 75 00:03:53,010 --> 00:03:56,450 If you look at the top of this one, what do you see? 76 00:03:56,450 --> 00:03:58,410 It's all these dots, which are basically where 77 00:03:58,410 --> 00:03:59,880 all the pins go. 78 00:03:59,880 --> 00:04:02,620 So you don't actually even see inside the die in this one. 
79 00:04:02,620 --> 00:04:05,370 However, then you pay a bunch of money to these people who 80 00:04:05,370 --> 00:04:07,670 go and grind that chip. 81 00:04:07,670 --> 00:04:12,010 You take a real working chip and you grind it, and expose 82 00:04:12,010 --> 00:04:12,860 the insides of the chip. 83 00:04:12,860 --> 00:04:14,250 So this is a ground chip that's 84 00:04:14,250 --> 00:04:15,380 exposed the metal layering. 85 00:04:15,380 --> 00:04:18,200 In fact, in this you can see real circuits in the chip. 86 00:04:18,200 --> 00:04:22,170 So I don't think anybody has the vision that's required to 87 00:04:22,170 --> 00:04:25,020 see the transistors or even the circuits. 88 00:04:25,020 --> 00:04:26,690 But it's kind of interesting to see. 89 00:04:26,690 --> 00:04:29,993 So here are these two things, so some show 90 00:04:29,993 --> 00:04:31,354 and tell like that. 91 00:04:31,354 --> 00:04:32,250 Ah-ha. 92 00:04:32,250 --> 00:04:33,080 And this. 93 00:04:33,080 --> 00:04:35,970 This is the entire reason that microprocessors are not getting 94 00:04:35,970 --> 00:04:37,280 faster anymore. 95 00:04:37,280 --> 00:04:42,720 Because this is one humongo heat sink that you put in the 96 00:04:42,720 --> 00:04:45,350 modern machines, and there's a huge fan that goes above that. 97 00:04:45,350 --> 00:04:47,920 And you can't keep building larger and larger heat sinks 98 00:04:47,920 --> 00:04:50,230 to get the heat out. 99 00:04:50,230 --> 00:04:54,810 And so this is why the heat is paramount, and that's why 100 00:04:54,810 --> 00:04:57,820 we can't run 4 gigahertz, 10 gigahertz processors. 101 00:04:57,820 --> 00:05:01,760 We are stuck in the current kind of gigahertz range, and 102 00:05:01,760 --> 00:05:02,360 then we [UNINTELLIGIBLE] 103 00:05:02,360 --> 00:05:04,080 multiple cores. 104 00:05:04,080 --> 00:05:07,370 So it's a big, humongo block in there, and then some 105 00:05:07,370 --> 00:05:10,570 actually might even have a bigger block. 
106 00:05:10,570 --> 00:05:15,840 So while these go around, let's start the lecture. 107 00:05:15,840 --> 00:05:18,530 So computer architecture includes 108 00:05:18,530 --> 00:05:19,380 many different things. 109 00:05:19,380 --> 00:05:23,220 You are doing instructions, memory, I/O bus, disk systems, 110 00:05:23,220 --> 00:05:25,420 GPU, graphics, all of those things. 111 00:05:25,420 --> 00:05:27,070 But we are not getting beyond the 112 00:05:27,070 --> 00:05:28,210 instruction and memory system. 113 00:05:28,210 --> 00:05:31,550 So we are going to focus on that, but if you really want 114 00:05:31,550 --> 00:05:33,430 to understand the full end-to-end performance, you have to 115 00:05:33,430 --> 00:05:35,110 worry about all those other things. 116 00:05:35,110 --> 00:05:39,590 So that said, let's go into instructions and memory. 117 00:05:39,590 --> 00:05:42,670 So here is the Nehalem processor. 118 00:05:42,670 --> 00:05:44,700 And there's a beautiful picture of the 119 00:05:44,700 --> 00:05:46,000 processor on the left. 120 00:05:46,000 --> 00:05:49,770 This is kind of like the Raw chip that I'm showing. 121 00:05:49,770 --> 00:05:53,790 They have ground out the upper layers and showed some of the 122 00:05:53,790 --> 00:05:55,570 layers in there. 123 00:05:55,570 --> 00:05:58,440 So that's what you get when you buy the Nehalem. 124 00:05:58,440 --> 00:06:02,330 And this very complicated diagram on the right side, 125 00:06:02,330 --> 00:06:05,160 it's an abstract notion of what's happening inside. 126 00:06:05,160 --> 00:06:07,820 So what we'll do is go a little bit into this [UNINTELLIGIBLE] 127 00:06:07,820 --> 00:06:11,025 diagram, and try to understand what impact it 128 00:06:11,025 --> 00:06:11,820 would have on you. 129 00:06:11,820 --> 00:06:16,090 This is not an architecture class, but you, in trying to 130 00:06:16,090 --> 00:06:18,290 get performance, have to make sure all these components 131 00:06:18,290 --> 00:06:20,000 work pretty well. 
132 00:06:20,000 --> 00:06:24,250 That said, it's very hard to understand in a modern 133 00:06:24,250 --> 00:06:27,150 microprocessor exactly what's going on. 134 00:06:27,150 --> 00:06:28,980 Not even Intel understands what's going 135 00:06:28,980 --> 00:06:30,070 on most of the time. 136 00:06:30,070 --> 00:06:33,210 And so a lot of things [UNINTELLIGIBLE]. 137 00:06:33,210 --> 00:06:35,850 So you can't ask, tell me exactly what 138 00:06:35,850 --> 00:06:37,910 happened or get there. 139 00:06:37,910 --> 00:06:40,130 So you can be fuzzy. 140 00:06:40,130 --> 00:06:45,640 On the other hand, being fuzzy means there's a level of 141 00:06:45,640 --> 00:06:47,840 abstraction that you have to live with. 142 00:06:47,840 --> 00:06:51,550 And what that means is when something hits you on the head very 143 00:06:51,550 --> 00:06:54,430 hard, and you have a really bad situation, you can see that and 144 00:06:54,430 --> 00:06:55,530 you can react to that. 145 00:06:55,530 --> 00:06:57,530 But there are a lot of small things that happen. 146 00:06:57,530 --> 00:06:59,850 These things interact in very complex ways. 147 00:06:59,850 --> 00:07:03,250 So you don't understand exactly the minor details of 148 00:07:03,250 --> 00:07:05,990 what's going on in these microprocessors for a given 149 00:07:05,990 --> 00:07:06,880 application. 150 00:07:06,880 --> 00:07:08,850 A lot of different things might work different ways. 151 00:07:08,850 --> 00:07:09,840 So that's it. 152 00:07:09,840 --> 00:07:14,380 So if we look at what you learned from 6.004, so what that 153 00:07:14,380 --> 00:07:15,085 means is instruction. 154 00:07:15,085 --> 00:07:16,640 You have here two instructions. 155 00:07:16,640 --> 00:07:17,890 One after the other. 156 00:07:17,890 --> 00:07:19,540 So if the instructions take [UNINTELLIGIBLE] 157 00:07:19,540 --> 00:07:19,870 cycle. 158 00:07:19,870 --> 00:07:22,150 So we see the non-[UNINTELLIGIBLE] 159 00:07:22,150 --> 00:07:23,340 method. 
160 00:07:23,340 --> 00:07:26,170 So you look at why it's probably taking five cycles. 161 00:07:26,170 --> 00:07:28,130 So that's because it's doing different things. 162 00:07:28,130 --> 00:07:30,870 You are doing instruction fetch, instruction decode, 163 00:07:30,870 --> 00:07:32,250 after that execute. 164 00:07:32,250 --> 00:07:34,260 Then you do some memory [UNINTELLIGIBLE] and then finally 165 00:07:34,260 --> 00:07:36,070 write back in there. 166 00:07:36,070 --> 00:07:37,820 So you do all those things [UNINTELLIGIBLE] 167 00:07:37,820 --> 00:07:39,520 instruction, and then you start the next 168 00:07:39,520 --> 00:07:40,530 [UNINTELLIGIBLE] cycle. 169 00:07:40,530 --> 00:07:42,320 Of course, this is very necessary. 170 00:07:42,320 --> 00:07:44,410 So the first two people [UNINTELLIGIBLE]. 171 00:07:44,410 --> 00:07:48,065 After I do instruction fetch, that logic that's doing 172 00:07:48,065 --> 00:07:50,370 instruction fetch is not doing anything for a while. 173 00:07:50,370 --> 00:07:52,740 Why don't I start the next instruction fetch immediately 174 00:07:52,740 --> 00:07:55,720 because you are using the same circuit, and then every cycle, 175 00:07:55,720 --> 00:07:57,840 it can go do the next thing, and next thing, 176 00:07:57,840 --> 00:07:58,660 and next thing. 177 00:07:58,660 --> 00:08:03,120 And by doing that, well, basically every cycle I can get 178 00:08:03,120 --> 00:08:05,730 an instruction through the system. 179 00:08:05,730 --> 00:08:08,470 So this looks very nice and simple and you can get. 180 00:08:08,470 --> 00:08:12,060 So what are they choosing here? 181 00:08:12,060 --> 00:08:15,400 Is the world this nice and simple? 182 00:08:15,400 --> 00:08:16,360 No. 183 00:08:16,360 --> 00:08:18,586 OK, what might happen? 184 00:08:18,586 --> 00:08:19,836 AUDIENCE: [INAUDIBLE PHRASE] 185 00:08:21,930 --> 00:08:24,590 SAMAN AMARASINGHE: You have a lot of issues that normally 186 00:08:24,590 --> 00:08:26,520 cause hazards. 
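The ideal, hazard-free pipelining argument above can be checked with a little arithmetic. This is a minimal sketch, assuming one clock cycle per stage and no hazards of any kind; the function names are illustrative and do not come from the lecture.

```c
#include <assert.h>

/* Cycles to run n instructions when each instruction finishes all
   `stages` steps before the next one starts (no overlap). */
unsigned long unpipelined_cycles(unsigned long n, unsigned stages) {
    return n * stages;
}

/* Cycles with ideal pipelining: the first instruction needs `stages`
   cycles to pass through, then one more instruction completes every
   cycle. Assumes no structural, data, or control hazards. */
unsigned long pipelined_cycles(unsigned long n, unsigned stages) {
    assert(stages > 0);            /* a pipeline has at least one stage */
    return n == 0 ? 0 : stages + (n - 1);
}
```

With five stages, 100 instructions take 500 cycles without overlap and 104 cycles with ideal overlap. The hazards discussed next are what keep real code from reaching that ideal.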
187 00:08:26,520 --> 00:08:29,180 So there could be three different types of hazards. 188 00:08:29,180 --> 00:08:32,630 There's a thing called a structural hazard. 189 00:08:32,630 --> 00:08:35,030 What you are trying to do is attempt to use the same 190 00:08:35,030 --> 00:08:37,760 hardware to do two things, but there's only one piece of hardware. 191 00:08:37,760 --> 00:08:38,960 You can't get two things done. 192 00:08:38,960 --> 00:08:40,960 One has to be after another. 193 00:08:40,960 --> 00:08:42,760 And there's a thing called data hazards. 194 00:08:42,760 --> 00:08:46,700 That means you're trying to run these things, one on top of 195 00:08:46,700 --> 00:08:48,500 another, but logically 196 00:08:48,500 --> 00:08:50,590 they have to run serially. 197 00:08:50,590 --> 00:08:53,740 So that means there's a data dependence, which I'll get into, 198 00:08:53,740 --> 00:08:56,850 that makes it impossible to make these things run in the 199 00:08:56,850 --> 00:08:57,910 pipeline fashion. 200 00:08:57,910 --> 00:09:00,460 And finally, there are control hazards, things like branches and 201 00:09:00,460 --> 00:09:01,800 things like that, [UNINTELLIGIBLE] 202 00:09:01,800 --> 00:09:02,700 interfere with that. 203 00:09:02,700 --> 00:09:07,040 So let me get into a little bit of detail down here. 204 00:09:07,040 --> 00:09:10,460 So the first thing we have is what we call a data hazard. 205 00:09:10,460 --> 00:09:12,160 I will talk about two different hazards. 206 00:09:12,160 --> 00:09:14,940 So before I go there, I will share a little bit about this 207 00:09:14,940 --> 00:09:16,240 assembly representation. 208 00:09:16,240 --> 00:09:20,060 In two lectures, you're going to get a deeper drilling 209 00:09:20,060 --> 00:09:23,710 into how to go from C to assembly. 210 00:09:23,710 --> 00:09:26,520 But before that, so what this instruction says, this is the 211 00:09:26,520 --> 00:09:27,690 normal x86 form. 212 00:09:27,690 --> 00:09:30,650 We are doing add long. 
213 00:09:30,650 --> 00:09:34,720 Adding the values in rbx and rax and putting the 214 00:09:34,720 --> 00:09:36,720 result back into rax. 215 00:09:36,720 --> 00:09:40,410 So you are doing rbx plus rax and get the result into rax. 216 00:09:40,410 --> 00:09:44,710 You are subtracting rax from rcx and putting the 217 00:09:44,710 --> 00:09:46,100 result into rcx. 218 00:09:46,100 --> 00:09:48,110 So basically all the results are in the 219 00:09:48,110 --> 00:09:49,640 right-hand side. 220 00:09:49,640 --> 00:09:50,250 Two [UNINTELLIGIBLE] 221 00:09:50,250 --> 00:09:52,250 are basically the first and second. 222 00:09:52,250 --> 00:09:55,830 So the last one gets read and modified. 223 00:09:55,830 --> 00:09:56,930 So that's the way it's written. 224 00:09:56,930 --> 00:09:58,130 So while you have two is-- 225 00:09:58,130 --> 00:09:59,910 question? 226 00:09:59,910 --> 00:10:01,160 AUDIENCE: [INAUDIBLE PHRASE] 227 00:10:03,630 --> 00:10:05,600 SAMAN AMARASINGHE: Instruction, yeah. 228 00:10:05,600 --> 00:10:06,470 It's a different concept. 229 00:10:06,470 --> 00:10:09,540 The second instruction is data dependent, of course. 230 00:10:09,540 --> 00:10:12,720 What's happening is I am writing this value here, which 231 00:10:12,720 --> 00:10:14,830 will be read by this guy here. 232 00:10:14,830 --> 00:10:16,390 So I write something and the next 233 00:10:16,390 --> 00:10:17,530 instruction is reading that. 234 00:10:17,530 --> 00:10:21,920 The problem is the written value might not be available 235 00:10:21,920 --> 00:10:24,100 until very late in the pipeline. 236 00:10:24,100 --> 00:10:29,270 So they cannot execute simultaneously, and all that 237 00:10:29,270 --> 00:10:31,580 is basically because of this dependence. 238 00:10:31,580 --> 00:10:35,270 This is called basically read after write, because I'm going 239 00:10:35,270 --> 00:10:37,260 to read after it's been written. 
240 00:10:37,260 --> 00:10:40,190 And if you look at the pipeline here, what happens is 241 00:10:40,190 --> 00:10:45,980 the write happens here, and the next read has to happen 242 00:10:45,980 --> 00:10:47,720 somewhere down here. 243 00:10:47,720 --> 00:10:50,740 So until this one, this slide [UNINTELLIGIBLE] 244 00:10:50,740 --> 00:10:53,910 so this stage has to get delayed after this point. 245 00:10:53,910 --> 00:10:56,210 Does this make sense? 246 00:10:56,210 --> 00:11:00,140 Because here I'm trying to read something that's not 247 00:11:00,140 --> 00:11:02,690 going to be produced for two more clock cycles. 248 00:11:02,690 --> 00:11:05,340 So that's not available and I can't do that. 249 00:11:05,340 --> 00:11:06,990 Make sense? 250 00:11:06,990 --> 00:11:08,240 OK. 251 00:11:11,530 --> 00:11:15,730 So this next type of dependence is called name dependence. 252 00:11:15,730 --> 00:11:18,610 That's output dependence and anti-dependence. 253 00:11:18,610 --> 00:11:22,550 Basically what it's doing is two instructions are using the 254 00:11:22,550 --> 00:11:24,180 same register. 255 00:11:24,180 --> 00:11:27,760 And because of that I can't start to use the register there 256 00:11:27,760 --> 00:11:29,760 until the other one is finished with the register. 257 00:11:29,760 --> 00:11:34,460 So in here, what's happening here is I am 258 00:11:34,460 --> 00:11:38,850 subtracting rax and rbx. 259 00:11:38,850 --> 00:11:41,400 I'm putting the value in rbx. 260 00:11:41,400 --> 00:11:46,500 And here I am adding rcx to rax, and basically 261 00:11:46,500 --> 00:11:48,240 putting the value in rax. 262 00:11:48,240 --> 00:11:53,150 The problem here is I cannot modify rax before this guy has 263 00:11:53,150 --> 00:11:53,670 [UNINTELLIGIBLE] 264 00:11:53,670 --> 00:11:56,040 the value in here. 265 00:11:56,040 --> 00:11:58,150 So that means I'm trying to go modify it, and say no, no, no. 
266 00:11:58,150 --> 00:12:00,210 You can't touch it because somebody will still want to 267 00:12:00,210 --> 00:12:02,650 get the value, and I can't go destroy the value. 268 00:12:02,650 --> 00:12:04,940 And I had to [UNINTELLIGIBLE]. 269 00:12:04,940 --> 00:12:07,760 That's the first type, what you call anti-dependence. 270 00:12:07,760 --> 00:12:10,780 And because of this, both are using rax. 271 00:12:10,780 --> 00:12:13,060 There's no real data movement, but it basically 272 00:12:13,060 --> 00:12:14,440 uses the same space. 273 00:12:14,440 --> 00:12:18,180 I have to wait till the other person can be evicted before I 274 00:12:18,180 --> 00:12:20,790 can read that register. 275 00:12:20,790 --> 00:12:24,680 So this is called a write after read hazard. 276 00:12:24,680 --> 00:12:28,020 The other type of dependence is called output dependence. 277 00:12:28,020 --> 00:12:33,060 So I am updating rax twice, and of course in this simple 278 00:12:33,060 --> 00:12:35,270 example we can probably even drop this instruction because 279 00:12:35,270 --> 00:12:37,180 it doesn't matter because I'm re-writing it. 280 00:12:37,180 --> 00:12:40,040 But if there's something in between also reading, what 281 00:12:40,040 --> 00:12:43,000 happens is I can't do this in the wrong order. 282 00:12:43,000 --> 00:12:46,010 The last value that's updated has to be from this instruction. 283 00:12:46,010 --> 00:12:48,970 So we have to make sure this instruction happens after this 284 00:12:48,970 --> 00:12:49,930 instruction. 285 00:12:49,930 --> 00:12:53,140 So we have a dependence, so there's some ordering in here 286 00:12:53,140 --> 00:12:56,030 that has to be maintained. 287 00:12:56,030 --> 00:13:00,440 And this is called a write after write hazard. 288 00:13:00,440 --> 00:13:03,320 So for instructions that have name dependence, we can 289 00:13:03,320 --> 00:13:06,650 actually get rid of the dependence by what we call 290 00:13:06,650 --> 00:13:08,510 register renaming. 
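The three data hazards named above (read after write, write after read, write after write) can be told apart mechanically by comparing which registers two instructions read and write. The sketch below is an illustration only. It assumes a simplified three-operand instruction form (`dst = src1 op src2`) rather than real x86 encoding, and the `Instr` type, the register enum, and `classify` are hypothetical names.

```c
/* Hypothetical register numbers for the example. */
enum { RAX, RBX, RCX, RDX };

/* Simplified instruction: writes `dst`, reads `src1` and `src2`.
   Use -1 for an unused source. This is not real x86 encoding. */
typedef struct {
    int dst;
    int src1;
    int src2;
} Instr;

enum { HAZ_RAW = 1, HAZ_WAR = 2, HAZ_WAW = 4 };

/* Does instruction i read register reg? */
static int reads(Instr i, int reg) {
    return reg >= 0 && (i.src1 == reg || i.src2 == reg);
}

/* Bit mask of the hazards that force `second` to stay ordered
   after `first`. */
int classify(Instr first, Instr second) {
    int mask = 0;
    if (reads(second, first.dst)) mask |= HAZ_RAW; /* true data dependence */
    if (reads(first, second.dst)) mask |= HAZ_WAR; /* anti-dependence */
    if (first.dst >= 0 && first.dst == second.dst)
        mask |= HAZ_WAW;                           /* output dependence */
    return mask;
}
```

Modeling the lecture's first pair as `rax = rax + rbx` followed by `rcx = rcx - rax` gives `HAZ_RAW`. The anti-dependence pair (`rbx = rbx - rax`, then `rax = rax + rcx`) gives `HAZ_WAR`. If the second instruction of a write-after-read or write-after-write pair is given a fresh destination register, the mask drops to zero, which is the effect register renaming achieves in hardware.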
291 00:13:08,510 --> 00:13:13,520 So everybody uses rax, so that's kind of an artifact of the 292 00:13:13,520 --> 00:13:15,030 good old Intel [UNINTELLIGIBLE] 293 00:13:15,030 --> 00:13:16,160 architecture sometimes [UNINTELLIGIBLE] 294 00:13:16,160 --> 00:13:18,440 when register, so you have to use the same register for 295 00:13:18,440 --> 00:13:19,540 many, many things. 296 00:13:19,540 --> 00:13:21,060 So in a modern-- 297 00:13:21,060 --> 00:13:22,760 inside the hardware there's what we 298 00:13:22,760 --> 00:13:24,560 call register renaming. 299 00:13:24,560 --> 00:13:29,140 You have lots more slots for the same registers, and even 300 00:13:29,140 --> 00:13:31,680 when you're using rax, this is an old one, this is a new one, 301 00:13:31,680 --> 00:13:33,790 use a different location for the new one. 302 00:13:33,790 --> 00:13:35,330 So I can do renaming. 303 00:13:35,330 --> 00:13:39,350 And then basically hardware can get rid of that. 304 00:13:39,350 --> 00:13:41,480 The next interesting thing is control hazards. 305 00:13:41,480 --> 00:13:45,930 So here what we see, if you had this kind of a loop, s1, 306 00:13:45,930 --> 00:13:48,950 it's control dependent on p1. 307 00:13:48,950 --> 00:13:52,670 So we can't do s1 until p1 is done. 308 00:13:52,670 --> 00:13:55,210 Or s2 until p2 is done. 309 00:13:55,210 --> 00:13:56,610 So we are [UNINTELLIGIBLE] 310 00:13:56,610 --> 00:13:58,620 the p1 condition is [UNINTELLIGIBLE] 311 00:13:58,620 --> 00:14:00,640 before you can do s1 and [UNINTELLIGIBLE] 312 00:14:00,640 --> 00:14:01,920 for the next one. 313 00:14:01,920 --> 00:14:06,870 So the interesting thing is control dependence, also we can 314 00:14:06,870 --> 00:14:09,920 get rid of it by doing speculation. 315 00:14:09,920 --> 00:14:12,320 So the idea there, what hardware does is it will say 316 00:14:12,320 --> 00:14:13,200 wait a minute. 317 00:14:13,200 --> 00:14:16,660 I know I have to wait till p1 is [UNINTELLIGIBLE] to do s1. 
318 00:14:16,660 --> 00:14:19,170 But I'm going to do s1 anyway. 319 00:14:19,170 --> 00:14:22,180 I will just go to s1, speculatively. 320 00:14:22,180 --> 00:14:26,780 And then at the end of doing that, at some time when p1 is 321 00:14:26,780 --> 00:14:28,370 calculated, and I know what p1 is, I 322 00:14:28,370 --> 00:14:30,510 ask, did I do it right? 323 00:14:30,510 --> 00:14:32,960 If I did it right, I'm good. 324 00:14:32,960 --> 00:14:34,710 If I did something wrong, I have done some 325 00:14:34,710 --> 00:14:35,340 [UNINTELLIGIBLE] 326 00:14:35,340 --> 00:14:37,130 that is not useful for that. 327 00:14:37,130 --> 00:14:37,450 Question? 328 00:14:37,450 --> 00:14:38,700 AUDIENCE: [INAUDIBLE PHRASE] 329 00:14:43,160 --> 00:14:46,630 SAMAN AMARASINGHE: So what happens is there's this thing 330 00:14:46,630 --> 00:14:48,380 called a write buffer. 331 00:14:48,380 --> 00:14:51,940 The complex thing is at the end of the day, the 332 00:14:51,940 --> 00:14:54,685 instructions have to get committed in 333 00:14:54,685 --> 00:14:57,340 the order they arrive. 334 00:14:57,340 --> 00:15:02,295 So before committing, you keep the state in buffers without 335 00:15:02,295 --> 00:15:02,990 [UNINTELLIGIBLE] 336 00:15:02,990 --> 00:15:03,240 into the main one. 337 00:15:03,240 --> 00:15:05,380 AUDIENCE: [INAUDIBLE PHRASE] 338 00:15:05,380 --> 00:15:07,460 SAMAN AMARASINGHE: So the write back into memory doesn't 339 00:15:07,460 --> 00:15:10,570 happen until the commit point. 340 00:15:10,570 --> 00:15:13,570 But when you read something, before reading the memory, say 341 00:15:13,570 --> 00:15:15,650 OK, I have updated something, is it in the write buffer? 342 00:15:15,650 --> 00:15:18,170 So I read things from the write buffer in there. 343 00:15:18,170 --> 00:15:21,590 So you have to do in-order commitment. 344 00:15:21,590 --> 00:15:23,250 Because you can't do out-of-order commitment, that can do 345 00:15:23,250 --> 00:15:24,290 crazy things. 
346 00:15:24,290 --> 00:15:27,250 You do in-order commitment, but inside the metrics things 347 00:15:27,250 --> 00:15:29,250 can go easier. 348 00:15:29,250 --> 00:15:32,400 I'm just kind of jumping the gun a little bit. 349 00:15:32,400 --> 00:15:37,660 So in a modern Nehalem processor, these are kind of 350 00:15:37,660 --> 00:15:40,290 Intel [UNINTELLIGIBLE], so you don't know exactly what it is. 351 00:15:40,290 --> 00:15:42,870 In some places it says it has 16 clock cycles. 352 00:15:42,870 --> 00:15:45,770 So it says it might be 16 stages of pipeline. 353 00:15:45,770 --> 00:15:50,600 And in another place it says 20 to 24 stages of pipeline. 354 00:15:50,600 --> 00:15:53,170 OK, if you work for Intel and sign the NDA you will get the 355 00:15:53,170 --> 00:15:55,050 exact number. 356 00:15:55,050 --> 00:15:55,840 But it doesn't matter. 357 00:15:55,840 --> 00:15:57,580 It's a lot of pipeline stages, so that's 358 00:15:57,580 --> 00:15:58,450 all you need to know. 359 00:15:58,450 --> 00:16:01,130 So what happens is, instead of doing this in five 360 00:16:01,130 --> 00:16:04,020 stages, like what you probably did in the Beta, 361 00:16:04,020 --> 00:16:09,590 you do instruction decode, instruction queue, pre-decode 362 00:16:09,590 --> 00:16:13,090 queue, decode, register rename, 363 00:16:13,090 --> 00:16:14,490 and allocate registers. 364 00:16:14,490 --> 00:16:17,090 And then there's a thing called reservation stations that 365 00:16:17,090 --> 00:16:21,490 wait for when data's available to execute that instruction. 366 00:16:21,490 --> 00:16:23,730 And then execute, and then basically go there. 367 00:16:23,730 --> 00:16:26,430 This is what normally an instruction's life would be, if 368 00:16:26,430 --> 00:16:30,060 everything goes one after another after another. 369 00:16:30,060 --> 00:16:36,260 So what to get out of this is the pipelines are long. 370 00:16:36,260 --> 00:16:39,620 So pipeline stalls and stuff can be expensive. 
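To see why a long pipeline makes stalls expensive, here is a rough back-of-the-envelope model. It is a sketch under stated assumptions, not a description of Nehalem: it assumes one instruction completes per cycle when nothing goes wrong, that only mispredicted branches stall the pipeline, and that each one costs a fixed number of cycles. The function name and all parameters are hypothetical.

```c
/* Rough average cycles per instruction (CPI) when a fraction of the
   instructions are branches, a fraction of those are mispredicted,
   and every misprediction throws away `flush_penalty_cycles` cycles.
   The base CPI of 1.0 is an assumption of this model. */
double effective_cpi(double branch_fraction,
                     double mispredict_rate,
                     double flush_penalty_cycles) {
    return 1.0 + branch_fraction * mispredict_rate * flush_penalty_cycles;
}
```

As an illustration with made-up numbers: if 20% of instructions are branches, 5% of them are mispredicted, and the penalty is 16 cycles, the model gives about 1.16 cycles per instruction. The same rates with a 5-cycle penalty give about 1.05, so the deeper pipeline pays roughly three times as much for the same mistakes.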
371 00:16:39,620 --> 00:16:42,490 So the other thing you can do in an abstract way 372 00:16:42,490 --> 00:16:44,870 is do multiple issue. 373 00:16:44,870 --> 00:16:47,610 So if you do pipeline, what happens is you do one at a 374 00:16:47,610 --> 00:16:50,620 time so you can use the pipeline stages nicely. 375 00:16:50,620 --> 00:16:54,000 How about instead of having one unit, you have an integer 376 00:16:54,000 --> 00:16:55,640 unit and a floating point unit. 377 00:16:55,640 --> 00:16:58,120 And every clock cycle you can basically abstract. 378 00:16:58,120 --> 00:17:00,600 You could say look, I'm taking an integer instruction, I'm 379 00:17:00,600 --> 00:17:02,300 taking a floating point instruction, and take them 380 00:17:02,300 --> 00:17:03,910 both together [UNINTELLIGIBLE]. 381 00:17:03,910 --> 00:17:06,859 And then voilà, at the end of the day in one clock cycle I 382 00:17:06,859 --> 00:17:07,930 can [UNINTELLIGIBLE] 383 00:17:07,930 --> 00:17:10,579 two instructions, one integer, one floating point, if you 384 00:17:10,579 --> 00:17:12,329 have two different sets in there. 385 00:17:12,329 --> 00:17:14,660 So this is also very nice. 386 00:17:14,660 --> 00:17:19,050 You can get a lot of what we call superscalar performance. 387 00:17:19,050 --> 00:17:22,085 This is called instruction-level parallelism. 388 00:17:22,085 --> 00:17:24,530 However, there are a lot of problems, also things that you worry about. 389 00:17:24,530 --> 00:17:26,115 Well first of all, you have to have enough instruction-level 390 00:17:26,115 --> 00:17:26,740 parallelism. 391 00:17:26,740 --> 00:17:30,230 Because the instructions you get are in sequence, 392 00:17:30,230 --> 00:17:31,890 and you see one after another. 393 00:17:31,890 --> 00:17:34,310 I mean you write your program and compile. 394 00:17:34,310 --> 00:17:36,840 It looks like you run one instruction, finish it, run 395 00:17:36,840 --> 00:17:37,590 another, run another. 
396 00:17:37,590 --> 00:17:40,240 So in order to run in parallel, you have to find things 397 00:17:40,240 --> 00:17:41,760 you can actually do in parallel. 398 00:17:41,760 --> 00:17:44,950 That takes time. 399 00:17:44,950 --> 00:17:47,540 And of course, between hardware and software you 400 00:17:47,540 --> 00:17:51,300 have to preserve the order of instructions, because 401 00:17:51,300 --> 00:17:53,785 you want to make sure that it looks like and it feels like 402 00:17:53,785 --> 00:17:55,730 you ran one after another after another. 403 00:17:55,730 --> 00:17:58,310 You don't want things to run in a haphazard way and create 404 00:17:58,310 --> 00:17:59,590 arbitrary results. 405 00:17:59,590 --> 00:18:02,880 And so you have to, again, things like this, data 406 00:18:02,880 --> 00:18:04,210 dependences, control dependences have to be 407 00:18:04,210 --> 00:18:08,440 satisfied when you're getting this parallelism. 408 00:18:08,440 --> 00:18:14,070 So data dependences, there's a hazard, and they determine in which 409 00:18:14,070 --> 00:18:14,910 order you can run things. 410 00:18:14,910 --> 00:18:16,600 So for example, output dependence. 411 00:18:16,600 --> 00:18:18,580 Say look, I can't just run them in parallel. 412 00:18:18,580 --> 00:18:20,480 I have to run one after another after another. 413 00:18:20,480 --> 00:18:22,110 You have to get all those things done. 414 00:18:22,110 --> 00:18:24,470 And also a lot of dependences kind of give you 415 00:18:24,470 --> 00:18:26,910 bounds on how much parallelism you can get. 416 00:18:26,910 --> 00:18:29,230 If things are all dependent, one after 417 00:18:29,230 --> 00:18:30,140 another, you can't run them in parallel. 418 00:18:30,140 --> 00:18:32,390 You have to wait to run one thing after another. 419 00:18:32,390 --> 00:18:34,670 So by looking at data dependence, you can figure 420 00:18:34,670 --> 00:18:40,080 out, OK look, did I get good performance, good ILP or not? 
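The point that dependences bound the available parallelism can be illustrated with a sum over an array. This is a sketch, not code from the lecture; the function names are hypothetical, and unsigned arithmetic is used so that regrouping the additions cannot change the result.

```c
#include <stddef.h>

/* One accumulator: every add reads the result of the previous add,
   so the adds form a single serial read-after-write chain. */
unsigned long sum_chain(const unsigned long *x, size_t n) {
    unsigned long s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four accumulators: four independent chains that a superscalar
   core is free to overlap. The result is the same because unsigned
   addition can be regrouped freely. */
unsigned long sum_split(const unsigned long *x, size_t n) {
    unsigned long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)      /* leftover elements */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

Whether `sum_split` actually runs faster depends on the compiler and the processor; an optimizing compiler may perform a similar transformation on its own, so this is something to measure rather than assume.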
421 00:18:40,080 --> 00:18:43,750 So what we want to do is exploit this parallelism like 422 00:18:43,750 --> 00:18:45,900 we see in the program order. 423 00:18:45,900 --> 00:18:50,370 And basically make sure that you always get the 424 00:18:50,370 --> 00:18:52,410 same result on that. 425 00:18:52,410 --> 00:18:56,680 One way of getting parallelism that is in modern processors is 426 00:18:56,680 --> 00:18:59,120 called multimedia instructions. 427 00:18:59,120 --> 00:19:02,460 It's called SIMD in academic circles, something like Single 428 00:19:02,460 --> 00:19:06,540 Instruction Multiple Data, and it's called data level 429 00:19:06,540 --> 00:19:08,320 parallelism, and Intel, of course, has to give the 430 00:19:08,320 --> 00:19:10,030 [UNINTELLIGIBLE] name they call SSE, they 431 00:19:10,030 --> 00:19:11,870 just call it MMX. 432 00:19:11,870 --> 00:19:14,670 So the idea there is look, they are 433 00:19:14,670 --> 00:19:16,000 building this wide register. 434 00:19:16,000 --> 00:19:20,810 They can easily build a 128-bit wide register. 435 00:19:20,810 --> 00:19:24,710 But you most probably don't need 128-bit data, because 436 00:19:24,710 --> 00:19:25,710 that's too big. 437 00:19:25,710 --> 00:19:27,740 But they can build this large. 438 00:19:27,740 --> 00:19:30,200 And most of the time you're happy with about 32-bit data. 439 00:19:30,200 --> 00:19:30,950 So what? 440 00:19:30,950 --> 00:19:31,680 [UNINTELLIGIBLE] 441 00:19:31,680 --> 00:19:32,085 wait a minute. 442 00:19:32,085 --> 00:19:33,550 You build this wide thing. 443 00:19:33,550 --> 00:19:37,200 But you can chop it into four pieces or eight pieces or two 444 00:19:37,200 --> 00:19:38,340 pieces, whatever you want. 445 00:19:38,340 --> 00:19:41,830 So here we chop that large thing into four pieces. 
446 00:19:41,830 --> 00:19:45,980 And then what you can do is in a single instruction, you can 447 00:19:45,980 --> 00:19:49,610 instead of doing one large add, with the same type of adder, 448 00:19:49,610 --> 00:19:51,000 you can add four separate things. 449 00:19:51,000 --> 00:19:52,370 Four small parts. 450 00:19:52,370 --> 00:19:56,030 So you're issuing an add, but instead of adding 128-bit 451 00:19:56,030 --> 00:20:01,590 data, what you are doing is adding here, four 32-bit data items. 452 00:20:01,590 --> 00:20:04,230 The entire architecture looks nearly identical 453 00:20:04,230 --> 00:20:05,665 to a 128-bit add. 454 00:20:05,665 --> 00:20:08,720 It's basically, if you remember your 455 00:20:08,720 --> 00:20:09,640 6.004, it's like a carry. 456 00:20:09,640 --> 00:20:11,160 You just have to cut the carry. 457 00:20:11,160 --> 00:20:14,460 You don't carry from here to here, and do that. 458 00:20:14,460 --> 00:20:17,840 So you're loading this large chunk, and you are just 459 00:20:17,840 --> 00:20:19,050 working on these things. 460 00:20:19,050 --> 00:20:21,800 And voilà, by doing that, having the same kind of 461 00:20:21,800 --> 00:20:24,770 circuitry, you get much better parallelism. 462 00:20:24,770 --> 00:20:27,570 And if you have like 8-bit data, boy, you can get lots of 463 00:20:27,570 --> 00:20:28,830 parallelism. 464 00:20:28,830 --> 00:20:34,380 So there's all these multimedia instructions in there. 465 00:20:34,380 --> 00:20:37,560 And so what you can do is you can have a loop like this. 466 00:20:37,560 --> 00:20:40,240 So in every iteration, you can add A[i] 467 00:20:40,240 --> 00:20:42,890 to B[i], and create pretty good [UNINTELLIGIBLE]. 468 00:20:42,890 --> 00:20:45,160 So you can do one at a time. 469 00:20:45,160 --> 00:20:48,700 But assume you have 32-bit values. 470 00:20:48,700 --> 00:20:51,110 Instead of doing that, in a single instruction you can load A0 to 471 00:20:51,110 --> 00:20:53,360 A3 in one go.
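The idea of cutting the carry between pieces can be mimicked in plain C, scaled down to four 8-bit lanes packed into one 32-bit word. This is only a sketch of the principle, not how SSE is actually programmed, and `packed_add_u8x4` and `lane_u8x4` are made-up names. The masks keep a carry out of one lane from reaching the next lane.

```c
#include <assert.h>
#include <stdint.h>

/* Extract lane 0..3 from a packed word (lane 0 is the lowest byte). */
uint8_t lane_u8x4(uint32_t v, int lane) {
    assert(lane >= 0 && lane < 4);
    return (uint8_t)(v >> (8 * lane));
}

/* Add four 8-bit lanes at once; each lane wraps around on its own. */
uint32_t packed_add_u8x4(uint32_t a, uint32_t b) {
    const uint32_t low7 = 0x7F7F7F7Fu; /* low 7 bits of every lane */
    const uint32_t top1 = 0x80808080u; /* top bit of every lane */
    /* Adding only the low 7 bits cannot carry across a lane boundary. */
    uint32_t sum_low = (a & low7) + (b & low7);
    /* Fold the top bits back in with XOR, which produces no carry out. */
    return sum_low ^ ((a ^ b) & top1);
}
```

For example, adding `0x000000FF` and `0x00000001` wraps lane 0 to zero and leaves lane 1 untouched, where an ordinary 32-bit add would have carried into it.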
472 00:20:53,360 --> 00:20:55,000 B0 to B3 in one go. 473 00:20:55,000 --> 00:20:58,990 Just do the add of all four in one go, [UNINTELLIGIBLE]. 474 00:20:58,990 --> 00:21:01,520 So in a single step you get four things done. 475 00:21:01,520 --> 00:21:02,680 So this is really cool. 476 00:21:02,680 --> 00:21:08,870 And if you remember my first lecture, we showed this matrix 477 00:21:08,870 --> 00:21:09,790 multiply. 478 00:21:09,790 --> 00:21:10,725 If you get [UNINTELLIGIBLE] 479 00:21:10,725 --> 00:21:13,280 in there, you get a lot fewer instructions executed. 480 00:21:13,280 --> 00:21:16,420 We got a nice [UNINTELLIGIBLE]. 481 00:21:16,420 --> 00:21:21,270 So I'm going down, so in the top half I'm showing the C code 482 00:21:21,270 --> 00:21:25,030 for that loop, and then in the bottom half in there, I'm 483 00:21:25,030 --> 00:21:27,240 showing the normal assembly that we 484 00:21:27,240 --> 00:21:29,380 [UNINTELLIGIBLE], what you see. 485 00:21:29,380 --> 00:21:32,410 And once you start understanding assembly a 486 00:21:32,410 --> 00:21:34,630 little bit more, you can go ahead and read this and see 487 00:21:34,630 --> 00:21:36,210 what's going on. 488 00:21:36,210 --> 00:21:38,360 And then if you [UNINTELLIGIBLE], you get a 489 00:21:38,360 --> 00:21:40,030 humongous beast like this. 490 00:21:40,030 --> 00:21:43,250 This has unrolled [UNINTELLIGIBLE], whatever. 491 00:21:43,250 --> 00:21:48,150 And this is the kind of code that we generated by 492 00:21:48,150 --> 00:21:49,490 [UNINTELLIGIBLE]. 493 00:21:49,490 --> 00:21:52,140 And that's just much, much faster than 494 00:21:52,140 --> 00:21:53,390 the one on the left. 495 00:21:55,870 --> 00:21:58,520 Any questions so far? 496 00:21:58,520 --> 00:22:01,140 OK. 497 00:22:01,140 --> 00:22:03,980 So when we have this superscalar [UNINTELLIGIBLE], 498 00:22:03,980 --> 00:22:07,690 with multiple execution units, what you can do is in a clock 499 00:22:07,690 --> 00:22:09,940 cycle, you can do a lot of things.
500 00:22:09,940 --> 00:22:13,550 So in the [UNINTELLIGIBLE] processor, you have basically 501 00:22:13,550 --> 00:22:16,665 six things, six execution things that can happen in one 502 00:22:16,665 --> 00:22:17,520 clock cycle. 503 00:22:17,520 --> 00:22:20,590 They're not all the same thing: you can do a data store. 504 00:22:20,590 --> 00:22:27,010 You can store address, load address from there, and then 505 00:22:27,010 --> 00:22:30,330 you can do three different types of executions. 506 00:22:30,330 --> 00:22:33,760 There are three units in here. 507 00:22:33,760 --> 00:22:38,660 In here you can do either integer or SSE, add or move 508 00:22:38,660 --> 00:22:41,320 instruction, this unit can do this kind of instruction, you 509 00:22:41,320 --> 00:22:44,900 can have a floating-point add around here. 510 00:22:44,900 --> 00:22:47,240 This one also has this bunch of different things 511 00:22:47,240 --> 00:22:48,010 you can do. 512 00:22:48,010 --> 00:22:51,280 So this complex hardware will kind of figure out, OK, 513 00:22:51,280 --> 00:22:53,990 if you have these instructions, figure out which 514 00:22:53,990 --> 00:22:55,270 of these [UNINTELLIGIBLE] 515 00:22:55,270 --> 00:22:56,100 are available. 516 00:22:56,100 --> 00:22:58,560 If the instruction is able to execute, it will put it into 517 00:22:58,560 --> 00:23:00,310 one of these and get the results out. 518 00:23:00,310 --> 00:23:05,420 So if you're lucky, what you can see is, you might get 519 00:23:05,420 --> 00:23:09,320 about six instructions going each cycle. 520 00:23:09,320 --> 00:23:12,980 So if you look at the machine here, what we call clocks 521 00:23:12,980 --> 00:23:15,350 per instruction, it could be 1/6. 522 00:23:15,350 --> 00:23:16,020 If you're lucky. 523 00:23:16,020 --> 00:23:18,600 If you can run the machine to the metal, that means 524 00:23:18,600 --> 00:23:21,340 everything is working all the time.
525 00:23:21,340 --> 00:23:24,904 It's very hard to find that kind of program, but 526 00:23:24,904 --> 00:23:25,860 [UNINTELLIGIBLE] 527 00:23:25,860 --> 00:23:27,106 can get something like that. 528 00:23:27,106 --> 00:23:28,356 That's really cool. 529 00:23:30,570 --> 00:23:33,750 So what happens is to get this thing, they do things called 530 00:23:33,750 --> 00:23:34,860 out-of-order execution. 531 00:23:34,860 --> 00:23:39,050 So what that means is, issuing this varying number of 532 00:23:39,050 --> 00:23:42,120 instructions per clock. In here, in the Nehalem 533 00:23:42,120 --> 00:23:46,860 processor, you can store 128 instructions in this core 534 00:23:46,860 --> 00:23:48,630 instruction buffer. 535 00:23:48,630 --> 00:23:49,890 [INAUDIBLE] 536 00:23:49,890 --> 00:23:51,372 better name than instruction buffer. 537 00:23:51,372 --> 00:23:53,970 If you can see that, [INAUDIBLE] very far. 538 00:23:53,970 --> 00:23:55,220 But [INAUDIBLE] up there, sorry. 539 00:23:59,770 --> 00:24:00,300 Here. 540 00:24:00,300 --> 00:24:04,120 It's called reservation stations. 541 00:24:04,120 --> 00:24:06,140 So when you put it in, all those [UNINTELLIGIBLE], 542 00:24:06,140 --> 00:24:08,550 reformation [UNINTELLIGIBLE], ready to get, and the minute 543 00:24:08,550 --> 00:24:11,070 it becomes available, that means all the data is ready 544 00:24:11,070 --> 00:24:14,030 for it to run, you can basically issue that and run. 545 00:24:16,800 --> 00:24:20,590 So what they're doing is, in these hundreds of 546 00:24:20,590 --> 00:24:22,780 instructions, you're figuring out the patterns you can run. 547 00:24:22,780 --> 00:24:25,200 And the minute some part is available, the data becomes 548 00:24:25,200 --> 00:24:28,010 available, you can run that. 549 00:24:28,010 --> 00:24:30,910 And so that means the program might have instructions in 550 00:24:30,910 --> 00:24:31,720 a certain order.
551 00:24:31,720 --> 00:24:34,690 It gets executed in a very different order than what the 552 00:24:34,690 --> 00:24:35,660 program has. 553 00:24:35,660 --> 00:24:37,770 So that's where you get things like 554 00:24:37,770 --> 00:24:38,170 [UNINTELLIGIBLE] 555 00:24:38,170 --> 00:24:40,940 register, [UNINTELLIGIBLE] dependencies, [UNINTELLIGIBLE] 556 00:24:40,940 --> 00:24:41,550 execution. 557 00:24:41,550 --> 00:24:42,910 And sometimes you might run speculatively. 558 00:24:42,910 --> 00:24:44,250 You might not know. 559 00:24:44,250 --> 00:24:46,390 You might have the data, but there might be a control 560 00:24:46,390 --> 00:24:46,830 dependence. 561 00:24:46,830 --> 00:24:49,010 You might not know that it's actually supposed to 562 00:24:49,010 --> 00:24:50,655 happen, but it says wait a minute, I 563 00:24:50,655 --> 00:24:52,430 have an execution unit available. 564 00:24:52,430 --> 00:24:53,380 It's not doing anything. 565 00:24:53,380 --> 00:24:57,250 OK, let's run it and keep the results, and if it is actually 566 00:24:57,250 --> 00:24:58,830 useful, we'll use it. 567 00:24:58,830 --> 00:25:00,280 Otherwise, we'll throw it away. 568 00:25:00,280 --> 00:25:02,290 You can do things like that. 569 00:25:02,290 --> 00:25:03,700 And that's speculation. 570 00:25:03,700 --> 00:25:06,580 So another type of speculation-- 571 00:25:06,580 --> 00:25:10,580 speculation mostly means, you want to do something, but you 572 00:25:10,580 --> 00:25:12,520 don't know whether you can do it. 573 00:25:12,520 --> 00:25:14,180 So how do you go about doing that? 574 00:25:14,180 --> 00:25:17,480 The way that Intel processors and modern processors do it is, they 575 00:25:17,480 --> 00:25:19,230 do a lot of predictions. 576 00:25:19,230 --> 00:25:21,510 It has like this is [UNINTELLIGIBLE] say, look at 577 00:25:21,510 --> 00:25:24,360 the crystal ball and say, aha, this is the future.
578 00:25:24,360 --> 00:25:25,960 And it does that [UNINTELLIGIBLE] 579 00:25:25,960 --> 00:25:27,210 really many things. 580 00:25:27,210 --> 00:25:31,030 So it does things like branch prediction, value prediction, 581 00:25:31,030 --> 00:25:32,050 prefetching. 582 00:25:32,050 --> 00:25:35,100 So it says, aha, I don't know where you're going, but I 583 00:25:35,100 --> 00:25:38,260 think you are going home after the class. 584 00:25:38,260 --> 00:25:39,840 And then probably if I said something like this, it's 585 00:25:39,840 --> 00:25:41,480 probably right for most of the people. 586 00:25:41,480 --> 00:25:43,860 And actually, no, in this class, most of you, again, 587 00:25:43,860 --> 00:25:46,660 will probably go to sleep after the class, because of 588 00:25:46,660 --> 00:25:47,930 the project. 589 00:25:47,930 --> 00:25:52,160 So if I do that, probably I make a very good prediction. 590 00:25:52,160 --> 00:25:56,380 How many of you are going to go to sleep after the class? 591 00:25:56,380 --> 00:25:57,910 Huh, my prediction is not that good. 592 00:25:57,910 --> 00:25:58,910 OK. 593 00:25:58,910 --> 00:26:00,160 We'll see. 594 00:26:03,540 --> 00:26:07,800 So you can do all these kinds of things, identify normal 595 00:26:07,800 --> 00:26:09,900 behavior of programs to deal with that. 596 00:26:09,900 --> 00:26:13,000 And what prefetching says is, if I look at what you have 597 00:26:13,000 --> 00:26:16,610 been taking from the memory, there are some patterns, and 598 00:26:16,610 --> 00:26:17,780 I'm expecting you to [UNINTELLIGIBLE] 599 00:26:17,780 --> 00:26:20,630 that pattern, and I will basically get the data, even 600 00:26:20,630 --> 00:26:23,480 though you didn't ask for it, because I'm expecting you to 601 00:26:23,480 --> 00:26:26,210 be doing that. 602 00:26:26,210 --> 00:26:33,260 It's like a very good butler who can really think: before you ask 603 00:26:33,260 --> 00:26:36,500 for something, he has that ready for you, basically.
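The prefetching idea of spotting a pattern in past accesses and fetching ahead can be sketched as a toy stride detector in C. This is an assumed, simplified model for illustration only. The lecture does not describe how the real hardware prefetchers work, and all names below are hypothetical. The model predicts the next address only after it has seen the same stride twice in a row.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    long last_addr;  /* most recent address observed */
    long stride;     /* difference between the last two addresses */
    int  seen;       /* how many addresses observed so far, capped at 2 */
    bool confident;  /* true once the same stride repeated */
} stride_prefetcher;

void prefetcher_init(stride_prefetcher *p) {
    assert(p != NULL);
    p->last_addr = 0;
    p->stride = 0;
    p->seen = 0;
    p->confident = false;
}

/* Record one memory access and update the stride guess. */
void prefetcher_observe(stride_prefetcher *p, long addr) {
    assert(p != NULL);
    if (p->seen >= 1) {
        long stride = addr - p->last_addr;
        /* Confident only if a previous stride exists and it matches. */
        p->confident = (p->seen >= 2 && stride == p->stride);
        p->stride = stride;
    }
    p->last_addr = addr;
    if (p->seen < 2) {
        p->seen++;
    }
}

/* If a pattern was recognized, store the guessed next address in *next. */
bool prefetcher_predict(const stride_prefetcher *p, long *next) {
    assert(p != NULL && next != NULL);
    if (!p->confident) {
        return false;
    }
    *next = p->last_addr + p->stride;
    return true;
}
```

After observing addresses 100, 164, and 228, this model predicts 292. A single access that breaks the pattern makes it stop predicting until the stride repeats again.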
604 00:26:36,500 --> 00:26:39,880 So by doing that, you can get much better parallelism, 605 00:26:39,880 --> 00:26:43,230 because now you're not waiting for that to be sure 606 00:26:43,230 --> 00:26:44,110 that you can do it. 607 00:26:44,110 --> 00:26:46,350 You're just kind of doing this, assuming you might be 608 00:26:46,350 --> 00:26:49,620 using it, and if you do it really well, you're actually 609 00:26:49,620 --> 00:26:50,230 going to use it. 610 00:26:50,230 --> 00:26:53,210 And that will be useful. 611 00:26:53,210 --> 00:26:55,640 So for example, even things like value prediction. 612 00:26:55,640 --> 00:26:57,660 Value prediction says, [UNINTELLIGIBLE] 613 00:26:57,660 --> 00:27:00,740 to get out, you know when you do this complex calculation, 614 00:27:00,740 --> 00:27:03,170 most of the time, the data is zero. 615 00:27:03,170 --> 00:27:06,670 You multiply and add a lot of zeroes, and if you're doing 616 00:27:06,670 --> 00:27:08,990 spreadsheets, all it is [UNINTELLIGIBLE] calculations. 617 00:27:08,990 --> 00:27:13,440 So if I don't know the value, let me assume it's 0. 618 00:27:13,440 --> 00:27:14,810 And that actually works most of the time. 619 00:27:14,810 --> 00:27:17,400 Or you look at the last couple of times, what has happened to 620 00:27:17,400 --> 00:27:18,940 that [UNINTELLIGIBLE], and from there, 621 00:27:18,940 --> 00:27:20,170 let me predict something. 622 00:27:20,170 --> 00:27:22,380 And people find these kinds of things; there are a lot of 623 00:27:22,380 --> 00:27:23,960 patterns in [UNINTELLIGIBLE] programs. 624 00:27:23,960 --> 00:27:28,130 And if it works, that's really great. 625 00:27:28,130 --> 00:27:32,820 So in speculation, what you do is first issue and execute, 626 00:27:32,820 --> 00:27:34,540 as if the branch was-- 627 00:27:34,540 --> 00:27:36,690 I already knew where it is going, just by using the 628 00:27:36,690 --> 00:27:37,540 prediction.
629 00:27:37,540 --> 00:27:39,950 And if I just do dynamic scheduling, I am only doing 630 00:27:39,950 --> 00:27:41,100 [UNINTELLIGIBLE] 631 00:27:41,100 --> 00:27:43,400 issues, I am not going to execute until I know 632 00:27:43,400 --> 00:27:44,100 everything is [UNINTELLIGIBLE]. 633 00:27:44,100 --> 00:27:48,030 So speculation goes one step beyond. 634 00:27:48,030 --> 00:27:49,900 It's what we call a data flow execution 635 00:27:49,900 --> 00:27:51,480 model, dynamic scheduling. 636 00:27:51,480 --> 00:27:54,690 And that means, the minute the data is ready and everything 637 00:27:54,690 --> 00:27:56,950 is ready, you can execute, even though it might not be in 638 00:27:56,950 --> 00:27:58,200 the same order as the program. 639 00:27:58,200 --> 00:28:00,090 It might be a much later instruction. 640 00:28:00,090 --> 00:28:03,340 But implicitly, you can execute that. 641 00:28:03,340 --> 00:28:05,610 So again, the [UNINTELLIGIBLE] 642 00:28:05,610 --> 00:28:09,240 pipeline, you can do 20 to 24 stages or 643 00:28:09,240 --> 00:28:11,430 16, whatever is correct. 644 00:28:11,430 --> 00:28:14,060 The other thing it does is this crazy thing called 645 00:28:14,060 --> 00:28:17,770 micro-ops that throws everybody for a loop. 646 00:28:17,770 --> 00:28:20,920 So what happened was-- this is kind of historical-- 647 00:28:20,920 --> 00:28:24,262 Intel, a long time ago, came up with the x86 architecture. 648 00:28:24,262 --> 00:28:26,810 They came up with this instruction set. 649 00:28:26,810 --> 00:28:29,960 That instruction set is called a CISC instruction set. 650 00:28:29,960 --> 00:28:33,240 It's a complex instruction set architecture. 651 00:28:33,240 --> 00:28:35,890 What that means is there are these instructions in there 652 00:28:35,890 --> 00:28:37,800 that are really, really complex. 653 00:28:37,800 --> 00:28:41,840 You can move an entire string from one place to another in 654 00:28:41,840 --> 00:28:43,400 one instruction.
655 00:28:43,400 --> 00:28:46,690 The problem is, those instructions cannot 656 00:28:46,690 --> 00:28:48,750 be done in one cycle. 657 00:28:48,750 --> 00:28:50,420 So what modern [UNINTELLIGIBLE] 658 00:28:50,420 --> 00:28:53,130 did was, they built a [UNINTELLIGIBLE] compiler into 659 00:28:53,130 --> 00:28:54,740 the hardware. 660 00:28:54,740 --> 00:28:57,690 So what happens inside this processor is, it takes these CISC 661 00:28:57,690 --> 00:29:00,760 instructions and compiles them down to these particular 662 00:29:00,760 --> 00:29:04,080 micro-operations, where each operation is small and can be done 663 00:29:04,080 --> 00:29:06,180 in basically one cycle. 664 00:29:06,180 --> 00:29:08,080 So it's doing this mapping in there. 665 00:29:08,080 --> 00:29:10,780 And then inside the computer, it's basically 666 00:29:10,780 --> 00:29:12,060 dealing with micro-ops. 667 00:29:12,060 --> 00:29:20,420 So up to here, it's dealing with these long instructions, 668 00:29:20,420 --> 00:29:24,270 and here it's predecoding and decoding into micro-ops. 669 00:29:24,270 --> 00:29:27,510 So after that, what that means is, every cycle, it can 670 00:29:27,510 --> 00:29:28,610 issue 6 micro-ops. 671 00:29:28,610 --> 00:29:33,540 So that might be 6 instructions, you know, or 3, or 672 00:29:33,540 --> 00:29:35,700 even 1, depending on how many micro-ops [UNINTELLIGIBLE]. 673 00:29:35,700 --> 00:29:38,490 So if you look at the instruction manual in the 674 00:29:38,490 --> 00:29:40,970 [UNINTELLIGIBLE], you can see how many micro-ops 675 00:29:40,970 --> 00:29:43,410 [UNINTELLIGIBLE] instruction it's going into, and if it's 676 00:29:43,410 --> 00:29:46,000 large, that means those instructions are slow. 677 00:29:46,000 --> 00:29:49,480 And you can have these 120 micro-ops waiting to be 678 00:29:49,480 --> 00:29:54,920 executed, sitting in this reservation station. 679 00:29:54,920 --> 00:29:55,650 Basically, they are waiting there.
680 00:29:55,650 --> 00:29:58,760 The minute the data is available, and a slot is 681 00:29:58,760 --> 00:30:00,790 available, it will go down, get executed, and come back. 682 00:30:03,740 --> 00:30:06,750 So branch prediction, basically-- 683 00:30:06,750 --> 00:30:09,540 the problem with branches is if you have this very nice 684 00:30:09,540 --> 00:30:14,670 pipeline, what happens is, the branch target is not known 685 00:30:14,670 --> 00:30:16,480 until this target detection time. 686 00:30:16,480 --> 00:30:17,230 Where is my mouse? 687 00:30:17,230 --> 00:30:18,050 OK. 688 00:30:18,050 --> 00:30:20,960 Until this point, I don't know, with the branch, where I am 689 00:30:20,960 --> 00:30:22,910 supposed to go. 690 00:30:22,910 --> 00:30:25,500 And so how do I go ahead and get the instruction, because I 691 00:30:25,500 --> 00:30:27,290 don't know where I'm going until this point. 692 00:30:27,290 --> 00:30:29,340 Even worse, I might not even know the address. 693 00:30:29,340 --> 00:30:30,250 Some [UNINTELLIGIBLE] 694 00:30:30,250 --> 00:30:32,690 you are going A or B, you are taken or not taken. 695 00:30:32,690 --> 00:30:34,060 But sometimes there might be [UNINTELLIGIBLE] 696 00:30:34,060 --> 00:30:34,540 branches. 697 00:30:34,540 --> 00:30:36,820 I don't even know; if I'm returning, the return 698 00:30:36,820 --> 00:30:37,990 address is in the stack. 699 00:30:37,990 --> 00:30:41,510 Until I go fetch the stack and load it and all those things, 700 00:30:41,510 --> 00:30:44,320 I don't know where I'm going, so I have to 701 00:30:44,320 --> 00:30:45,710 wait until this point. 702 00:30:45,710 --> 00:30:48,250 So if I do it, basically I have to 703 00:30:48,250 --> 00:30:49,170 wait until this pipeline. 704 00:30:49,170 --> 00:30:50,510 All these things are going to be stalled. 705 00:30:50,510 --> 00:30:51,665 Nothing happens in here. 706 00:30:51,665 --> 00:30:52,670 This is first stage. 707 00:30:52,670 --> 00:30:53,820 You can have a interested pipeline.
708 00:30:53,820 --> 00:30:57,730 You are waiting 20 stages, just stalled, doing nothing. 709 00:30:57,730 --> 00:30:59,730 And then there's superscalar plus 20. 710 00:30:59,730 --> 00:31:03,210 So that means 20 cycles without the [? 6 issues ?] 711 00:31:03,210 --> 00:31:04,110 you can do. 712 00:31:04,110 --> 00:31:11,230 So 6 times 20 is 120 possible instructions wasted, and it's 713 00:31:11,230 --> 00:31:12,050 a lot of waste. 714 00:31:12,050 --> 00:31:13,710 So you don't want to do that. 715 00:31:13,710 --> 00:31:16,920 So what you do is, you build a predictor to figure out which 716 00:31:16,920 --> 00:31:18,240 direction the branch is going. 717 00:31:18,240 --> 00:31:20,590 And depending on what the predictor does, and the neat 718 00:31:20,590 --> 00:31:22,470 thing is, these days, these predictors are 719 00:31:22,470 --> 00:31:23,920 really, really good. 720 00:31:23,920 --> 00:31:27,520 They can predict up to 99 points plus [UNINTELLIGIBLE] 721 00:31:27,520 --> 00:31:28,170 where you're going. 722 00:31:28,170 --> 00:31:30,000 So it's like me telling you, you guys are all going to 723 00:31:30,000 --> 00:31:32,910 sleep afterwards, and this is going to sleep on me. 724 00:31:36,000 --> 00:31:40,150 So what you can predict, I will say, aha! 725 00:31:40,150 --> 00:31:41,930 I tell you, you are going in that direction. 726 00:31:41,930 --> 00:31:44,600 This is like going to [UNINTELLIGIBLE]. 727 00:31:44,600 --> 00:31:46,300 I think you're going that direction, and 728 00:31:46,300 --> 00:31:47,690 then you just go there. 729 00:31:47,690 --> 00:31:52,440 And 99% of the time, you're OK, and when [UNINTELLIGIBLE] 730 00:31:52,440 --> 00:31:55,220 finally decided, you can say, oh, that was right. 731 00:31:55,220 --> 00:31:58,460 If you're wrong, you squash everything and restart. 732 00:31:58,460 --> 00:32:02,270 And so modern predictors are these very complicated beasts. 733 00:32:02,270 --> 00:32:05,470 They look at things like what happened previously.
734 00:32:05,470 --> 00:32:08,410 It looks at [UNINTELLIGIBLE] call stack where are the call 735 00:32:08,410 --> 00:32:10,170 stack, what went into the call stack. 736 00:32:10,170 --> 00:32:11,830 And then when you're returning, you can predict 737 00:32:11,830 --> 00:32:12,870 where we might be returning. 738 00:32:12,870 --> 00:32:16,380 A lot of different things go in there, and in fact, we see 739 00:32:16,380 --> 00:32:19,600 the Core microarchitecture that came before Nehalem's 740 00:32:19,600 --> 00:32:21,800 predictor, and Nehalem [UNINTELLIGIBLE], oh, it even 741 00:32:21,800 --> 00:32:24,380 has more things, and so it's even more different. 742 00:32:24,380 --> 00:32:26,480 So these guys are doing these complex things. 743 00:32:26,480 --> 00:32:31,790 What that means is, sometimes it affects the program in a way 744 00:32:31,790 --> 00:32:34,360 that you think, oh, huh, this is something that I don't know 745 00:32:34,360 --> 00:32:35,120 what might happen. 746 00:32:35,120 --> 00:32:35,440 [UNINTELLIGIBLE] 747 00:32:35,440 --> 00:32:37,890 this condition, I might not know what might happen. 748 00:32:37,890 --> 00:32:40,300 But modern predictors sometimes surprise you. 749 00:32:40,300 --> 00:32:43,270 Aha, I know there's a pattern, I recognize that pattern, and 750 00:32:43,270 --> 00:32:44,200 I go about that. 751 00:32:44,200 --> 00:32:46,670 And every architecture, I try different kinds of patterns, 752 00:32:46,670 --> 00:32:49,050 and I find the architectures get better and better. 753 00:32:49,050 --> 00:32:52,770 So for example, at some point, if you go odd-even, odd-even 754 00:32:52,770 --> 00:32:57,350 branches, if you assume I have a loop, I say, if i is odd, 755 00:32:57,350 --> 00:33:00,150 go one direction, otherwise not. 756 00:33:00,150 --> 00:33:01,660 Pentium 3 type thing, OK. 757 00:33:01,660 --> 00:33:03,040 It just screws up [UNINTELLIGIBLE] 758 00:33:03,040 --> 00:33:04,240 because every time it's doing something different.
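The odd-even branch experiment can be mimicked with two toy predictor models in C. These are illustrative assumptions, not the actual predictors of any Pentium or Core processor, and the function names are made up. The first model guesses that a branch does whatever it did last time, so an alternating branch fools it on almost every iteration. The second keeps one bit of history and remembers what followed each history value, which is enough to learn an alternating pattern.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Branch outcomes for a loop body like: if (i is odd) take the branch. */
void fill_odd_even(bool *taken, size_t n) {
    assert(taken != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        taken[i] = (i % 2) == 1;
    }
}

/* Toy model 1: predict the same outcome as the previous execution. */
size_t mispredicts_last_outcome(const bool *taken, size_t n) {
    bool last = false;
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (last != taken[i]) {
            misses++;
        }
        last = taken[i];
    }
    return misses;
}

/* Toy model 2: one bit of history selects a table entry that remembers
 * which outcome followed that history the last time it occurred. */
size_t mispredicts_history1(const bool *taken, size_t n) {
    bool table[2] = { false, false };
    bool history = false;
    size_t misses = 0;
    for (size_t i = 0; i < n; i++) {
        if (table[history] != taken[i]) {
            misses++;
        }
        table[history] = taken[i]; /* learn what followed this history */
        history = taken[i];
    }
    return misses;
}
```

On 100 iterations of the alternating pattern, the first model mispredicts 99 times and the second mispredicts once, during warm-up.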
759 00:33:04,240 --> 00:33:05,912 I don't know what's going on. 760 00:33:05,912 --> 00:33:06,035 [UNINTELLIGIBLE] 761 00:33:06,035 --> 00:33:08,030 Pentium would probably say aha, there's a pattern, odd-even, 762 00:33:08,030 --> 00:33:09,730 odd-even, it figures it out. 763 00:33:09,730 --> 00:33:13,140 Core 2 figures out things like every third, every fourth 764 00:33:13,140 --> 00:33:14,630 type pattern it figures out. 765 00:33:14,630 --> 00:33:16,510 And then things get better and better. 766 00:33:16,510 --> 00:33:19,580 There are very complicated patterns that these guys 767 00:33:19,580 --> 00:33:20,290 manage to figure out. 768 00:33:20,290 --> 00:33:22,600 Which is kind of fun, to see what it can figure 769 00:33:22,600 --> 00:33:23,710 out, what it can't. 770 00:33:23,710 --> 00:33:26,660 By writing a very small program, you can get that. 771 00:33:26,660 --> 00:33:28,980 I'll show some of that. 772 00:33:28,980 --> 00:33:32,700 So the next thing that you have to worry about is the 773 00:33:32,700 --> 00:33:34,130 memory system. 774 00:33:34,130 --> 00:33:36,410 So the memory system, if you want to build the computer, what 775 00:33:36,410 --> 00:33:39,550 you want to do is have a lot of very fast memory 776 00:33:39,550 --> 00:33:41,420 very close to you. 777 00:33:41,420 --> 00:33:42,030 You can do that. 778 00:33:42,030 --> 00:33:43,080 It's going to be very expensive. 779 00:33:43,080 --> 00:33:45,190 Sometimes physically you can't even attain that, 780 00:33:45,190 --> 00:33:48,010 because by the time you build a lot of things, it's actually 781 00:33:48,010 --> 00:33:49,675 already far away from you, because we are 782 00:33:49,675 --> 00:33:50,910 running very fast here. 783 00:33:50,910 --> 00:33:53,610 So the way you deal with that is building caches.
784 00:33:53,610 --> 00:33:56,522 That means you keep a small amount of things close to you 785 00:33:56,522 --> 00:34:01,140 in there, and you put things in that cache in a way that 786 00:34:01,140 --> 00:34:02,720 hopefully, those are the things you 787 00:34:02,720 --> 00:34:03,940 will be using anyways. 788 00:34:03,940 --> 00:34:06,850 So it works very nicely. 789 00:34:06,850 --> 00:34:11,210 So the reason that can be done is in programming, when you 790 00:34:11,210 --> 00:34:13,949 run a program, there are two types of behaviors that it 791 00:34:13,949 --> 00:34:15,480 [UNINTELLIGIBLE] to take advantage of. 792 00:34:15,480 --> 00:34:17,940 One thing is called temporal locality. 793 00:34:17,940 --> 00:34:21,120 What that means is, in most programs, if you use some 794 00:34:21,120 --> 00:34:24,150 data, there's a good chance you are going to use the same 795 00:34:24,150 --> 00:34:26,449 data very soon. 796 00:34:26,449 --> 00:34:29,800 Because normally, there's a thing called a working set. 797 00:34:29,800 --> 00:34:31,560 I am working with a certain set of data. 798 00:34:31,560 --> 00:34:33,590 I'm basically touching these things again and again. 799 00:34:33,590 --> 00:34:35,840 So if I use that, I'm going to use that data again. 800 00:34:35,840 --> 00:34:40,159 So that means I want to keep the data that I used close by. 801 00:34:40,159 --> 00:34:42,290 The other one is called spatial locality. 802 00:34:42,290 --> 00:34:46,960 That means, if I use some data, I have a very good 803 00:34:46,960 --> 00:34:49,610 chance that I might be using a neighboring data item. 804 00:34:49,610 --> 00:34:52,940 Because if I am accessing a structure or an array, if I'm 805 00:34:52,940 --> 00:34:54,630 accessing something, I might be accessing a 806 00:34:54,630 --> 00:34:56,139 [UNINTELLIGIBLE] neighboring thing next. 807 00:34:56,139 --> 00:34:58,110 So that's what's called spatial locality.
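A small C sketch of the two kinds of locality, with made-up function names. Both functions add up the same row-major matrix and reuse the accumulator `s` on every step, which is temporal locality. They differ in spatial locality: `sum_row_major` touches neighboring elements one after another, while `sum_col_major` jumps `cols` elements between consecutive accesses. On large matrices the second version typically misses in the cache far more often, though how much slower it is depends on the machine.

```c
#include <assert.h>
#include <stddef.h>

/* Walks memory in address order: consecutive accesses are neighbors. */
long sum_row_major(const int *m, size_t rows, size_t cols) {
    assert(m != NULL);
    long s = 0;
    for (size_t r = 0; r < rows; r++) {
        for (size_t c = 0; c < cols; c++) {
            s += m[r * cols + c];
        }
    }
    return s;
}

/* Same elements, but consecutive accesses are cols elements apart. */
long sum_col_major(const int *m, size_t rows, size_t cols) {
    assert(m != NULL);
    long s = 0;
    for (size_t c = 0; c < cols; c++) {
        for (size_t r = 0; r < rows; r++) {
            s += m[r * cols + c];
        }
    }
    return s;
}
```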
808 00:34:58,110 --> 00:35:02,290 And taking advantage of these two, you can build this very 809 00:35:02,290 --> 00:35:05,170 fast memory system that feels like you have a huge amount of 810 00:35:05,170 --> 00:35:06,560 memory very close to you. 811 00:35:06,560 --> 00:35:08,070 Of course, the opposite is true. 812 00:35:08,070 --> 00:35:10,830 If your program doesn't behave like that, it looks like you 813 00:35:10,830 --> 00:35:13,900 have lots of memory very far from you, and the program 814 00:35:13,900 --> 00:35:15,150 becomes very, very slow. 815 00:35:20,340 --> 00:35:25,170 Because memories didn't speed up as fast as processors. 816 00:35:25,170 --> 00:35:27,660 So what that means is, every generation, the memory 817 00:35:27,660 --> 00:35:30,230 [UNINTELLIGIBLE] further and further away, slower and 818 00:35:30,230 --> 00:35:31,320 slower and slower. 819 00:35:31,320 --> 00:35:33,220 And this is how we manage to keep the 820 00:35:33,220 --> 00:35:35,630 machines running fast. 821 00:35:35,630 --> 00:35:39,700 So if you look at what's going on at every level, I just gave 822 00:35:39,700 --> 00:35:40,180 you a diagram. 823 00:35:40,180 --> 00:35:41,440 I won't go through the details. 824 00:35:41,440 --> 00:35:42,840 You can look at it later. 825 00:35:42,840 --> 00:35:46,570 There are caches, and at the higher levels you can 826 00:35:46,570 --> 00:35:47,400 only keep very little. 827 00:35:47,400 --> 00:35:49,750 And one like [UNINTELLIGIBLE], you can only keep a couple of 828 00:35:49,750 --> 00:35:50,910 hundred bytes in the registers. 829 00:35:50,910 --> 00:35:53,020 But very fast access. 830 00:35:53,020 --> 00:35:53,930 Then you go to cache. 831 00:35:53,930 --> 00:35:55,750 I see on-chip stuff. 832 00:35:55,750 --> 00:35:58,490 We can keep a little bit more, access gets slower, and then 833 00:35:58,490 --> 00:36:01,130 finally when you go to main memory, you can keep a huge 834 00:36:01,130 --> 00:36:02,750 amount, slow access.
835 00:36:02,750 --> 00:36:05,950 And even at the end, something like tape, you can keep a huge 836 00:36:05,950 --> 00:36:06,940 amount of stuff. 837 00:36:06,940 --> 00:36:08,480 Almost infinite amount of stuff. 838 00:36:08,480 --> 00:36:12,250 But it takes hours to get; tape access is not 839 00:36:12,250 --> 00:36:12,780 [UNINTELLIGIBLE]. 840 00:36:12,780 --> 00:36:16,320 So there's this hierarchy, and there is a certain amount of 841 00:36:16,320 --> 00:36:17,550 things you want. 842 00:36:17,550 --> 00:36:18,750 You are using [UNINTELLIGIBLE] 843 00:36:18,750 --> 00:36:19,255 very fast. 844 00:36:19,255 --> 00:36:20,490 You don't want to put it in tape. 845 00:36:20,490 --> 00:36:23,130 But you know something that you might access in a year, 846 00:36:23,130 --> 00:36:24,840 that's probably a good place to put it. 847 00:36:24,840 --> 00:36:28,070 So there's this hierarchy there. 848 00:36:28,070 --> 00:36:32,390 And when you have a cache, what happens is, you want to give 849 00:36:32,390 --> 00:36:34,790 the illusion everything is there, but when you go to 850 00:36:34,790 --> 00:36:37,430 access it, the data might not be there. 851 00:36:37,430 --> 00:36:40,110 There are many reasons it might not be there, so let's 852 00:36:40,110 --> 00:36:41,850 go through some of the reasons. 853 00:36:41,850 --> 00:36:43,610 So one thing is called a cold miss. 854 00:36:43,610 --> 00:36:46,170 That means it's the first time I've seen that data. 855 00:36:46,170 --> 00:36:48,240 So it cannot be in the cache. 856 00:36:48,240 --> 00:36:49,990 If it's the first time I was asking for this, it's probably 857 00:36:49,990 --> 00:36:52,120 sitting way back in main memory, or somewhere in the 858 00:36:52,120 --> 00:36:54,520 disk, and I have to go get it. 859 00:36:54,520 --> 00:36:57,910 One way to get around this problem is prefetching. 860 00:36:57,910 --> 00:37:00,130 So if I keep accessing this I say, aha.
861 00:37:00,130 --> 00:37:02,350 You're accessing this in this pattern. 862 00:37:02,350 --> 00:37:05,250 And there's a good probability the next pattern is here, and 863 00:37:05,250 --> 00:37:07,246 OK, go get it very fast, because you're coming in that 864 00:37:07,246 --> 00:37:08,170 sort of direction. 865 00:37:08,170 --> 00:37:11,230 So by doing prefetching, you might be able to get rid of 866 00:37:11,230 --> 00:37:13,530 [UNINTELLIGIBLE] is happening. 867 00:37:13,530 --> 00:37:15,530 Then there's a thing called a capacity miss. 868 00:37:15,530 --> 00:37:18,550 What that means is, you're accessing a lot of data. 869 00:37:18,550 --> 00:37:22,470 At some point, you have accessed enough data, 870 00:37:22,470 --> 00:37:25,110 everything you accessed cannot fit in the cache, and you 871 00:37:25,110 --> 00:37:27,720 have to throw out some of the data to put the next one. 872 00:37:27,720 --> 00:37:29,900 So most caches use a policy called 873 00:37:29,900 --> 00:37:30,960 least recently used. 874 00:37:30,960 --> 00:37:34,590 That means the data that you haven't accessed for the longest time 875 00:37:34,590 --> 00:37:36,860 gets thrown out of the cache. 876 00:37:36,860 --> 00:37:38,632 But the problem is, at some point, you're going to come 877 00:37:38,632 --> 00:37:40,340 back to it. 878 00:37:40,340 --> 00:37:43,410 And if you come back to it after a long time, that thing 879 00:37:43,410 --> 00:37:45,340 has been out of that level of cache. 880 00:37:45,340 --> 00:37:46,600 This happens at every level. 881 00:37:46,600 --> 00:37:48,210 So what that means is, you have a thing called a working 882 00:37:48,210 --> 00:37:50,770 set, that you keep using again and again. 883 00:37:50,770 --> 00:37:53,520 If the working set is a little bit larger than the cache, 884 00:37:53,520 --> 00:37:55,780 when you come back to that, the data is gone. 885 00:37:55,780 --> 00:37:56,610 It's not in the cache.
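The remark about a working set that is slightly larger than the cache can be checked with a toy simulation in C. The sketch below models a small fully associative cache with least-recently-used replacement and sweeps repeatedly over a working set of blocks. This is an assumed, simplified model: real caches are set associative and often only approximate LRU. `lru_misses` and `MAX_LINES` are names made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LINES 64 /* upper bound on the modeled cache size, in lines */

/* Count misses when sweeping over blocks 0..working_set-1, `passes` times,
 * through a fully associative LRU cache holding `capacity` lines. */
size_t lru_misses(size_t capacity, size_t working_set, size_t passes) {
    size_t tag[MAX_LINES];
    size_t last_use[MAX_LINES]; /* 0 means the line is still empty */
    size_t misses = 0;
    size_t now = 0;

    assert(capacity > 0 && capacity <= MAX_LINES);
    assert(working_set > 0);
    for (size_t i = 0; i < capacity; i++) {
        tag[i] = 0;
        last_use[i] = 0;
    }

    for (size_t p = 0; p < passes; p++) {
        for (size_t block = 0; block < working_set; block++) {
            size_t victim = 0;
            bool hit = false;
            now++;
            for (size_t i = 0; i < capacity; i++) {
                if (last_use[i] != 0 && tag[i] == block) {
                    victim = i; /* found it: refresh this line */
                    hit = true;
                    break;
                }
                if (last_use[i] < last_use[victim]) {
                    victim = i; /* oldest (or empty) line seen so far */
                }
            }
            if (!hit) {
                misses++;
            }
            tag[victim] = block;
            last_use[victim] = now;
        }
    }
    return misses;
}
```

With 4 lines and a working set of 4 blocks, ten passes cost only the 4 cold misses. With a working set of 5 blocks, every one of the 50 accesses misses, because LRU always evicts the block that is needed next.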
886 00:37:56,610 --> 00:37:58,680 So you want to kind of create a working set that fits in the 887 00:37:58,680 --> 00:38:01,850 cache nicely, at every level, basically. 888 00:38:01,850 --> 00:38:04,170 You can do something like prefetching to get around 889 00:38:04,170 --> 00:38:06,290 it, but if you can avoid [UNINTELLIGIBLE], that's 890 00:38:06,290 --> 00:38:08,005 really nice. 891 00:38:08,005 --> 00:38:11,370 Then there's a thing called a conflict miss. 892 00:38:11,370 --> 00:38:17,640 The way a cache works-- it doesn't let you store 893 00:38:17,640 --> 00:38:21,100 everything in every location. 894 00:38:21,100 --> 00:38:23,570 Sometimes there are some locations in cache 895 00:38:23,570 --> 00:38:24,140 [UNINTELLIGIBLE] 896 00:38:24,140 --> 00:38:26,030 associated with it. 897 00:38:26,030 --> 00:38:27,280 Did you get [UNINTELLIGIBLE] associated in 004? 898 00:38:30,960 --> 00:38:32,060 Associativity? 899 00:38:32,060 --> 00:38:32,860 OK. 900 00:38:32,860 --> 00:38:35,140 Again, I will talk about cache associated [UNINTELLIGIBLE], 901 00:38:35,140 --> 00:38:37,500 but I'm going to talk about memory system later and 902 00:38:37,500 --> 00:38:38,690 [UNINTELLIGIBLE] there. 903 00:38:38,690 --> 00:38:42,140 And what that means is, this is a really bad behavior, 904 00:38:42,140 --> 00:38:48,180 because you are only accessing a small amount of things, but 905 00:38:48,180 --> 00:38:50,135 all of the data, all the aggregate can 906 00:38:50,135 --> 00:38:51,580 fit into the cache. 907 00:38:51,580 --> 00:38:53,970 The pattern makes it only fit into a small part of the 908 00:38:53,970 --> 00:38:55,510 cache, so you have to throw the data out. 909 00:38:55,510 --> 00:38:56,250 I'll get back to this. 910 00:38:56,250 --> 00:38:58,140 I'm just going to put it there. 911 00:38:58,140 --> 00:39:01,340 And then in a multiprocessor system, [UNINTELLIGIBLE] 912 00:39:01,340 --> 00:39:03,580 multicore, there's a thing called true sharing.
913 00:39:03,580 --> 00:39:05,870 That means I [UNINTELLIGIBLE] data [UNINTELLIGIBLE], and the 914 00:39:05,870 --> 00:39:06,740 next time [UNINTELLIGIBLE] 915 00:39:06,740 --> 00:39:07,690 want to trust the data. 916 00:39:07,690 --> 00:39:10,200 So I have to give the data to you, so you're getting cache. 917 00:39:10,200 --> 00:39:12,560 And then when I want to use my data, it's [UNINTELLIGIBLE]. 918 00:39:12,560 --> 00:39:13,610 I have to get it back to me. 919 00:39:13,610 --> 00:39:17,120 So I'm kind of using data back and forth, back and forth. 920 00:39:17,120 --> 00:39:19,800 So that's called true sharing. 921 00:39:19,800 --> 00:39:23,560 And a more hideous form of that is called false sharing. 922 00:39:23,560 --> 00:39:27,130 That means most of the data is in a cache line. 923 00:39:27,130 --> 00:39:29,560 When you move data you move the cache line. 924 00:39:29,560 --> 00:39:33,995 So what I'm doing, is I'm using my data myself, and you 925 00:39:33,995 --> 00:39:36,570 are using some other data, but unfortunately, they're in the 926 00:39:36,570 --> 00:39:38,760 same cache line. 927 00:39:38,760 --> 00:39:41,300 So when I say, OK, I want my data, I get the entire cache 928 00:39:41,300 --> 00:39:43,710 line, including your data, and then you 929 00:39:43,710 --> 00:39:44,640 want to use your data. 930 00:39:44,640 --> 00:39:46,480 You say, oops, it's not with me. 931 00:39:46,480 --> 00:39:48,430 And I then that means you [UNINTELLIGIBLE] 932 00:39:48,430 --> 00:39:51,260 go to you, and at that point, I don't have 933 00:39:51,260 --> 00:39:52,330 my data in my cache. 934 00:39:52,330 --> 00:39:54,200 So this data is going back and forth. 935 00:39:54,200 --> 00:39:57,720 Even though I never touch your data, we have two separate 936 00:39:57,720 --> 00:39:59,330 data, but it's going back. 
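[Editor's sketch] The false sharing ping-pong described above can be counted with a toy ownership model. This is not a real coherence protocol: it assumes a 64-byte cache line (a common size, not stated in the lecture) and simply says a line lives in one core's cache at a time. The names `count_line_transfers` and `alternate` are hypothetical.

```python
LINE_SIZE = 64  # bytes per cache line (assumed for illustration)

def count_line_transfers(writes):
    """writes: iterable of (core, byte_address) pairs.

    Returns how many times a cache line has to be brought into a core's
    cache because that core did not already hold it."""
    owner = {}      # cache line number -> core currently holding the line
    transfers = 0
    for core, addr in writes:
        line = addr // LINE_SIZE
        if owner.get(line) != core:
            transfers += 1        # the line moves (or is fetched the first time)
            owner[line] = core
    return transfers

def alternate(addr0, addr1, rounds):
    """Core 0 writes addr0, then core 1 writes addr1, repeated `rounds` times."""
    writes = []
    for _ in range(rounds):
        writes.append((0, addr0))
        writes.append((1, addr1))
    return writes

# Each core only ever touches its own variable.
same_line = count_line_transfers(alternate(0, 8, 1000))        # both variables in line 0
separate_lines = count_line_transfers(alternate(0, 64, 1000))  # variables in lines 0 and 1
print(same_line, separate_lines)
```

Under these assumptions the two variables that share a line cause a transfer on every one of the 2000 writes, while placing them on separate lines leaves only the 2 initial fetches, even though neither core ever touches the other's data.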
937 00:39:59,330 --> 00:40:01,870 I will get this things in a lot more detail in later 938 00:40:01,870 --> 00:40:03,760 lectures, so this is kind of giving [INAUDIBLE]. 939 00:40:03,760 --> 00:40:05,950 So there are all of these different ways that the data 940 00:40:05,950 --> 00:40:08,770 can be not in the cache. 941 00:40:08,770 --> 00:40:14,350 So here is what the processor you are 942 00:40:14,350 --> 00:40:15,700 working with looks like. 943 00:40:15,700 --> 00:40:19,230 There are 6 cores, and there are L1 separate instruction 944 00:40:19,230 --> 00:40:22,950 and data caches, and then there's L2 unified cache, 945 00:40:22,950 --> 00:40:23,910 [UNINTELLIGIBLE] 946 00:40:23,910 --> 00:40:27,850 L3 cache or [UNINTELLIGIBLE], and then there's main memory. 947 00:40:27,850 --> 00:40:29,830 So it's even this deep, deep, deep cache 948 00:40:29,830 --> 00:40:30,700 hierarchy going on then. 949 00:40:30,700 --> 00:40:33,730 And then right-hand side, I kind of showed the difference. 950 00:40:33,730 --> 00:40:38,690 And the interesting thing to realize is, the L1 cache 951 00:40:38,690 --> 00:40:40,610 delays about 4 nanoseconds. 952 00:40:40,610 --> 00:40:48,550 It gets a little bit, 2 1/2 times slower when you go to L2 953 00:40:48,550 --> 00:40:52,250 cache, and [? 12 ?] times slower when you go to L3 954 00:40:52,250 --> 00:40:55,540 cache, and even more slow when you go to main memory. 955 00:40:55,540 --> 00:40:57,560 So every time you go down, it gets slower 956 00:40:57,560 --> 00:40:59,020 and slower and slower. 957 00:40:59,020 --> 00:41:01,130 That's basically the gist of it in here. 958 00:41:04,020 --> 00:41:06,404 Question? 959 00:41:06,404 --> 00:41:10,292 AUDIENCE: Each core has two L1 cache for instructions and 960 00:41:10,292 --> 00:41:10,778 [INAUDIBLE]? 961 00:41:10,778 --> 00:41:12,780 SAMAN AMARASINGHE: And instruction data. 
962 00:41:12,780 --> 00:41:14,040 So instruction goes one direction, 963 00:41:14,040 --> 00:41:16,440 data goes in one direction. 964 00:41:16,440 --> 00:41:20,900 So next I want to talk a little bit about profiling. 965 00:41:20,900 --> 00:41:24,070 So project two is going to be-- 966 00:41:24,070 --> 00:41:25,250 the first part of the project 967 00:41:25,250 --> 00:41:27,360 is all about profiling. 968 00:41:27,360 --> 00:41:29,470 Profiling is very important, because you run a program. 969 00:41:29,470 --> 00:41:31,740 It doesn't do well. 970 00:41:31,740 --> 00:41:35,750 So how do you know what's going on? 971 00:41:35,750 --> 00:41:38,040 First of all, do you even know if it's not doing well? 972 00:41:38,040 --> 00:41:40,140 And if it's not doing well, what's the reason? 973 00:41:40,140 --> 00:41:43,530 So just having one number, saying it's 974 00:41:43,530 --> 00:41:45,310 [UNINTELLIGIBLE] in 10 minutes, [UNINTELLIGIBLE] 975 00:41:45,310 --> 00:41:47,090 5 minutes, doesn't tell you too much. 976 00:41:47,090 --> 00:41:49,150 You don't know where to look if you have a large program. 977 00:41:49,150 --> 00:41:51,960 So profiling means going, looking at what the program is 978 00:41:51,960 --> 00:41:55,310 doing to get an understanding in there. 979 00:41:55,310 --> 00:41:59,560 So what you want to do is collect this performance data 980 00:41:59,560 --> 00:42:03,260 while running the application, and then if you have a large 981 00:42:03,260 --> 00:42:06,410 application, you want to organize and display data in 982 00:42:06,410 --> 00:42:11,110 a variety of ways, and a lot of times, relate that data 983 00:42:11,110 --> 00:42:13,050 to source code, and see what that means in the 984 00:42:13,050 --> 00:42:13,410 [UNINTELLIGIBLE] 985 00:42:13,410 --> 00:42:13,960 code. 986 00:42:13,960 --> 00:42:16,620 And hopefully by looking at this one, you can identify, 987 00:42:16,620 --> 00:42:19,950 ha, there's a problem here.
988 00:42:19,950 --> 00:42:22,130 So there are a bunch of tools. 989 00:42:22,130 --> 00:42:24,980 Intel Vtune, gprof, oprofile, perf. 990 00:42:24,980 --> 00:42:29,170 So you guys are going to use mainly perf, and we will 991 00:42:29,170 --> 00:42:34,970 probably talk a little bit about gprof. 992 00:42:34,970 --> 00:42:37,640 And next time when you come, next lecture we'll talk about 993 00:42:37,640 --> 00:42:40,270 this a lot more. 994 00:42:40,270 --> 00:42:45,690 So profiling is mainly to find where in an application or a 995 00:42:45,690 --> 00:42:48,710 system there is a significant amount of activity. 996 00:42:48,710 --> 00:42:50,710 And when the significant amount of activity you want to 997 00:42:50,710 --> 00:42:53,350 know whether those activity can be avoided, or there's a 998 00:42:53,350 --> 00:42:54,470 problem here. 999 00:42:54,470 --> 00:42:57,700 So there are some [UNINTELLIGIBLE] 1000 00:42:57,700 --> 00:43:01,970 significant to what's there, it could be anywhere. 1001 00:43:01,970 --> 00:43:03,360 It might be addressed in the memory, 1002 00:43:03,360 --> 00:43:04,840 something might be happening. 1003 00:43:04,840 --> 00:43:06,630 It might be in the [UNINTELLIGIBLE] 1004 00:43:06,630 --> 00:43:09,570 system, some kind of process in the operating system might 1005 00:43:09,570 --> 00:43:11,950 be happening in one thread, it might be happening in an 1006 00:43:11,950 --> 00:43:14,510 executable file or a module. 1007 00:43:14,510 --> 00:43:16,340 If you know the symbol of something, you can say, oh, 1008 00:43:16,340 --> 00:43:19,400 huh, that mode, that actually is this function, and if you 1009 00:43:19,400 --> 00:43:21,692 know even more information about the program, you can 1010 00:43:21,692 --> 00:43:23,060 say, aha, that's false, and in fact it's 1011 00:43:23,060 --> 00:43:24,680 this line of the program. 
1012 00:43:24,680 --> 00:43:27,900 So you get this information from the application when you 1013 00:43:27,900 --> 00:43:31,220 compile that you can say debug information, say aha, this 1014 00:43:31,220 --> 00:43:32,810 happens in this line of the program 1015 00:43:32,810 --> 00:43:33,880 that's having this problem. 1016 00:43:33,880 --> 00:43:36,180 So you can break it down and get that info like that. 1017 00:43:36,180 --> 00:43:39,030 [PHONE RINGING] 1018 00:43:39,030 --> 00:43:39,680 Somebody's trying to call me. 1019 00:43:39,680 --> 00:43:40,015 Hold on. 1020 00:43:40,015 --> 00:43:42,300 OK. 1021 00:43:42,300 --> 00:43:45,500 Secondly, we care about significant activity. 1022 00:43:45,500 --> 00:43:49,890 If the activity just happened only a few times, happened 1023 00:43:49,890 --> 00:43:52,040 only nanoseconds of execution time, who cares? 1024 00:43:52,040 --> 00:43:52,640 You don't want to do that. 1025 00:43:52,640 --> 00:43:55,250 You want to focus on things that matter. 1026 00:43:55,250 --> 00:43:58,060 And the key thing about this [UNINTELLIGIBLE]. 1027 00:43:58,060 --> 00:44:01,390 If you spend most of your time in insignificant things, you 1028 00:44:01,390 --> 00:44:03,600 would get insignificant performance improvement. 1029 00:44:03,600 --> 00:44:06,370 If you found the significant part and work on that, you can 1030 00:44:06,370 --> 00:44:07,410 get significant gains. 1031 00:44:07,410 --> 00:44:09,040 So the key thing is to find the significant ones. 1032 00:44:09,040 --> 00:44:10,018 [PHONE RINGING] 1033 00:44:10,018 --> 00:44:11,268 Excuse me. 1034 00:44:20,290 --> 00:44:22,010 OK. 1035 00:44:22,010 --> 00:44:23,895 And the final activity. 1036 00:44:26,680 --> 00:44:30,290 So activity means time is spent in doing something, and 1037 00:44:30,290 --> 00:44:31,610 some activities are bad ones. 1038 00:44:31,610 --> 00:44:32,770 Like if you actually do running 1039 00:44:32,770 --> 00:44:33,690 instruction, that's good. 
1040 00:44:33,690 --> 00:44:35,820 If your running is useful instruction, a long 1041 00:44:35,820 --> 00:44:36,770 time, you're OK. 1042 00:44:36,770 --> 00:44:39,040 But you might doing actually [UNINTELLIGIBLE], 1043 00:44:39,040 --> 00:44:39,390 [UNINTELLIGIBLE] 1044 00:44:39,390 --> 00:44:41,280 [? misprediction ?], stuff like 1045 00:44:41,280 --> 00:44:42,500 that, that's bad activity. 1046 00:44:42,500 --> 00:44:43,270 You [UNINTELLIGIBLE] say, aha. 1047 00:44:43,270 --> 00:44:45,620 I am spending a lot of time doing something that I can 1048 00:44:45,620 --> 00:44:47,020 avoid, and how do I avoid that? 1049 00:44:47,020 --> 00:44:50,320 So that's what you want to look at. 1050 00:44:50,320 --> 00:44:54,190 And you have two ways of doing that, and I want to give you 1051 00:44:54,190 --> 00:44:57,300 an analogy to figure out what it is. 1052 00:44:57,300 --> 00:44:59,790 So assume you are going, visiting a bunch 1053 00:44:59,790 --> 00:45:02,360 of different places. 1054 00:45:02,360 --> 00:45:03,750 You're on a city tour. 1055 00:45:03,750 --> 00:45:05,920 You are visiting different parts of the city. 1056 00:45:05,920 --> 00:45:09,440 And I want to figure out, where do you spend 1057 00:45:09,440 --> 00:45:12,440 most of your time? 1058 00:45:12,440 --> 00:45:15,950 And that's a hard problem, because I am sitting in my 1059 00:45:15,950 --> 00:45:17,630 office, and say, OK, you're going [UNINTELLIGIBLE] 1060 00:45:17,630 --> 00:45:17,890 the city. 1061 00:45:17,890 --> 00:45:20,020 I want to figure out where you spend most of the time. 1062 00:45:20,020 --> 00:45:22,180 I have two ways of doing that. 1063 00:45:22,180 --> 00:45:22,925 I'm a busy person. 1064 00:45:22,925 --> 00:45:25,880 What I can do is every 30 minutes, I can call you and 1065 00:45:25,880 --> 00:45:26,920 say, where are you? 
1066 00:45:26,920 --> 00:45:29,230 And you say, OK, I'm in this library, I'm in this-- 1067 00:45:29,230 --> 00:45:31,310 and then at that point, I can have a histogram and say, he's 1068 00:45:31,310 --> 00:45:32,640 still in the library. 1069 00:45:32,640 --> 00:45:34,260 And I can call again, where are you? 1070 00:45:34,260 --> 00:45:35,270 And I can log that. 1071 00:45:35,270 --> 00:45:37,000 And at some point, at the end of the day, I'll have a 1072 00:45:37,000 --> 00:45:40,820 histogram saying, every time I call, you found something. 1073 00:45:40,820 --> 00:45:43,660 If you spend a lot of time in the science museum, I will 1074 00:45:43,660 --> 00:45:45,100 have a bunch of [UNINTELLIGIBLE]. 1075 00:45:45,100 --> 00:45:46,080 We'll see [UNINTELLIGIBLE] 1076 00:45:46,080 --> 00:45:50,020 become then, and if you are not [UNINTELLIGIBLE], if you 1077 00:45:50,020 --> 00:45:51,470 have only one [UNINTELLIGIBLE] 1078 00:45:51,470 --> 00:45:53,690 find out, OK, you're not spending time there. 1079 00:45:53,690 --> 00:45:56,600 That's one way to do that. 1080 00:45:56,600 --> 00:45:58,250 Let's go [UNINTELLIGIBLE] 1081 00:45:58,250 --> 00:46:02,570 saying, for every landmark, I create a telephone booth. 1082 00:46:02,570 --> 00:46:04,920 And every time you enter a landmark, you [UNINTELLIGIBLE] 1083 00:46:04,920 --> 00:46:07,260 the telephone booth, you've got to call an operator and 1084 00:46:07,260 --> 00:46:08,850 say what time it is, so [UNINTELLIGIBLE]. 1085 00:46:08,850 --> 00:46:10,790 And then you call me and say, aha, I'm entering this 1086 00:46:10,790 --> 00:46:13,530 landmark, the time now is 5:50, and I write it down. 1087 00:46:13,530 --> 00:46:15,510 And every time you exit the landmark, you call me and say, 1088 00:46:15,510 --> 00:46:16,670 I'm exiting this landmark. 1089 00:46:16,670 --> 00:46:17,250 Time now is-- 1090 00:46:17,250 --> 00:46:18,510 I write it down. 1091 00:46:18,510 --> 00:46:19,040 OK? 
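[Editor's sketch] The phone booth at every landmark corresponds to wrapping every function so that it reports on entry and on exit. The following is a minimal illustration of that idea, not how any particular profiler is implemented. The decorator `instrumented` and the two landmark functions are invented for the example.

```python
import time
from collections import defaultdict
from functools import wraps

visits = defaultdict(int)        # landmark name -> exact number of entries
time_spent = defaultdict(float)  # landmark name -> total seconds spent inside

def instrumented(func):
    """Wrap func so that every entry and exit is logged (the 'phone booth')."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # the call on the way in
        try:
            return func(*args, **kwargs)
        finally:
            # the call on the way out
            time_spent[func.__name__] += time.perf_counter() - start
            visits[func.__name__] += 1
    return wrapper

@instrumented
def library():
    return sum(range(1000))

@instrumented
def science_museum():
    return sum(range(100000))

for _ in range(3):
    library()
science_museum()
print(dict(visits))
```

The visit counts are exact, because every entry is logged. The price is that every call pays for two timer reads and two dictionary updates, and each function had to be modified to be measured at all.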
1092 00:46:19,040 --> 00:46:20,640 These are kind of two ways of doing that. 1093 00:46:20,640 --> 00:46:22,780 That's like an instrumentation solution. 1094 00:46:22,780 --> 00:46:26,580 So the sampling-based collector periodically 1095 00:46:26,580 --> 00:46:30,980 interrupts the processor and looks at where you are, and 1096 00:46:30,980 --> 00:46:35,030 depending on where you are, you can mark that off. 1097 00:46:35,030 --> 00:46:37,960 And what's called time-based sampling means every so often, 1098 00:46:37,960 --> 00:46:41,190 every 100 milliseconds or something, in [UNINTELLIGIBLE] 1099 00:46:41,190 --> 00:46:44,360 processing, you stop and say where you are. 1100 00:46:44,360 --> 00:46:48,120 Event-based sampling means you are counting number of events 1101 00:46:48,120 --> 00:46:49,310 like cache misses. 1102 00:46:49,310 --> 00:46:51,930 Every hundred cache misses, you stop and say, OK, where 1103 00:46:51,930 --> 00:46:53,180 has this happened? 1104 00:46:53,180 --> 00:46:55,860 Of course, if the cache misses happen in a regular pattern, 1105 00:46:55,860 --> 00:46:57,410 you might be in trouble, because every hundred, you 1106 00:46:57,410 --> 00:47:00,130 might be in the same place, then you would get a skewed number. 1107 00:47:00,130 --> 00:47:01,610 But most probably, there's a statistical 1108 00:47:01,610 --> 00:47:03,480 variation in there. 1109 00:47:03,480 --> 00:47:04,960 You can get [? account index ?] 1110 00:47:04,960 --> 00:47:06,300 and figure out where these things happen. 1111 00:47:06,300 --> 00:47:08,100 So if you're looking at where all the cache [UNINTELLIGIBLE] 1112 00:47:08,100 --> 00:47:09,910 is happening, you don't look at every miss, because there 1113 00:47:09,910 --> 00:47:11,230 are millions of misses. 1114 00:47:11,230 --> 00:47:13,740 Every 10,000 misses, you figure out where you are, and 1115 00:47:13,740 --> 00:47:17,290 statistically, if one place is missing many times, that will add up in 1116 00:47:17,290 --> 00:47:19,550 that column.
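[Editor's sketch] The warning about regular patterns in event-based sampling can be reproduced with made-up data. Here a stream of 100,000 miss events is attributed to two hypothetical locations, `hot_loop` (99% of the misses) and `setup` (1%), and every hundredth event is sampled. The same events are sampled twice: once in a perfectly regular order and once shuffled with a fixed seed.

```python
import random
from collections import Counter

def sample_every(events, period):
    """Record the location of every `period`-th event (event-based sampling)."""
    return Counter(loc for i, loc in enumerate(events, start=1) if i % period == 0)

# Regular stream: 99 misses in hot_loop, then 1 in setup, repeated 1000 times.
regular = (["hot_loop"] * 99 + ["setup"]) * 1000

# Same events, but without the fixed rhythm.
shuffled = regular[:]
random.Random(0).shuffle(shuffled)

skewed = sample_every(regular, 100)
fair = sample_every(shuffled, 100)
print(skewed)
print(fair)
```

With the regular stream, every sample lands on `setup`, so the histogram blames the location that causes only 1% of the misses. Once the order has some variation, the samples fall mostly on `hot_loop`, roughly in proportion to where the misses really happen.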
1117 00:47:19,550 --> 00:47:22,200 And the nice thing about that is, there's 1118 00:47:22,200 --> 00:47:23,120 nothing you need to do. 1119 00:47:23,120 --> 00:47:26,944 No installation, no changes to the application you need to do. 1120 00:47:26,944 --> 00:47:29,240 [UNINTELLIGIBLE] here, you don't have to go in and install 1121 00:47:29,240 --> 00:47:31,560 phone booths everywhere. 1122 00:47:31,560 --> 00:47:33,100 Wide coverage. 1123 00:47:33,100 --> 00:47:34,180 You can cover everything. 1124 00:47:34,180 --> 00:47:38,550 So that means, assume you install phone booths in all the 1125 00:47:38,550 --> 00:47:40,660 fixed landmarks, but there's a service [UNINTELLIGIBLE] down. 1126 00:47:40,660 --> 00:47:41,670 I didn't see. 1127 00:47:41,670 --> 00:47:42,540 I don't have a phone booth there. 1128 00:47:42,540 --> 00:47:45,550 But here, if I'm calling you, I know I'm [UNINTELLIGIBLE] 1129 00:47:45,550 --> 00:47:48,700 you, anyplace you are, I will cover, even though I haven't 1130 00:47:48,700 --> 00:47:50,960 anticipated that point. 1131 00:47:50,960 --> 00:47:55,840 Very low overhead, because I can decide to call you every 1132 00:47:55,840 --> 00:47:59,525 30 minutes or call you every 1 minute, if I really care or 1133 00:47:59,525 --> 00:48:03,080 worry about you, and I can control the overhead that I'm 1134 00:48:03,080 --> 00:48:05,410 looking at, basically. 1135 00:48:05,410 --> 00:48:07,685 The problem is its approximate precision. 1136 00:48:07,685 --> 00:48:12,610 Because if you spend 5 hours at MIT and 30 seconds at 1137 00:48:12,610 --> 00:48:14,500 Harvard, I will never know that you went to Harvard, 1138 00:48:14,500 --> 00:48:15,790 because I might not have called you while you were 1139 00:48:15,790 --> 00:48:16,470 there at Harvard. 1140 00:48:16,470 --> 00:48:19,990 You might think it's boring and come back, and I never 1141 00:48:19,990 --> 00:48:21,140 knew that you did that.
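[Editor's sketch] The MIT and Harvard example can be replayed as a time-based sampling simulation. The tour below is invented to match the story: 5 hours at MIT in total, interrupted by a 30-second visit to Harvard. The function `sample_tour` and the exact split of the MIT time are assumptions made for illustration.

```python
from collections import Counter

def sample_tour(tour, period):
    """tour: list of (place, seconds) in visiting order.

    Returns a Counter of where each periodic 'phone call' found the visitor."""
    hits = Counter()
    total = sum(seconds for _, seconds in tour)
    t = period
    while t < total:
        elapsed = 0
        for place, seconds in tour:
            if elapsed <= t < elapsed + seconds:  # the visitor is here at time t
                hits[place] += 1
                break
            elapsed += seconds
        t += period
    return hits

# 5 hours at MIT overall, with a 30-second detour to Harvard in between.
tour = [("MIT", 10000), ("Harvard", 30), ("MIT", 8000)]

coarse = sample_tour(tour, 30 * 60)  # one call every 30 minutes
fine = sample_tour(tour, 10)         # one call every 10 seconds
print(coarse)
print(fine["Harvard"], sum(fine.values()))
```

With a call every 30 minutes, all 10 samples say MIT and the Harvard visit is invisible. Calling every 10 seconds does catch Harvard, in 3 of 1802 samples, but at 180 times the number of interruptions, which is the overhead knob the lecture describes.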
1142 00:48:21,140 --> 00:48:23,930 And also, the other thing is, I don't know exactly how many 1143 00:48:23,930 --> 00:48:26,160 times you went to a place, because I might have called 1144 00:48:26,160 --> 00:48:30,880 you 10 times, and I found you in the museum of the science 1145 00:48:30,880 --> 00:48:33,940 10 times, but you might have gone there 20 times, and I 1146 00:48:33,940 --> 00:48:35,830 might have missed you under the times I didn't call, or 1147 00:48:35,830 --> 00:48:37,700 you might have gone there only five times and stayed for a 1148 00:48:37,700 --> 00:48:39,510 long time, and I might have called multiple times. 1149 00:48:39,510 --> 00:48:42,500 I don't know that, and I don't know the count of times you 1150 00:48:42,500 --> 00:48:43,750 are gone, which is hard to know. 1151 00:48:46,140 --> 00:48:50,810 So the main thing is there might be things that are not 1152 00:48:50,810 --> 00:48:53,340 that statistically significant that you might miss. 1153 00:48:53,340 --> 00:48:56,380 And if you care about that, it's not a good bet. 1154 00:48:56,380 --> 00:48:59,090 But most of the time we care about the really statistically 1155 00:48:59,090 --> 00:49:03,250 significant things, so this is a really good method. 1156 00:49:03,250 --> 00:49:08,050 The other part is perfect accuracy, because every time 1157 00:49:08,050 --> 00:49:11,920 you go somewhere, you have to call me, and I know how many 1158 00:49:11,920 --> 00:49:14,030 times you went to the museum of science. 1159 00:49:14,030 --> 00:49:16,335 I know exactly the time you enter and exited a 1160 00:49:16,335 --> 00:49:16,900 [UNINTELLIGIBLE]. 1161 00:49:16,900 --> 00:49:19,590 I know all of those things [INAUDIBLE]. 1162 00:49:19,590 --> 00:49:21,780 The problem of that is kind of low granularity. 
1163 00:49:21,780 --> 00:49:24,590 I can't put phone booth at every corner, and if you had 1164 00:49:24,590 --> 00:49:25,930 called me at every corner, it's going 1165 00:49:25,930 --> 00:49:27,270 to be way too expensive. 1166 00:49:27,270 --> 00:49:29,195 So I have low granularity, very good information, and 1167 00:49:29,195 --> 00:49:30,670 also high overhead. 1168 00:49:30,670 --> 00:49:32,980 Because if you're going in and out a lot to a building, you 1169 00:49:32,980 --> 00:49:34,550 have to call, you have to stop and call, 1170 00:49:34,550 --> 00:49:35,390 it's very high overhead. 1171 00:49:35,390 --> 00:49:38,080 And there's not much of a way for me to control that. 1172 00:49:38,080 --> 00:49:39,140 And it's also high touch. 1173 00:49:39,140 --> 00:49:41,000 That means I have to build all that infrastructure, I have to 1174 00:49:41,000 --> 00:49:44,830 go and modify your application, basically, in 1175 00:49:44,830 --> 00:49:45,740 something [UNINTELLIGIBLE] 1176 00:49:45,740 --> 00:49:48,630 compile time application, get modified to basically have all 1177 00:49:48,630 --> 00:49:49,490 these [UNINTELLIGIBLE] 1178 00:49:49,490 --> 00:49:50,740 installed into the application. 1179 00:49:52,900 --> 00:49:56,490 So you can look a lot of different types of events. 1180 00:49:56,490 --> 00:50:00,320 So in Intel, if you look at core performance counters, 1181 00:50:00,320 --> 00:50:02,420 there are hundreds of performance counters. 1182 00:50:02,420 --> 00:50:05,870 And some of these performance counters have no sense 1183 00:50:05,870 --> 00:50:06,350 whatsoever. 1184 00:50:06,350 --> 00:50:09,120 So for example, Intel has this counter called 1185 00:50:09,120 --> 00:50:11,370 number of bogus branches. 1186 00:50:11,370 --> 00:50:12,650 What's a bogus branch? 1187 00:50:12,650 --> 00:50:14,970 I mean, why should Intel be doing anything bogus? 1188 00:50:14,970 --> 00:50:16,920 It's some [UNINTELLIGIBLE] 1189 00:50:16,920 --> 00:50:17,450 came out. 
1190 00:50:17,450 --> 00:50:20,676 But there are some counters that are useful, so instead 1191 00:50:20,676 --> 00:50:21,790 of getting [UNINTELLIGIBLE] 1192 00:50:21,790 --> 00:50:24,380 by thousands of things available, we focus on things 1193 00:50:24,380 --> 00:50:25,220 that we care about. 1194 00:50:25,220 --> 00:50:28,120 Things like branch events, load store events, cache 1195 00:50:28,120 --> 00:50:29,820 misses, prefetches. 1196 00:50:29,820 --> 00:50:32,000 Those are the important things, and some multicore 1197 00:50:32,000 --> 00:50:34,240 events that we can look at, and get some interesting 1198 00:50:34,240 --> 00:50:36,500 information. 1199 00:50:36,500 --> 00:50:40,730 And a lot of times, just looking at numbers doesn't 1200 00:50:40,730 --> 00:50:42,262 make too much sense. 1201 00:50:42,262 --> 00:50:47,290 You know you had 5 billion, 300 million branches missed. 1202 00:50:47,290 --> 00:50:49,600 Is it good? High or low? 1203 00:50:49,600 --> 00:50:51,020 You have no idea. 1204 00:50:51,020 --> 00:50:52,660 The right thing is, OK. 1205 00:50:52,660 --> 00:50:53,260 [UNINTELLIGIBLE] 1206 00:50:53,260 --> 00:50:55,920 to the number of instructions executed how much it would be. 1207 00:50:55,920 --> 00:50:58,750 The most important numbers are ratios. 1208 00:50:58,750 --> 00:51:01,650 So from the branches executed, how many were missed? 1209 00:51:01,650 --> 00:51:03,710 That's a lot more important than, you have 5 billion 1210 00:51:03,710 --> 00:51:05,140 branch misses. 1211 00:51:05,140 --> 00:51:06,420 That doesn't say anything to me. 1212 00:51:06,420 --> 00:51:08,540 So most of the time, you have to figure out the right 1213 00:51:08,540 --> 00:51:11,130 ratios, and what makes sense to look at those ratios. 1214 00:51:14,320 --> 00:51:18,110 So now what I want to do is to go through a couple of 1215 00:51:18,110 --> 00:51:22,310 examples and show what I can see in these 1216 00:51:22,310 --> 00:51:24,450 different program behaviors.
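[Editor's sketch] A quick piece of arithmetic shows why the raw count says little on its own. The miss count below echoes the 5.3 billion mentioned above, but the two branch totals are hypothetical runs invented for the comparison, and `ratio` is just a helper name.

```python
def ratio(events, total):
    """Events per unit of work, e.g. mispredicted branches per executed branch."""
    if total == 0:
        raise ValueError("total must be nonzero")
    return events / total

misses = 5_300_000_000              # the raw counter value

run_a_branches = 10_000_000_000     # hypothetical run A: 10 billion branches executed
run_b_branches = 1_000_000_000_000  # hypothetical run B: 1 trillion branches executed

print(f"run A: {ratio(misses, run_a_branches):.1%} of branches missed")
print(f"run B: {ratio(misses, run_b_branches):.2%} of branches missed")
```

The same counter value means more than half of all branches were missed in run A and about half a percent in run B, which is why the ratio is the number worth reading.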
1217 00:51:24,450 --> 00:51:26,090 Any questions so far? 1218 00:51:26,090 --> 00:51:31,500 Next lecture we are going to go through hands-on examples, 1219 00:51:31,500 --> 00:51:38,150 going through some of these profiling tools. 1220 00:51:38,150 --> 00:51:41,410 So it's kind of fun-- 1221 00:51:41,410 --> 00:51:44,280 what I did was finally discover what architecture is 1222 00:51:44,280 --> 00:51:48,170 doing, and what is the modern multicore is capable of doing. 1223 00:51:48,170 --> 00:51:50,180 In fact, some of these examples, so this set of 1224 00:51:50,180 --> 00:51:52,460 examples were done on a Core Two. 1225 00:51:52,460 --> 00:51:55,360 So some of the numbers might be very different in 1226 00:51:55,360 --> 00:51:58,600 [UNINTELLIGIBLE], because the architecture might do better 1227 00:51:58,600 --> 00:52:00,170 in some special prediction stuff. 1228 00:52:00,170 --> 00:52:00,600 I don't know. 1229 00:52:00,600 --> 00:52:02,160 It might be fun to move it it again. 1230 00:52:02,160 --> 00:52:05,030 So what I have is a program that I first created two 1231 00:52:05,030 --> 00:52:08,190 interesting arrays to access this array. 1232 00:52:08,190 --> 00:52:13,760 One array has numbers 1 to n, MAXA, is stored in this array. 1233 00:52:13,760 --> 00:52:18,660 This has 1 to MAXA in random order stored in this array. 1234 00:52:18,660 --> 00:52:24,490 So when I use this as a way to access the main array, and if 1235 00:52:24,490 --> 00:52:26,880 I use this in [UNINTELLIGIBLE] 1236 00:52:26,880 --> 00:52:29,140 0 to n, I will actually access A [UNINTELLIGIBLE] 1237 00:52:29,140 --> 00:52:29,570 0. 1238 00:52:29,570 --> 00:52:31,350 It's almost like saying AI. 1239 00:52:31,350 --> 00:52:32,270 But I use this here. 1240 00:52:32,270 --> 00:52:35,542 But if I use this one, I'm doing random [UNINTELLIGIBLE]. 1241 00:52:35,542 --> 00:52:38,260 Well, the first program I'm doing is just nothing but 1242 00:52:38,260 --> 00:52:40,120 going through this one. 
1243 00:52:40,120 --> 00:52:42,390 I will have an outer loop that will go through many, many times, 1244 00:52:42,390 --> 00:52:43,460 so I can actually [UNINTELLIGIBLE] 1245 00:52:43,460 --> 00:52:44,140 time. 1246 00:52:44,140 --> 00:52:45,280 So I just go through-- 1247 00:52:45,280 --> 00:52:47,930 just nothing but trying to just go through that. 1248 00:52:47,930 --> 00:52:50,650 Second thing I did, I just put a condition. 1249 00:52:50,650 --> 00:52:53,120 Say, j is less than MAXA half. 1250 00:52:53,120 --> 00:52:56,920 That means halfway through the program, I will go 1251 00:52:56,920 --> 00:52:57,550 and update this one. 1252 00:52:57,550 --> 00:53:01,150 Another half, I'm not doing anything. 1253 00:53:01,150 --> 00:53:02,770 Then I said, OK, look. 1254 00:53:02,770 --> 00:53:05,880 I divide by every fourth [UNINTELLIGIBLE]. 1255 00:53:05,880 --> 00:53:09,920 So what I do is j AND with 03 in there. 1256 00:53:09,920 --> 00:53:12,660 That means 3 means, in binary, 1257 00:53:12,660 --> 00:53:13,600 1 1. 1258 00:53:13,600 --> 00:53:14,540 So if this is 0. 1259 00:53:14,540 --> 00:53:16,600 That means the last two bits have to be 0. 1260 00:53:16,600 --> 00:53:19,680 That means every fourth element, I will update. 1261 00:53:19,680 --> 00:53:20,930 Otherwise I won't. 1262 00:53:22,710 --> 00:53:30,320 And the next thing I did was I updated-- 1263 00:53:30,320 --> 00:53:30,750 OK. 1264 00:53:30,750 --> 00:53:31,840 Let me ask you a question. 1265 00:53:31,840 --> 00:53:34,240 So this program, the output-- 1266 00:53:34,240 --> 00:53:38,560 that means what happened to A-- 1267 00:53:38,560 --> 00:53:41,200 the output of A is equivalent to some other program, one of 1268 00:53:41,200 --> 00:53:42,120 these three programs. 1269 00:53:42,120 --> 00:53:43,410 Which one is the output 1270 00:53:43,410 --> 00:53:44,660 equivalent? 1271 00:53:48,640 --> 00:53:49,890 See what you can make up.
1272 00:53:52,220 --> 00:53:56,290 So first three programs or each will have separate 1273 00:53:56,290 --> 00:53:57,260 different outputs. 1274 00:53:57,260 --> 00:54:01,860 A at A would be different if you run it. 1275 00:54:01,860 --> 00:54:04,960 Except the fourth program will produce the exact same results 1276 00:54:04,960 --> 00:54:07,380 as some other program, on the first three programs. 1277 00:54:07,380 --> 00:54:08,630 Which one? 1278 00:54:16,518 --> 00:54:18,040 Wake up, wake up, wake up! 1279 00:54:22,787 --> 00:54:24,480 I hear something. 1280 00:54:24,480 --> 00:54:25,960 I hear some-- 1281 00:54:25,960 --> 00:54:26,650 second one. 1282 00:54:26,650 --> 00:54:28,790 Somebody said the second one. 1283 00:54:28,790 --> 00:54:32,360 How many people agree with that? 1284 00:54:32,360 --> 00:54:33,610 How many people disagree? 1285 00:54:36,660 --> 00:54:38,060 OK. 1286 00:54:38,060 --> 00:54:39,515 This is not a statistical example. 1287 00:54:39,515 --> 00:54:42,090 It can be exact. 1288 00:54:42,090 --> 00:54:43,385 Why say the second one? 1289 00:54:43,385 --> 00:54:44,660 Second one is right. 1290 00:54:44,660 --> 00:54:48,500 So what that means is, in fact, instead of a j, I'm 1291 00:54:48,500 --> 00:54:49,740 using inc j. 1292 00:54:49,740 --> 00:54:54,580 Inc j is exactly j, because it goes, inc 0 is 0, inc 1 is 1, 1293 00:54:54,580 --> 00:54:55,480 inc 2 is 2. 1294 00:54:55,480 --> 00:54:56,640 I just created the same [UNINTELLIGIBLE] 1295 00:54:56,640 --> 00:54:57,620 array in here. 1296 00:54:57,620 --> 00:55:01,506 This is something to get that. 1297 00:55:01,506 --> 00:55:03,680 And finally, I am doing the [UNINTELLIGIBLE] 1298 00:55:03,680 --> 00:55:06,260 A basically from x. 1299 00:55:06,260 --> 00:55:09,150 So what that means, I'm using the same amount of data will 1300 00:55:09,150 --> 00:55:12,000 get updated, but it's very different order in there. 
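[Editor's sketch] The five test programs are on slides that the transcript does not reproduce, so the following is a guess at their shape, written to check the output-equivalence question rather than to measure speed. The array names `seq` and `rnd`, the tiny `MAXA`, the zero-based indexing, and the choice to always update `A[j]` are assumptions. In particular, the lecture's fifth program may index `A` through the shuffled array instead.

```python
import random

MAXA = 16  # tiny size for illustration; the lecture's value is not given

seq = list(range(MAXA))        # in-order index array: seq[j] == j
rnd = seq[:]                   # same values ...
random.Random(0).shuffle(rnd)  # ... stored in random order

def run(should_update):
    """Start from zeros and add 1 to A[j] wherever should_update(j) is true."""
    A = [0] * MAXA
    for j in range(MAXA):
        if should_update(j):
            A[j] += 1
    return A

p1 = run(lambda j: True)                # 1: update every element
p2 = run(lambda j: j < MAXA // 2)       # 2: update only the first half
p3 = run(lambda j: (j & 3) == 0)        # 3: update every fourth element
p4 = run(lambda j: seq[j] < MAXA // 2)  # 4: same test via the in-order index array
p5 = run(lambda j: rnd[j] < MAXA // 2)  # 5: same test via the shuffled index array

print(p4 == p2)
print(sum(p1), sum(p2), sum(p3), sum(p5))
```

Under these assumptions the fourth program produces exactly the second program's output, and the fifth performs the same number of updates as the second but its condition flips unpredictably from one iteration to the next. Python will not show the timing differences discussed in the lecture, since interpreter overhead dominates; reproducing those needs a compiled version and hardware counters.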
1301 00:55:12,000 --> 00:55:15,920 So now, before I go to the next slide-- 1302 00:55:15,920 --> 00:55:18,360 question? 1303 00:55:18,360 --> 00:55:22,518 AUDIENCE: Maybe you said this, but what is the A array? 1304 00:55:22,518 --> 00:55:23,590 SAMAN AMARASINGHE: Yeah, it's some numbers. 1305 00:55:23,590 --> 00:55:25,860 Some random set of numbers. 1306 00:55:25,860 --> 00:55:26,770 I don't care about that. 1307 00:55:26,770 --> 00:55:28,390 I just updated this one. 1308 00:55:28,390 --> 00:55:31,930 So in here, which program do we think run the faster? 1309 00:55:36,980 --> 00:55:39,070 So let me ask you this-- 1310 00:55:39,070 --> 00:55:44,710 which program updated the least amount of updates for A? 1311 00:55:54,380 --> 00:55:56,040 Third one, yeah, because it's only doing 1312 00:55:56,040 --> 00:55:57,900 every fourth element. 1313 00:55:57,900 --> 00:56:00,460 Which did the most? 1314 00:56:00,460 --> 00:56:03,440 First one is doing every element, and everybody else 1315 00:56:03,440 --> 00:56:05,280 will do probably half of the updates. 1316 00:56:05,280 --> 00:56:08,170 So knowing that, which program will run faster? 1317 00:56:11,120 --> 00:56:13,260 First one is updating everything. 1318 00:56:13,260 --> 00:56:14,440 [UNINTELLIGIBLE], OK. 1319 00:56:14,440 --> 00:56:14,940 [UNINTELLIGIBLE] 1320 00:56:14,940 --> 00:56:16,230 plus 1, OK. 1321 00:56:16,230 --> 00:56:17,480 Which program will run slow? 1322 00:56:21,220 --> 00:56:22,143 Last one. 1323 00:56:22,143 --> 00:56:23,020 You guys are onto something. 1324 00:56:23,020 --> 00:56:25,680 You have seen my slide, you actually know what the 1325 00:56:25,680 --> 00:56:26,910 hell is going on. 1326 00:56:26,910 --> 00:56:30,460 So what I show here is, you can't read this one, but I 1327 00:56:30,460 --> 00:56:30,880 [UNINTELLIGIBLE] 1328 00:56:30,880 --> 00:56:32,550 on time, and then I create a ratio. 
1329 00:56:32,550 --> 00:56:34,320 So if you look at the ratio-- 1330 00:56:34,320 --> 00:56:34,890 so aha. 1331 00:56:34,890 --> 00:56:38,770 The first program, I always normalize to that, runs at 1, and 1332 00:56:38,770 --> 00:56:41,760 that runs the fastest, if you tried, and the last program 1333 00:56:41,760 --> 00:56:43,620 ran the slowest. 1334 00:56:43,620 --> 00:56:52,830 And the interesting thing is, the [UNINTELLIGIBLE], the 1335 00:56:52,830 --> 00:56:57,410 other, the two, the second and fourth one, that wrote about 1336 00:56:57,410 --> 00:56:59,990 half [UNINTELLIGIBLE], basically ran about the same. 1337 00:56:59,990 --> 00:57:03,190 But the one that wrote every fourth element was actually 1338 00:57:03,190 --> 00:57:04,620 slower than the others. 1339 00:57:04,620 --> 00:57:06,810 Why is this going on? 1340 00:57:06,810 --> 00:57:10,790 So what you can do is, first look at how many instructions 1341 00:57:10,790 --> 00:57:14,430 would it add up, and how many instructions were executed? 1342 00:57:14,430 --> 00:57:18,710 So if you look at that, what you see is, the first one 1343 00:57:18,710 --> 00:57:20,820 executes the least amount of instructions, because it 1344 00:57:20,820 --> 00:57:21,980 doesn't have to do all this [UNINTELLIGIBLE] 1345 00:57:21,980 --> 00:57:24,320 when it's updating. 1346 00:57:24,320 --> 00:57:30,350 The interesting thing is, the one that wrote half has the 1347 00:57:30,350 --> 00:57:31,930 most amount of instructions. 1348 00:57:31,930 --> 00:57:36,510 One that wrote only 1/4 has a middle amount of instructions. 1349 00:57:36,510 --> 00:57:38,130 That kind of makes sense in here. 1350 00:57:38,130 --> 00:57:41,450 But this still doesn't explain the slowdown. 1351 00:57:41,450 --> 00:57:47,330 That means if this is the only case, the divide by 4 should 1352 00:57:47,330 --> 00:57:49,660 run faster than the other three. 1353 00:57:49,660 --> 00:57:51,120 That doesn't make sense.
1354 00:57:51,120 --> 00:57:53,320 And then what you look at, look at [UNINTELLIGIBLE] 1355 00:57:53,320 --> 00:57:56,650 branches in there, and if you look at the branches, 1356 00:57:56,650 --> 00:57:58,050 what you see is-- 1357 00:57:58,050 --> 00:58:00,230 Let me skip this one, actually. 1358 00:58:00,230 --> 00:58:01,210 I want to get this one. 1359 00:58:01,210 --> 00:58:03,870 This is called clocks per instruction. 1360 00:58:03,870 --> 00:58:07,900 What this is is, if the instructions ran slow, these 1361 00:58:07,900 --> 00:58:11,110 instructions will take a lot more clock cycles. 1362 00:58:11,110 --> 00:58:13,590 That means the instructions are not doing well. 1363 00:58:13,590 --> 00:58:15,210 That means they are stalling, they are 1364 00:58:15,210 --> 00:58:16,560 taking a lot more time. 1365 00:58:16,560 --> 00:58:18,720 Normally, we should be able to run [UNINTELLIGIBLE] 1366 00:58:18,720 --> 00:58:20,340 about four instruction [UNINTELLIGIBLE] cycles should 1367 00:58:20,340 --> 00:58:25,380 get a CPI of 0.25, but that's very rare, and if you look at 1368 00:58:25,380 --> 00:58:29,440 that, if I normalize, I get 1 in here normalized. 1369 00:58:29,440 --> 00:58:33,920 And then what you see is second and fourth test, almost 1370 00:58:33,920 --> 00:58:39,760 the same CPI, and the third is worse, and the last one is 1371 00:58:39,760 --> 00:58:40,620 even worse. 1372 00:58:40,620 --> 00:58:41,730 So why is that? 1373 00:58:41,730 --> 00:58:44,780 Why are those instructions doing badly? 1374 00:58:44,780 --> 00:58:46,980 And then you can go look at that misprediction 1375 00:58:46,980 --> 00:58:49,750 instruction, and this is exactly what happens. 1376 00:58:49,750 --> 00:58:51,030 Divide by 4. 1377 00:58:51,030 --> 00:58:52,210 This is a condition.
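The clocks-per-instruction figure described above is cycles divided by instructions executed, so four instructions per cycle gives the 0.25 quoted in the lecture. A minimal sketch of that arithmetic follows. The function names and the sample counter readings are made up for illustration; they are not the measurements from the slides.

```python
def cpi(cycles, instructions):
    """Clocks per instruction: average cycles spent per executed instruction."""
    return cycles / instructions


def normalized_cpi(cpis, baseline_index=0):
    """Scale every CPI so that the baseline program reads 1.0."""
    base = cpis[baseline_index]
    return [value / base for value in cpis]


# Ideal case mentioned in the lecture: four instructions complete every cycle.
ideal = cpi(cycles=1_000, instructions=4_000)

# Hypothetical (cycles, instructions) readings for three programs.
samples = [(2_000, 4_000), (3_000, 5_000), (6_000, 4_000)]
ratios = normalized_cpi([cpi(c, n) for c, n in samples])

print(ideal, ratios)
```

A normalized value above 1 means the program's instructions are, on average, stalling longer than the baseline's, which is the signal the lecture uses to go looking at branch mispredictions.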
1378 00:58:52,210 --> 00:58:55,900 The branch predictor couldn't predict it that well, because 1379 00:58:55,900 --> 00:58:56,390 [UNINTELLIGIBLE] 1380 00:58:56,390 --> 00:58:58,030 predicted in the first three, and it's [UNINTELLIGIBLE] 1381 00:58:58,030 --> 00:59:00,380 predicted, OK, after three, next one is also going one 1382 00:59:00,380 --> 00:59:01,920 direction, oops, it's in another direction. 1383 00:59:01,920 --> 00:59:03,670 OK, going first three predictions, going one 1384 00:59:03,670 --> 00:59:04,890 direction, knowing that information. 1385 00:59:04,890 --> 00:59:08,250 So divide by 4, looks really bad, then predict 3. 1386 00:59:08,250 --> 00:59:10,740 And then the random one was just completely off. 1387 00:59:10,740 --> 00:59:12,880 Random one, basically predictions of three 1388 00:59:12,880 --> 00:59:13,470 [INAUDIBLE]. 1389 00:59:13,470 --> 00:59:14,530 That's why this is. 1390 00:59:14,530 --> 00:59:18,480 So in fact, if you just want to calculate the run time, if 1391 00:59:18,480 --> 00:59:23,920 you think every misprediction costs you 21 cycles, you 1392 00:59:23,920 --> 00:59:26,130 kind of get the exact total first. 1393 00:59:26,130 --> 00:59:26,740 You can't help it. 1394 00:59:26,740 --> 00:59:30,300 So assume every instruction takes one cycle, 1395 00:59:30,300 --> 00:59:32,660 and every mispredicted instruction takes 21 cycles, 1396 00:59:32,660 --> 00:59:34,170 and you just calculate that. 1397 00:59:34,170 --> 00:59:35,950 And this [UNINTELLIGIBLE], I just came, took it out of a 1398 00:59:35,950 --> 00:59:38,370 hat, and then voila, I get a number that's very 1399 00:59:38,370 --> 00:59:39,290 close to run time. 1400 00:59:39,290 --> 00:59:44,460 So what that means is, I can kind of understand what the 1401 00:59:44,460 --> 00:59:46,700 behavior is; this explains what's going on. 1402 00:59:46,700 --> 00:59:48,830 So I can have a model in my head for what's going on.
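The back-of-the-envelope model just described (one cycle per ordinary instruction, 21 cycles per mispredicted one) can be sketched as below. The 21-cycle figure is the one quoted in the lecture; the instruction and misprediction counts are hypothetical placeholders, and the function names are chosen only for this illustration.

```python
NORMAL_CYCLES = 1       # assumed cost of an ordinary instruction
MISPREDICT_CYCLES = 21  # cost per mispredicted instruction, as quoted in the lecture


def estimated_cycles(instructions, mispredictions):
    """Estimate run time in cycles from two counter readings."""
    ordinary = instructions - mispredictions
    return ordinary * NORMAL_CYCLES + mispredictions * MISPREDICT_CYCLES


def relative_to_first(estimates):
    """Express each estimate as a ratio to the first program's estimate."""
    return [value / estimates[0] for value in estimates]


# Hypothetical (instructions, mispredictions) pairs, not the slide's data.
counters = [(1_000, 0), (1_500, 10), (1_200, 300)]
estimates = [estimated_cycles(n, m) for n, m in counters]

print(estimates, relative_to_first(estimates))
```

With these placeholder numbers the third program executes fewer instructions than the second yet has by far the largest estimate, which is the same shape of result the lecture attributes to misprediction.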
1403 00:59:48,830 --> 00:59:51,210 So it's all about misprediction in here. 1404 00:59:51,210 --> 00:59:52,840 I did two other experiments in here. 1405 00:59:52,840 --> 00:59:54,770 I have the slide sitting here, but I don't think I have too 1406 00:59:54,770 --> 00:59:55,840 much time to go. 1407 00:59:55,840 --> 01:00:00,240 I want to go to the numbers in there. 1408 01:00:00,240 --> 01:00:01,200 Are you ready, [UNINTELLIGIBLE]? 1409 01:00:01,200 --> 01:00:02,010 OK. 1410 01:00:02,010 --> 01:00:05,530 Why don't you go put your laptop in. 1411 01:00:05,530 --> 01:00:06,957 You have the dongle? 1412 01:00:06,957 --> 01:00:08,207 AUDIENCE: [INAUDIBLE] 1413 01:00:10,440 --> 01:00:12,820 SAMAN AMARASINGHE: OK. 1414 01:00:12,820 --> 01:00:14,310 So voila. 1415 01:00:14,310 --> 01:00:19,100 So something you guys waited all this time, and were up all 1416 01:00:19,100 --> 01:00:21,030 these days developing. 1417 01:00:21,030 --> 01:00:23,830 So let's see how everybody did. 1418 01:00:23,830 --> 01:00:25,630 OK, without further ado. 1419 01:00:25,630 --> 01:00:26,860 I haven't even seen these numbers. 1420 01:00:26,860 --> 01:00:29,260 OK, what happened? 1421 01:00:29,260 --> 01:00:30,040 He's trying to-- 1422 01:00:30,040 --> 01:00:30,520 OK, good. 1423 01:00:30,520 --> 01:00:32,730 This is [UNINTELLIGIBLE] 1424 01:00:32,730 --> 01:00:33,606 screen, OK. 1425 01:00:33,606 --> 01:00:34,518 Do you have it? 1426 01:00:34,518 --> 01:00:35,430 OK. 1427 01:00:35,430 --> 01:00:38,334 AUDIENCE: OK. 1428 01:00:38,334 --> 01:00:40,660 Something to keep in mind is, we haven't actually 1429 01:00:40,660 --> 01:00:42,735 investigated what's causing people to have build failures 1430 01:00:42,735 --> 01:00:43,760 or crashes yet. 1431 01:00:43,760 --> 01:00:45,435 So it could be your fault, it could be ours. 1432 01:00:45,435 --> 01:00:47,240 We'll figure that out. 
1433 01:00:47,240 --> 01:00:49,380 SAMAN AMARASINGHE: So is smaller better or higher 1434 01:00:49,380 --> 01:00:50,000 better, what's-- 1435 01:00:50,000 --> 01:00:50,376 AUDIENCE: [INAUDIBLE] better. 1436 01:00:50,376 --> 01:00:50,740 This is run time. 1437 01:00:50,740 --> 01:00:51,670 That 1438 01:00:51,670 --> 01:00:52,920 SAMAN AMARASINGHE: This is really good! 1439 01:00:55,850 --> 01:00:56,990 This is actually-- 1440 01:00:56,990 --> 01:00:57,980 AUDIENCE: [INAUDIBLE] 1441 01:00:57,980 --> 01:01:02,650 The baseline is the flat code that you see in the middle. 1442 01:01:02,650 --> 01:01:03,690 SAMAN AMARASINGHE: So this is really good. 1443 01:01:03,690 --> 01:01:07,210 That means there were kind of two groups. 1444 01:01:07,210 --> 01:01:09,240 Most of the people got everything right, and they're 1445 01:01:09,240 --> 01:01:10,150 now at the bottom. 1446 01:01:10,150 --> 01:01:13,600 And there's another group that missed something, and it's in 1447 01:01:13,600 --> 01:01:15,126 the second camp. 1448 01:01:15,126 --> 01:01:17,680 AUDIENCE: I think some of these projects are from people 1449 01:01:17,680 --> 01:01:19,630 who dropped the class at some point. 1450 01:01:19,630 --> 01:01:21,440 So there's probably [INAUDIBLE] 1451 01:01:21,440 --> 01:01:24,590 might not be actually representative of the class. 1452 01:01:24,590 --> 01:01:26,420 SAMAN AMARASINGHE: So last year, we got like an almost 1453 01:01:26,420 --> 01:01:27,980 exponential curve, actually. 1454 01:01:27,980 --> 01:01:29,600 Last year, what happened was there were a couple of people 1455 01:01:29,600 --> 01:01:32,150 at the bottom, and everybody worked at the top, basically. 1456 01:01:32,150 --> 01:01:32,900 Which is actually good. 1457 01:01:32,900 --> 01:01:34,490 So you guys actually figured out what's going on. 1458 01:01:34,490 --> 01:01:38,290 AUDIENCE: So next one is even more interesting. 1459 01:01:38,290 --> 01:01:38,713 SAMAN AMARASINGHE: Wow!
1460 01:01:38,713 --> 01:01:39,560 AUDIENCE: So this is [INAUDIBLE] 1461 01:01:39,560 --> 01:01:40,780 that we gave you guys. 1462 01:01:40,780 --> 01:01:43,700 Most of you optimized it down to 0 seconds. 1463 01:01:43,700 --> 01:01:47,290 So I coded in a harder test case, and here's 1464 01:01:47,290 --> 01:01:48,430 the results on that. 1465 01:01:48,430 --> 01:01:51,195 Still pretty good. 1466 01:01:51,195 --> 01:01:53,144 [INAUDIBLE] 1467 01:01:53,144 --> 01:01:55,640 There's some very specific optimizations [INAUDIBLE]. 1468 01:01:55,640 --> 01:01:56,370 Yeah. 1469 01:01:56,370 --> 01:01:57,980 SAMAN AMARASINGHE: So exactly. 1470 01:01:57,980 --> 01:02:00,140 There are people who are climbing up, I don't know 1471 01:02:00,140 --> 01:02:02,890 exactly what, but it's very clear that [UNINTELLIGIBLE], 1472 01:02:02,890 --> 01:02:04,910 probably data representation tag that 1473 01:02:04,910 --> 01:02:06,750 people missed in here. 1474 01:02:06,750 --> 01:02:08,130 Which is really now interesting. 1475 01:02:08,130 --> 01:02:08,500 Yes. 1476 01:02:08,500 --> 01:02:11,490 AUDIENCE: And for Pentominos, this is very incomplete, 1477 01:02:11,490 --> 01:02:14,330 because the set of tests that we wanted to run are still 1478 01:02:14,330 --> 01:02:15,370 currently running. 1479 01:02:15,370 --> 01:02:18,190 So I just picked a random test case, just to give everybody 1480 01:02:18,190 --> 01:02:22,200 an idea of how everybody did, and set the time out very low 1481 01:02:22,200 --> 01:02:24,560 so that I could actually finish it during this lecture. 1482 01:02:24,560 --> 01:02:26,810 And what's [INAUDIBLE]? 1483 01:02:26,810 --> 01:02:27,110 Yes. 1484 01:02:27,110 --> 01:02:29,230 Somebody solved it instantaneously, and it 1485 01:02:29,230 --> 01:02:31,810 searches for the first 5,000 solutions to some random 1486 01:02:31,810 --> 01:02:33,520 puzzle that I pulled out. 1487 01:02:33,520 --> 01:02:37,146 SAMAN AMARASINGHE: So [UNINTELLIGIBLE]? 
1488 01:02:37,146 --> 01:02:38,700 AUDIENCE: That I haven't verified yet. 1489 01:02:38,700 --> 01:02:39,070 SAMAN AMARASINGHE: Aha. 1490 01:02:39,070 --> 01:02:41,890 So it might be, there might be just the scratchings, so we 1491 01:02:41,890 --> 01:02:46,790 don't know that, and then that could be correct answer, but-- 1492 01:02:46,790 --> 01:02:48,750 AUDIENCE: Which one is the baseline? 1493 01:02:48,750 --> 01:02:53,180 The baseline is actually the 90-second mark. 1494 01:02:53,180 --> 01:02:55,610 [INAUDIBLE] 1495 01:02:55,610 --> 01:02:56,570 SAMAN AMARASINGHE: [UNINTELLIGIBLE] 1496 01:02:56,570 --> 01:02:57,715 the baseline? 1497 01:02:57,715 --> 01:02:58,170 AUDIENCE: No. 1498 01:02:58,170 --> 01:03:00,190 10 seconds and above are either timeout or [INAUDIBLE]. 1499 01:03:00,190 --> 01:03:02,248 SAMAN AMARASINGHE: Oh, they're probably mostly timeouts. 1500 01:03:02,248 --> 01:03:05,240 AUDIENCE: [INAUDIBLE] 1501 01:03:05,240 --> 01:03:06,490 SAMAN AMARASINGHE: Yes. 1502 01:03:09,180 --> 01:03:09,840 So there you are. 1503 01:03:09,840 --> 01:03:12,200 This one, people would have some work to do, I guess, to 1504 01:03:12,200 --> 01:03:13,860 get the performance down. 1505 01:03:13,860 --> 01:03:15,770 But this is pretty good. 1506 01:03:15,770 --> 01:03:18,210 This is not like [UNINTELLIGIBLE]. 1507 01:03:18,210 --> 01:03:20,320 I mean, last couple of years, we had people, 1508 01:03:20,320 --> 01:03:21,000 [UNINTELLIGIBLE] 1509 01:03:21,000 --> 01:03:22,580 someone who was 1,000 [UNINTELLIGIBLE], you had to 1510 01:03:22,580 --> 01:03:25,460 plot it in a log thing. 1511 01:03:25,460 --> 01:03:29,160 So in fact, if I do a square root within the grid, this is 1512 01:03:29,160 --> 01:03:30,980 pretty nice. 1513 01:03:30,980 --> 01:03:32,530 So [UNINTELLIGIBLE] 1514 01:03:32,530 --> 01:03:33,430 is not that bad. 1515 01:03:33,430 --> 01:03:34,860 So this is great. 
1516 01:03:34,860 --> 01:03:36,060 This is where you get [UNINTELLIGIBLE] 1517 01:03:36,060 --> 01:03:40,610 performance-wise. And if you are at the bottom, you can go 1518 01:03:40,610 --> 01:03:41,450 have a beer. 1519 01:03:41,450 --> 01:03:45,340 Otherwise, you might have to go back and figure out what 1520 01:03:45,340 --> 01:03:46,330 you missed. 1521 01:03:46,330 --> 01:03:48,770 AUDIENCE: [INAUDIBLE] 1522 01:03:48,770 --> 01:03:49,590 SAMAN AMARASINGHE: Oh, if you're [UNINTELLIGIBLE]. 1523 01:03:49,590 --> 01:03:51,000 Yes, exactly, yes. 1524 01:03:51,000 --> 01:03:52,250 I should say that.