1 00:00:00,570 --> 00:00:03,000 The following content is provided under a Creative 2 00:00:03,000 --> 00:00:04,410 Commons license. 3 00:00:04,410 --> 00:00:07,450 Your support will help MIT OpenCourseWare continue to 4 00:00:07,450 --> 00:00:11,100 offer high quality educational resources for free. 5 00:00:11,100 --> 00:00:14,000 To make a donation or view additional materials from 6 00:00:14,000 --> 00:00:17,930 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,930 --> 00:00:19,180 ocw.mit.edu. 8 00:00:23,290 --> 00:00:28,020 PROFESSOR: So hopefully by the end of the class, we will show 9 00:00:28,020 --> 00:00:31,350 you a histogram for the quiz. 10 00:00:31,350 --> 00:00:32,930 We are very happy. 11 00:00:32,930 --> 00:00:36,450 You guys did really well, so we feel like, actually, you 12 00:00:36,450 --> 00:00:38,290 learned something in the quiz, so it makes us happy. 13 00:00:41,820 --> 00:00:43,730 We're hoping to have the histogram ready 14 00:00:43,730 --> 00:00:45,160 by now, but we don't. 15 00:00:45,160 --> 00:00:47,350 By end of the class, hopefully we take a break in a middle, 16 00:00:47,350 --> 00:00:48,660 and go through rest of that. 17 00:00:48,660 --> 00:00:51,610 So we are going a little bit off from regular programming. 18 00:00:51,610 --> 00:00:53,920 If you look at the class schedule, we are going to have 19 00:00:53,920 --> 00:00:57,540 a guest lecture today, but I am the guest lecture, I guess. 20 00:00:57,540 --> 00:01:03,630 So I'm going to talk about what compilers can and cannot 21 00:01:03,630 --> 00:01:07,740 do because, as you went on last couple of projects, you 22 00:01:07,740 --> 00:01:10,920 tried hard to do weird things out of the compiler, create 23 00:01:10,920 --> 00:01:13,740 different piece of code, [UNINTELLIGIBLE] 24 00:01:13,740 --> 00:01:14,640 programs. 
25 00:01:14,640 --> 00:01:19,430 And so I'm going to kind of, first, walk through some stuff 26 00:01:19,430 --> 00:01:23,110 that is very practical, what the current 27 00:01:23,110 --> 00:01:24,980 GCC does or doesn't do. 28 00:01:24,980 --> 00:01:27,550 OK, so we'll go through some stuff, which is interesting, 29 00:01:27,550 --> 00:01:29,570 and then I'm going to-- 30 00:01:29,570 --> 00:01:31,480 let's get the outline first. 31 00:01:31,480 --> 00:01:32,730 OK, this doesn't work. 32 00:01:40,020 --> 00:01:42,640 OK. 33 00:01:42,640 --> 00:01:46,710 Then we will talk a little more about where the normal 34 00:01:46,710 --> 00:01:50,180 optimizations happen, what are all the possibilities, just 35 00:01:50,180 --> 00:01:51,830 very quickly, to give you a feel. 36 00:01:51,830 --> 00:01:54,250 It's not all static type compilation. 37 00:01:54,250 --> 00:01:55,830 And then go through two things. 38 00:01:55,830 --> 00:01:58,090 One is data-flow analysis and optimization. 39 00:01:58,090 --> 00:02:00,380 That's a big part of what compilers do. 40 00:02:00,380 --> 00:02:02,660 And then also instruction scheduling. 41 00:02:02,660 --> 00:02:04,840 Instruction scheduling might not be that important on a 42 00:02:04,840 --> 00:02:05,860 superscalar, 43 00:02:05,860 --> 00:02:08,039 but it's always good to know what it does because there are 44 00:02:08,039 --> 00:02:10,699 some cases where you have to do it in the compiler that the 45 00:02:10,699 --> 00:02:15,880 hardware cannot do, so we'll look at some of those cases. 46 00:02:15,880 --> 00:02:23,270 So the first thing that we found a lot of you guys did is 47 00:02:23,270 --> 00:02:29,700 you tried to inline a lot of code in your projects, and in 48 00:02:29,700 --> 00:02:32,400 fact, one way to do that is to use a macro. 49 00:02:32,400 --> 00:02:36,500 A macro basically makes sure that it definitely gets inlined, 50 00:02:36,500 --> 00:02:38,450 but macros are ugly. 
51 00:02:38,450 --> 00:02:41,120 There are many things you can't really do with a macro. 52 00:02:41,120 --> 00:02:45,460 The other thing is, we can actually do a simple max 53 00:02:45,460 --> 00:02:47,390 calculation using a function. 54 00:02:47,390 --> 00:02:48,800 OK, a static function defined. 55 00:02:48,800 --> 00:02:52,690 And the first one is calling this one. 56 00:02:52,690 --> 00:02:54,580 Second one is calling this one. 57 00:02:54,580 --> 00:02:58,790 So what do you think this is going to do? 58 00:02:58,790 --> 00:03:02,680 How many people think that the code produced between this and 59 00:03:02,680 --> 00:03:04,045 this is going to be drastically different? 60 00:03:07,530 --> 00:03:10,040 Code produced for this function versus this function 61 00:03:10,040 --> 00:03:11,290 is going to be drastically different? 62 00:03:15,240 --> 00:03:17,180 Real different. 63 00:03:17,180 --> 00:03:18,700 For some people, [UNINTELLIGIBLE] 64 00:03:18,700 --> 00:03:19,810 going to be different. 65 00:03:19,810 --> 00:03:24,000 OK, why do you think it's different? 66 00:03:24,000 --> 00:03:25,796 GUEST SPEAKER: So the second function calls the first 67 00:03:25,796 --> 00:03:27,430 function, and the first [INAUDIBLE] 68 00:03:27,430 --> 00:03:28,410 calls the second one. 69 00:03:28,410 --> 00:03:29,660 PROFESSOR: Yes. 70 00:03:36,950 --> 00:03:41,360 AUDIENCE: So since you have the first one calling max1, 71 00:03:41,360 --> 00:03:44,790 which is a macro, the compiler just copied [UNINTELLIGIBLE] 72 00:03:44,790 --> 00:03:46,260 out of there. 73 00:03:46,260 --> 00:03:47,730 Whereas the second one-- 74 00:03:52,630 --> 00:03:54,980 it will try to actually optimize the-- 75 00:03:54,980 --> 00:04:00,540 Won't it try to optimize what is inside uint64 [INAUDIBLE]? 76 00:04:00,540 --> 00:04:01,764 PROFESSOR: So what he thinks is that this is just 77 00:04:01,764 --> 00:04:02,600 going to get copied. 
78 00:04:02,600 --> 00:04:05,140 This will be still a function call, try and do some 79 00:04:05,140 --> 00:04:06,190 optimizing. 80 00:04:06,190 --> 00:04:10,720 In fact, what it will do is, the first function is going to 81 00:04:10,720 --> 00:04:11,970 get inlined. 82 00:04:11,970 --> 00:04:14,570 Something interesting here, what you see is even though 83 00:04:14,570 --> 00:04:17,110 there's a condition here, there's no branch. 84 00:04:17,110 --> 00:04:20,760 They have the same old instruction that basically can 85 00:04:20,760 --> 00:04:22,610 be used to do a conditional [UNINTELLIGIBLE]. 86 00:04:22,610 --> 00:04:25,310 So instead of having a branch and having a pipeline 87 00:04:25,310 --> 00:04:28,180 [UNINTELLIGIBLE], this managed to convert this into nice, 88 00:04:28,180 --> 00:04:29,240 direct call. 89 00:04:29,240 --> 00:04:32,370 The interesting thing is the second one is also identical, 90 00:04:32,370 --> 00:04:34,870 so what it did was it said, hey, I know this function, I 91 00:04:34,870 --> 00:04:35,550 can inline it. 92 00:04:35,550 --> 00:04:36,910 It got inlined automatically. 93 00:04:36,910 --> 00:04:39,650 You didn't tell it to do that; the compiler actually inlined it 94 00:04:39,650 --> 00:04:43,030 for you, and then did all the things that are necessary. 95 00:04:43,030 --> 00:04:45,620 So what that means is you don't have to write some of 96 00:04:45,620 --> 00:04:48,430 these ugly macros and hand-inline. 97 00:04:48,430 --> 00:04:51,850 You can have nice functions in there, especially if it's in 98 00:04:51,850 --> 00:04:53,560 the same file, it can get inlined. 99 00:04:53,560 --> 00:04:57,000 Of course, if you define it in a different file, and this file 100 00:04:57,000 --> 00:04:58,670 doesn't have access to it, it won't. 101 00:04:58,670 --> 00:05:02,880 But if it is the same file, it'll get inlined, so you get 102 00:05:02,880 --> 00:05:03,790 the same result. 103 00:05:03,790 --> 00:05:06,170 So you can still have this nice [UNINTELLIGIBLE]. 
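The slide code is not part of the transcript, so the following is only a minimal sketch of the comparison being described. The names `MAX_MACRO`, `max_fn`, `test_macro`, and `test_fn` are hypothetical; the `static` qualifier follows the professor's later remark that he declared the function static.

```c
#include <stdint.h>

/* Macro version: the preprocessor pastes the expression into the caller. */
#define MAX_MACRO(x, y) ((x) > (y) ? (x) : (y))

/* Function version: static keeps the name private to this file, which also
   makes it an easy inlining candidate for the compiler. */
static uint64_t max_fn(uint64_t x, uint64_t y)
{
    return x > y ? x : y;
}

/* First caller: uses the macro. */
uint64_t test_macro(uint64_t a, uint64_t b)
{
    return MAX_MACRO(a, b);
}

/* Second caller: uses the ordinary function. */
uint64_t test_fn(uint64_t a, uint64_t b)
{
    return max_fn(a, b);
}
```

One way to check the claim is to compile with `gcc -O2 -S` and compare the assembly emitted for the two callers. Whether they come out identical depends on the compiler version and flags.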
104 00:05:06,170 --> 00:05:10,950 You don't have to be with optimizations at that level. 105 00:05:10,950 --> 00:05:13,090 So another thing is we have this entire-- 106 00:05:13,090 --> 00:05:13,590 Question? 107 00:05:13,590 --> 00:05:15,569 AUDIENCE: On the previous slide, other than looking at 108 00:05:15,569 --> 00:05:18,034 the assembly code, how do we know when the 109 00:05:18,034 --> 00:05:20,500 compiler based this on? 110 00:05:20,500 --> 00:05:21,750 PROFESSOR: You've got the assembly code. 111 00:05:24,974 --> 00:05:25,468 Question? 112 00:05:25,468 --> 00:05:29,420 AUDIENCE: Why do you prefer static converge versus inline? 113 00:05:29,420 --> 00:05:33,070 PROFESSOR: So the reason I did static is basically, I want to 114 00:05:33,070 --> 00:05:36,470 make sure it's not visible outside the file, so if you 115 00:05:36,470 --> 00:05:39,230 don't make it static, what'll happen is everybody else had 116 00:05:39,230 --> 00:05:42,050 the visibility and then you kind of pollute the space with 117 00:05:42,050 --> 00:05:43,110 a lot of names. 118 00:05:43,110 --> 00:05:45,200 So if you give static, it's only within you. 119 00:05:45,200 --> 00:05:48,660 If I had inline I can't ask it to forcefully do that, but 120 00:05:48,660 --> 00:05:50,275 what I'm showing is you don't even have to say inline. 121 00:05:50,275 --> 00:05:51,165 It'll inline by itself. 122 00:05:51,165 --> 00:05:52,197 Question? 123 00:05:52,197 --> 00:05:53,447 AUDIENCE: [INAUDIBLE] 124 00:05:55,536 --> 00:05:57,921 this function to be inline without actually [INAUDIBLE]? 125 00:06:01,250 --> 00:06:03,950 PROFESSOR: Actually, gprof, what'll happen is the file 126 00:06:03,950 --> 00:06:05,355 will vanish from gprof, isn't it? 127 00:06:08,320 --> 00:06:10,495 The function will vanish because gprof will basically, 128 00:06:10,495 --> 00:06:11,980 if you have a function, you go look, you 129 00:06:11,980 --> 00:06:13,090 don't see the function. 
130 00:06:13,090 --> 00:06:14,950 And it might give, even, a bad impression that OK, that 131 00:06:14,950 --> 00:06:18,115 function is not important, so that you had to be careful in 132 00:06:18,115 --> 00:06:20,165 that because when you look at gprof, you 133 00:06:20,165 --> 00:06:21,270 will see some files. 134 00:06:21,270 --> 00:06:24,160 The inline find some functions, inline function 135 00:06:24,160 --> 00:06:26,490 won't be there. 136 00:06:26,490 --> 00:06:28,790 Am I right, or is it doing anything 137 00:06:28,790 --> 00:06:29,760 interesting to the samples? 138 00:06:29,760 --> 00:06:30,510 No. 139 00:06:30,510 --> 00:06:31,690 [UNINTELLIGIBLE] 140 00:06:31,690 --> 00:06:33,690 vanish and oh, yeah, that function is not important, but 141 00:06:33,690 --> 00:06:36,990 in fact, it might be really important, but it got inlined. 142 00:06:36,990 --> 00:06:40,330 So that's one way to, a little bit, worry about. 143 00:06:40,330 --> 00:06:42,840 Does it make sense? 144 00:06:42,840 --> 00:06:44,700 OK. 145 00:06:44,700 --> 00:06:46,200 So we learned bithacks. 146 00:06:46,200 --> 00:06:47,270 So it was really fun. 147 00:06:47,270 --> 00:06:50,640 We are learning all these interesting bithacks, but the 148 00:06:50,640 --> 00:06:53,450 interesting thing is, in fact, GCC compiler also knows a lot 149 00:06:53,450 --> 00:06:56,470 of bithacks, so it's also a pretty smart compiler. 150 00:06:56,470 --> 00:06:58,895 So if you have something like this, what do you think the 151 00:06:58,895 --> 00:07:00,783 smd would be? 152 00:07:00,783 --> 00:07:02,196 AUDIENCE: [INAUDIBLE]. 153 00:07:02,196 --> 00:07:04,140 PROFESSOR: Yeah, it actually did this one. 154 00:07:04,140 --> 00:07:07,700 This is very interesting because what it did was it 155 00:07:07,700 --> 00:07:08,770 didn't do shift. 156 00:07:08,770 --> 00:07:14,460 It did this load effective address quad instruction, leaq 157 00:07:14,460 --> 00:07:15,170 instruction. 
158 00:07:15,170 --> 00:07:18,800 What it does is multiply these two, and it adds. 159 00:07:18,800 --> 00:07:21,570 There is nothing in here to add, and then it can add a 160 00:07:21,570 --> 00:07:22,120 constant. 161 00:07:22,120 --> 00:07:26,830 So this is a very complex addressing mode that is being used for 162 00:07:26,830 --> 00:07:28,370 a completely different purpose. 163 00:07:28,370 --> 00:07:30,340 So this [UNINTELLIGIBLE] 164 00:07:30,340 --> 00:07:30,660 address. 165 00:07:30,660 --> 00:07:33,440 It's doing this multiplied by this, there's nothing to add, 166 00:07:33,440 --> 00:07:35,240 there's no offset to add, save it here. 167 00:07:39,190 --> 00:07:41,280 So it's actually doing this multiply-by. 168 00:07:41,280 --> 00:07:45,020 The nice thing is it doesn't affect any things like 169 00:07:45,020 --> 00:07:47,560 condition codes and stuff like that, so this is a 170 00:07:47,560 --> 00:07:49,100 fast thing to do. 171 00:07:49,100 --> 00:07:50,500 OK? 172 00:07:50,500 --> 00:07:51,310 Actually, that's interesting. 173 00:07:51,310 --> 00:07:52,060 How would [UNINTELLIGIBLE] 174 00:07:52,060 --> 00:07:53,960 43? 175 00:07:53,960 --> 00:07:55,810 That makes it a little bit more complicated. 176 00:07:55,810 --> 00:07:57,200 What do you think? 177 00:07:57,200 --> 00:07:58,725 How to multiply something by 43? 178 00:08:03,379 --> 00:08:04,980 Anybody want to take a guess [UNINTELLIGIBLE] 179 00:08:04,980 --> 00:08:06,070 multiply 43 [UNINTELLIGIBLE] 180 00:08:06,070 --> 00:08:07,870 mul43? 181 00:08:07,870 --> 00:08:08,820 AUDIENCE: [INAUDIBLE] 182 00:08:08,820 --> 00:08:13,570 multiply by 32, have that multiplied by-- 183 00:08:13,570 --> 00:08:15,820 PROFESSOR: So it did something interesting. 184 00:08:15,820 --> 00:08:17,350 So here's what it did. 
185 00:08:17,350 --> 00:08:19,750 I want you guys to stare at it a little bit and see if you 186 00:08:19,750 --> 00:08:22,740 can even figure out what's going on here because this is 187 00:08:22,740 --> 00:08:23,990 kind of funky. 188 00:08:31,120 --> 00:08:35,090 So what happens after the first leaq? 189 00:08:35,090 --> 00:08:36,340 What's in rax? 190 00:08:43,789 --> 00:08:47,890 What's in rax after the first leaq instruction? 191 00:08:47,890 --> 00:08:49,140 Anyone take a wild guess? 192 00:08:53,276 --> 00:08:54,230 AUDIENCE: Five? 193 00:08:54,230 --> 00:08:55,170 PROFESSOR: Five. 194 00:08:55,170 --> 00:08:55,790 Yes, exactly. 195 00:08:55,790 --> 00:08:58,920 What it does is it's 4a+a, because this 196 00:08:58,920 --> 00:09:01,490 is 4 times a, plus a: 5a in here. 197 00:09:01,490 --> 00:09:03,640 And then after here, what happened? 198 00:09:13,320 --> 00:09:16,170 4 times this one plus this one. 199 00:09:21,090 --> 00:09:25,125 Yeah, it's 21 because 4 times this is 20, and you add the 200 00:09:25,125 --> 00:09:28,560 original a, so 21 here, 21. 201 00:09:28,560 --> 00:09:32,600 And then you do this times 2 again, 42, and add 202 00:09:32,600 --> 00:09:34,280 another one, 43. 203 00:09:34,280 --> 00:09:37,190 So it did all this interesting three 204 00:09:37,190 --> 00:09:39,900 instructions to get to 43. 205 00:09:39,900 --> 00:09:41,380 And so this is fun. 206 00:09:41,380 --> 00:09:43,430 You can spend days giving different numbers to this and 207 00:09:43,430 --> 00:09:46,100 see what the compiler generates. 208 00:09:46,100 --> 00:09:50,310 And I wondered at some point, say, when does it give up? 209 00:09:50,310 --> 00:09:51,530 And it doesn't give up for a while. 210 00:09:51,530 --> 00:09:53,320 It starts creating some crazy things. 211 00:09:53,320 --> 00:09:54,940 OK, try one more thing. 212 00:09:54,940 --> 00:09:57,500 OK, this has to be hard. 213 00:09:57,500 --> 00:09:59,180 254. 214 00:09:59,180 --> 00:10:00,370 How did it get 254? 
215 00:10:00,370 --> 00:10:02,590 There's no easy way to do that. 216 00:10:02,590 --> 00:10:03,370 How did it do that? 217 00:10:03,370 --> 00:10:05,855 This is the very interesting thing in here, so I will show 218 00:10:05,855 --> 00:10:06,410 you this one. 219 00:10:06,410 --> 00:10:08,330 So here are the instructions it generated. 220 00:10:14,670 --> 00:10:20,420 So what it did was, it's doing here, getting 2a, by doing 221 00:10:20,420 --> 00:10:24,080 leaq because since it didn't give a multiplier, it just 222 00:10:24,080 --> 00:10:26,070 multiplied by 1, so it's just adding these two. 223 00:10:26,070 --> 00:10:28,310 You get 2a. 224 00:10:28,310 --> 00:10:35,900 And then it multiplied by 128 by doing a bitshift here, and 225 00:10:35,900 --> 00:10:37,020 then it got such a wonderful thing. 226 00:10:37,020 --> 00:10:39,950 Why did it do this instead of doing one instruction to 256? 227 00:10:43,528 --> 00:10:45,440 AUDIENCE: [INAUDIBLE]. 228 00:10:45,440 --> 00:10:47,910 PROFESSOR: Yeah, then it went and subtracted 2a 229 00:10:47,910 --> 00:10:50,560 again, and got 254. 230 00:10:50,560 --> 00:10:53,345 So it went overshoot, subtract, and subtract one 231 00:10:53,345 --> 00:10:53,980 more, so [UNINTELLIGIBLE] 232 00:10:53,980 --> 00:10:55,210 got calculated here. 233 00:10:55,210 --> 00:10:58,280 It does it in a really smart way, I don't know what complex logic 234 00:10:58,280 --> 00:11:01,710 is there, to basically figure out these combinations of 235 00:11:01,710 --> 00:11:04,620 instructions that can do multiplication. 236 00:11:04,620 --> 00:11:07,160 We planned on a bunch of things until we ran into a 237 00:11:07,160 --> 00:11:07,840 couple of thousand. 238 00:11:07,840 --> 00:11:09,450 It was doing these weird patterns 239 00:11:09,450 --> 00:11:11,050 after a couple of thousand. 240 00:11:11,050 --> 00:11:15,480 Especially if it is close to a power of 2, 241 00:11:15,480 --> 00:11:16,790 it easily finds it. 
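The two decompositions walked through above can be written out in C to check the arithmetic. This is only a sketch of the arithmetic steps, not the compiler's output; the function names `mul43_steps` and `mul254_steps` are hypothetical. Each line mirrors one instruction: a leaq-style step computes scale * x + y with a scale of 1, 2, 4, or 8.

```c
#include <stdint.h>

/* 43 * a in three leaq-style steps. */
uint64_t mul43_steps(uint64_t a)
{
    uint64_t t = 4 * a + a;   /* 5a  */
    t = 4 * t + a;            /* 21a */
    t = 2 * t + a;            /* 43a */
    return t;
}

/* 254 * a by overshooting to 256a and subtracting 2a. */
uint64_t mul254_steps(uint64_t a)
{
    uint64_t two_a = a + a;    /* 2a, the leaq with scale 1 */
    uint64_t t = two_a << 7;   /* 2a * 128 = 256a */
    return t - two_a;          /* 256a - 2a = 254a */
}
```

Because the arithmetic is unsigned, both functions agree with a plain multiply even when the result wraps around.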
242 00:11:16,790 --> 00:11:18,910 Even primes, it managed to find. 243 00:11:18,910 --> 00:11:21,310 Actually, before we gave a prime, and it found it because 244 00:11:21,310 --> 00:11:23,440 it found the closest thing, and a couple of things 245 00:11:23,440 --> 00:11:26,140 [UNINTELLIGIBLE], and you can get to the prime. 246 00:11:26,140 --> 00:11:31,400 So it's kind of interesting how find goes in just 247 00:11:31,400 --> 00:11:32,650 [UNINTELLIGIBLE] multiplies. 248 00:11:34,375 --> 00:11:35,625 AUDIENCE: [INAUDIBLE]? 249 00:11:38,780 --> 00:11:40,880 PROFESSOR: So this is a very interesting question. 250 00:11:40,880 --> 00:11:44,300 So what it's doing is, it has realized somehow that doing a 251 00:11:44,300 --> 00:11:50,180 direct multiply by 254 is going to be slower, so the 252 00:11:50,180 --> 00:11:51,200 multiply instruction-- 253 00:11:51,200 --> 00:11:54,110 if you go, I think you can also look at multiply 254 00:11:54,110 --> 00:11:57,010 instruction, how many cycles it's going to take to come-- 255 00:11:57,010 --> 00:11:58,260 AUDIENCE: [INAUDIBLE]? 256 00:12:00,320 --> 00:12:04,470 PROFESSOR: Oh, because it needs to get this 2a again. 257 00:12:04,470 --> 00:12:07,470 So it's [UNINTELLIGIBLE], use the two 2a. 258 00:12:07,470 --> 00:12:11,340 So by calculating that, it kept it. 259 00:12:11,340 --> 00:12:13,080 I don't know, you might be right, because if it 260 00:12:13,080 --> 00:12:13,200 [UNINTELLIGIBLE] 261 00:12:13,200 --> 00:12:16,000 2a to 6n2, then there's no dependency, and right now, 262 00:12:16,000 --> 00:12:17,365 there's a dependent change here. 263 00:12:17,365 --> 00:12:18,615 AUDIENCE: [INAUDIBLE]? 264 00:12:22,680 --> 00:12:23,800 PROFESSOR: Twice, yes, something like that. 265 00:12:23,800 --> 00:12:25,700 Or calculate this separately. 266 00:12:25,700 --> 00:12:29,160 2*8 and 256, and then add and subtract them. 267 00:12:29,160 --> 00:12:30,290 Might be interesting. 
268 00:12:30,290 --> 00:12:33,140 So there are other ways of doing that, so in fact-- 269 00:12:33,140 --> 00:12:35,610 I don't know why, it might be in because [UNINTELLIGIBLE]. 270 00:12:35,610 --> 00:12:38,800 I don't know why they didn't do that. 271 00:12:38,800 --> 00:12:40,080 But it's thinking. 272 00:12:40,080 --> 00:12:43,380 It's thinking, look at all instructions, sequence, how 273 00:12:43,380 --> 00:12:45,000 long it would take, and it find 274 00:12:45,000 --> 00:12:45,880 this interesting sequence. 275 00:12:45,880 --> 00:12:49,410 So sometimes when you are looking into optimized code, 276 00:12:49,410 --> 00:12:52,350 they will look like something crazy that you can't read 277 00:12:52,350 --> 00:12:54,100 because it does things like this. 278 00:12:54,100 --> 00:12:56,590 So you found a simple multiply [UNINTELLIGIBLE], and end up 279 00:12:56,590 --> 00:12:58,120 with a piece of code like this. 280 00:12:58,120 --> 00:13:00,940 And so you need to kind of decipher, seeing what's going 281 00:13:00,940 --> 00:13:02,010 on backward. 282 00:13:02,010 --> 00:13:04,360 So that's why reading assembly is sometimes hard, especially 283 00:13:04,360 --> 00:13:06,800 optimized assembly. 284 00:13:06,800 --> 00:13:14,160 OK, so I did absolute value, and look, it did the bithack 285 00:13:14,160 --> 00:13:15,720 we basically learned in class. 286 00:13:15,720 --> 00:13:16,920 You can go look at that. 287 00:13:16,920 --> 00:13:21,150 This is the entire thing that Charles talked about to find 288 00:13:21,150 --> 00:13:21,960 absolute value. 289 00:13:21,960 --> 00:13:25,290 It knew that, so it has attended Charles' lecture. 290 00:13:25,290 --> 00:13:26,140 That [UNINTELLIGIBLE] 291 00:13:26,140 --> 00:13:28,162 programmer. 292 00:13:28,162 --> 00:13:33,230 OK, so here's interesting thing. 
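The transcript does not show the absolute-value code, so the following is an assumption: it is the commonly cited branchless form, and the name `abs_bithack` is hypothetical. It relies on right shift of a negative signed value being an arithmetic shift, which C leaves implementation-defined (GCC documents it as arithmetic), and it does not handle `INT64_MIN`.

```c
#include <stdint.h>

/* Branchless absolute value (sketch; assumes arithmetic right shift). */
int64_t abs_bithack(int64_t x)
{
    int64_t mask = x >> 63;     /* 0 if x >= 0, -1 (all ones) if x < 0 */
    return (x ^ mask) - mask;   /* x >= 0: x;  x < 0: ~x + 1, which is -x */
}
```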
293 00:13:33,230 --> 00:13:37,980 So what I did was I am doing an update, and I have this big 294 00:13:37,980 --> 00:13:40,900 large array here, and I'm checking the index to be 295 00:13:40,900 --> 00:13:44,500 within 0 and this value before I update it, because I 296 00:13:44,500 --> 00:13:48,100 don't want to write out of the bounds on this setting. 297 00:13:48,100 --> 00:13:48,760 OK? 298 00:13:48,760 --> 00:13:50,740 Makes sense, because I want to make sure that I 299 00:13:50,740 --> 00:13:52,550 write in the bound. 300 00:13:52,550 --> 00:13:55,630 Interesting thing here is I am doing two checks, here. 301 00:13:55,630 --> 00:13:58,280 I end up doing only one check here. 302 00:13:58,280 --> 00:13:59,530 What happened to my other check? 303 00:14:04,220 --> 00:14:08,412 AUDIENCE: [INAUDIBLE], how big was the array, and 304 00:14:08,412 --> 00:14:10,136 [INAUDIBLE]. 305 00:14:10,136 --> 00:14:11,772 PROFESSOR: No, no, [UNINTELLIGIBLE] something a 306 00:14:11,772 --> 00:14:14,540 lot simpler. 307 00:14:14,540 --> 00:14:19,180 So to give you a hint, this is unsigned value, and I'm doing 308 00:14:19,180 --> 00:14:20,430 unsigned compare. 309 00:14:25,400 --> 00:14:29,810 What happens if the value is less than 0? 310 00:14:29,810 --> 00:14:32,270 A signed value that's smaller than 0, what 311 00:14:32,270 --> 00:14:34,020 is it like in unsigned? 312 00:14:34,020 --> 00:14:36,540 It's a huge number. 313 00:14:36,540 --> 00:14:37,150 It [UNINTELLIGIBLE] 314 00:14:37,150 --> 00:14:41,340 to be bigger than this one, so because of that, it can just 315 00:14:41,340 --> 00:14:43,360 say, OK look, I don't have to check that. 316 00:14:43,360 --> 00:14:46,160 I can unsigned compare, and I will get anything less than 317 00:14:46,160 --> 00:14:47,490 zero also in there. 318 00:14:47,490 --> 00:14:49,590 So one more thing. 
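The single-compare trick described above can be sketched as follows. The array name, its size `N`, and the function names are hypothetical, since the slide code is not in the transcript. The point is that converting a negative signed index to unsigned gives a very large number, so one unsigned comparison covers both the lower and the upper bound.

```c
#include <stdint.h>

#define N 1024             /* hypothetical array size */
static int64_t A[N];

/* The two checks as a programmer would naturally write them. */
int in_bounds_two_checks(int64_t i)
{
    return i >= 0 && i < N;
}

/* One unsigned check: a negative i converts to a value far above N. */
int in_bounds_one_check(int64_t i)
{
    return (uint64_t)i < (uint64_t)N;
}

/* Guarded update using the single check. */
void update(int64_t i, int64_t v)
{
    if (in_bounds_one_check(i))
        A[i] = v;
}
```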
319 00:14:49,590 --> 00:14:53,675 So before I continue with this one, so if I actually put it 320 00:14:53,675 --> 00:14:58,180 in a loop and say I'm going to iterate [UNINTELLIGIBLE] to this 321 00:14:58,180 --> 00:14:59,580 value in here. 322 00:14:59,580 --> 00:15:03,420 Then what it'll do is, at that point, it'll inline this. 323 00:15:03,420 --> 00:15:05,030 And when it can complete [UNINTELLIGIBLE] near the 324 00:15:05,030 --> 00:15:09,600 check, it will know that, in fact, my bound is going from 0 325 00:15:09,600 --> 00:15:12,060 to this one, so I don't have to check the bound. 326 00:15:12,060 --> 00:15:15,880 So what this did was this inlined this function in here 327 00:15:15,880 --> 00:15:19,310 and got rid of the checks completely, and this is 328 00:15:19,310 --> 00:15:22,110 basically the branch condition in here because 329 00:15:22,110 --> 00:15:22,820 it said, OK, look. 330 00:15:22,820 --> 00:15:25,730 These things are redundant because I know I'm 331 00:15:25,730 --> 00:15:27,900 iterating from this to this value within these bounds 332 00:15:27,900 --> 00:15:29,880 [UNINTELLIGIBLE]. 333 00:15:29,880 --> 00:15:32,320 So it is smart in that. 334 00:15:32,320 --> 00:15:35,060 Do you see this, how this is going? 335 00:15:35,060 --> 00:15:35,940 Cool stuff the compiler does. 336 00:15:35,940 --> 00:15:40,140 So this is why compilers are smart, when they are smart. 337 00:15:40,140 --> 00:15:42,880 And the interesting thing is, here's another one. 338 00:15:42,880 --> 00:15:44,120 So now see [UNINTELLIGIBLE] 339 00:15:44,120 --> 00:15:47,040 imagine, because less than 0, I can do that. 340 00:15:47,040 --> 00:15:51,180 How about if I'm checking from 5,000? 341 00:15:51,180 --> 00:15:53,390 So we generated this funky code. 
342 00:15:53,390 --> 00:15:59,680 It subtracted a 6 here in the value, and then checked for 343 00:15:59,680 --> 00:16:01,200 [UNINTELLIGIBLE] 344 00:16:01,200 --> 00:16:05,250 because it kind of shifted the value to a 0 basis, basically. 345 00:16:05,250 --> 00:16:09,360 And then you can check that thing, and then can basically 346 00:16:09,360 --> 00:16:13,000 get two conditions down to one. 347 00:16:13,000 --> 00:16:17,410 See, the thing is there are many places where bound checks 348 00:16:17,410 --> 00:16:19,050 is very important. 349 00:16:19,050 --> 00:16:21,040 If you are doing a lot of adding compilation stuff like 350 00:16:21,040 --> 00:16:23,180 that, if you don't want to have buffer overflows and 351 00:16:23,180 --> 00:16:24,750 stuff, you want it with bound checks. 352 00:16:24,750 --> 00:16:27,390 And so optimizing bound checks is a very important thing. 353 00:16:27,390 --> 00:16:30,380 So having these kind of things can, in many programs, 354 00:16:30,380 --> 00:16:32,470 probably give good performance, so that's why 355 00:16:32,470 --> 00:16:36,200 compilers are really good at it and spend time trying to do 356 00:16:36,200 --> 00:16:36,870 bound checks. 357 00:16:36,870 --> 00:16:39,460 So this is kind of interesting way of doing that. 358 00:16:42,580 --> 00:16:45,630 So the next thing I want to look at is vectorization 359 00:16:45,630 --> 00:16:51,140 because all these machines we have have this as the 360 00:16:51,140 --> 00:16:53,820 instructions, that can run really fast, and you probably 361 00:16:53,820 --> 00:16:57,850 saw it in there, and see what kind of code will get produced 362 00:16:57,850 --> 00:16:59,180 after doing something like that. 363 00:16:59,180 --> 00:17:01,980 So here's a simple program. 364 00:17:01,980 --> 00:17:08,220 So I have two arrays, and I'm just copying A to B, something 365 00:17:08,220 --> 00:17:09,050 very simple. 
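The range check that starts at 5,000 can be sketched the same way as the zero-based one: subtract the lower bound to shift the range to a zero basis, then do a single unsigned compare. The lower bound 5000 comes from the lecture; the upper bound `HI` and the function names are assumptions made for illustration.

```c
#include <stdint.h>

#define LO 5000   /* lower bound mentioned in the lecture */
#define HI 6000   /* hypothetical upper bound */

/* Two comparisons, as written in the source program. */
int in_range_two_checks(int64_t i)
{
    return i >= LO && i < HI;
}

/* One comparison: values below LO wrap around to huge unsigned numbers. */
int in_range_one_check(int64_t i)
{
    return (uint64_t)i - LO < (uint64_t)(HI - LO);
}
```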
366 00:17:09,050 --> 00:17:11,569 And also the other thing to notice, I know exactly from 367 00:17:11,569 --> 00:17:14,109 where to where I'm copying, and I also know which arrays 368 00:17:14,109 --> 00:17:16,470 I'm copying, so when you look at it, it 369 00:17:16,470 --> 00:17:18,250 produces code like this. 370 00:17:18,250 --> 00:17:22,890 So what it's doing is it's basically making eax 0 here by 371 00:17:22,890 --> 00:17:24,510 doing xorl. 372 00:17:24,510 --> 00:17:32,090 And then basically, moving the value A into the xmm 373 00:17:32,090 --> 00:17:33,870 registers, much larger. 374 00:17:33,870 --> 00:17:35,040 Instead of having to [UNINTELLIGIBLE] 375 00:17:35,040 --> 00:17:38,970 16 of them, and copying it back into B. 376 00:17:38,970 --> 00:17:41,190 So basically, you are doing copying here, and 377 00:17:41,190 --> 00:17:42,230 increment by 16. 378 00:17:42,230 --> 00:17:44,968 So now, every iteration, you're copying 16 of them. 379 00:17:48,180 --> 00:17:51,070 Why could I just be done with just putting this 380 00:17:51,070 --> 00:17:52,070 small piece of code? 381 00:17:52,070 --> 00:17:55,130 What additional information is this taking advantage of? 382 00:17:58,064 --> 00:18:00,020 AUDIENCE: Does it know that [INAUDIBLE]? 383 00:18:03,940 --> 00:18:06,910 PROFESSOR: Exactly, because it knows that it goes from 0 to 384 00:18:06,910 --> 00:18:07,480 this value. 385 00:18:07,480 --> 00:18:09,550 In fact, it knows it's a multiple of 16. 386 00:18:09,550 --> 00:18:10,480 So it knows that. 387 00:18:10,480 --> 00:18:11,830 That's why it can do that. 388 00:18:11,830 --> 00:18:19,360 It knows exactly, and these things are nicely aligned to 389 00:18:19,360 --> 00:18:21,150 the boundaries, word boundaries. 390 00:18:21,150 --> 00:18:22,270 So it knows that. 391 00:18:22,270 --> 00:18:24,750 So I can read that, and I know all those facts, and that is 392 00:18:24,750 --> 00:18:27,400 why I can do this computation. 
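A sketch of the kind of loop being described: two file-scope arrays and a trip count fixed at compile time. The element type (bytes), the size 1024, and the name `copy_fixed` are assumptions; the slide code is not in the transcript. Because the compiler can see the arrays, their alignment, and a trip count that is a multiple of 16, it is free to copy 16 bytes per iteration with xmm moves, or even to replace the loop with a library copy.

```c
#include <stdint.h>

#define SIZE 1024              /* known at compile time, a multiple of 16 */

static uint8_t A[SIZE];
static uint8_t B[SIZE];

/* Copy A to B; bounds and arrays are both visible to the compiler. */
void copy_fixed(void)
{
    for (int i = 0; i < SIZE; i++)
        B[i] = A[i];
}
```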
393 00:18:27,400 --> 00:18:30,520 So now, you start doing that, did one simple change. 394 00:18:30,520 --> 00:18:36,150 Instead of going to a fixed value, I went to N. 395 00:18:36,150 --> 00:18:36,990 0 to N. 396 00:18:36,990 --> 00:18:40,070 I know where it starts, and I know where it's ending. 397 00:18:40,070 --> 00:18:45,010 Ending is somewhere at N. I don't know where the end is. 398 00:18:45,010 --> 00:18:48,080 Then this has to do something a little bit difficult. 399 00:18:48,080 --> 00:18:51,890 So the code produced looks like this because, now, I'm 400 00:18:51,890 --> 00:18:53,050 not going to go through this code. 401 00:18:53,050 --> 00:18:55,760 The only thing to say, this is actually doing still an xmm 402 00:18:55,760 --> 00:18:59,100 instruction, but it's trying to make sure that, because N 403 00:18:59,100 --> 00:19:01,750 might not be a multiple of 16, 404 00:19:01,750 --> 00:19:06,160 you have to take care of the final number of iterations 405 00:19:06,160 --> 00:19:09,400 outside that, so you have to go up to the multiple, and then 406 00:19:09,400 --> 00:19:12,170 basically do a normal loop one at a time. 407 00:19:12,170 --> 00:19:13,850 So it produces a little bit of a more 408 00:19:13,850 --> 00:19:15,960 complicated piece like that. 409 00:19:15,960 --> 00:19:17,080 So that's [UNINTELLIGIBLE] 410 00:19:17,080 --> 00:19:18,690 compiler has to do. 411 00:19:18,690 --> 00:19:21,560 And so then you have a piece of code like this. 412 00:19:30,760 --> 00:19:37,200 The interesting thing here is, now, I created basically a 413 00:19:37,200 --> 00:19:41,890 function, where it's not A and B. I'm giving two arrays as 414 00:19:41,890 --> 00:19:45,200 arguments, and then I'm giving a size to copy, and I'm 415 00:19:45,200 --> 00:19:46,450 copying that. 416 00:19:46,450 --> 00:19:49,710 And I would have an extremely complicated thing that's 417 00:19:49,710 --> 00:19:52,420 getting generated. 418 00:19:52,420 --> 00:19:53,650 Why is it complicated? 
419 00:19:53,650 --> 00:19:57,830 What do I have to know, when I get this function, to make 420 00:19:57,830 --> 00:20:01,620 sure that, first of all, it's still doing xmm 421 00:20:01,620 --> 00:20:04,640 somewhere in here. 422 00:20:04,640 --> 00:20:09,330 What that means is it's trying to do this very fast copy of a 423 00:20:09,330 --> 00:20:11,240 multiple using [UNINTELLIGIBLE] instruction. 424 00:20:11,240 --> 00:20:14,280 But what else can happen in this function? 425 00:20:14,280 --> 00:20:16,620 Because compilers have to deal with all the cases. 426 00:20:16,620 --> 00:20:18,490 What are the other cases it has to deal with? 427 00:20:21,286 --> 00:20:22,690 AUDIENCE: May not be aligned. 428 00:20:22,690 --> 00:20:25,680 PROFESSOR: May not be aligned because, for example, 429 00:20:25,680 --> 00:20:29,460 xmm assumes that they are at the word boundaries. 430 00:20:29,460 --> 00:20:33,490 When you read 16 bytes, we assume it's aligned with 431 00:20:33,490 --> 00:20:34,740 a 16-byte boundary. 432 00:20:34,740 --> 00:20:36,700 It might not be aligned, so you have no idea where these 433 00:20:36,700 --> 00:20:37,470 two are coming from. 434 00:20:37,470 --> 00:20:39,400 So that's one thing: they might not be aligned. 435 00:20:39,400 --> 00:20:39,870 What else? 436 00:20:39,870 --> 00:20:42,345 AUDIENCE: They don't even have to be a [INAUDIBLE]. 437 00:20:42,345 --> 00:20:47,295 Because I mean, that thing could just be copying up to N. 438 00:20:47,295 --> 00:20:50,760 But it might just be copying partially parts of the array. 439 00:20:50,760 --> 00:20:52,120 PROFESSOR: Yes, yeah, that's true. 440 00:20:52,120 --> 00:20:53,840 So what that means is because arrays [UNINTELLIGIBLE] it, 441 00:20:53,840 --> 00:20:56,370 but arrays are somewhere in memory; you just have two pointers 442 00:20:56,370 --> 00:20:57,610 to starting points. 
443 00:20:57,610 --> 00:20:59,840 That is what x and y are there: two starting points in 444 00:20:59,840 --> 00:21:02,160 main memory, and you start copying there. 445 00:21:02,160 --> 00:21:05,240 So what else can happen because of that? 446 00:21:05,240 --> 00:21:08,410 Because in A and B, we knew they were two separate arrays. 447 00:21:08,410 --> 00:21:09,660 What else can happen? 448 00:21:15,996 --> 00:21:16,980 Back there. 449 00:21:16,980 --> 00:21:19,932 AUDIENCE: [INAUDIBLE]. 450 00:21:19,932 --> 00:21:20,430 PROFESSOR: Yes. 451 00:21:20,430 --> 00:21:23,540 So arrays can start to overlap, and if arrays are 452 00:21:23,540 --> 00:21:26,060 overlapping, then you might end up in 453 00:21:26,060 --> 00:21:27,100 an interesting situation. 454 00:21:27,100 --> 00:21:29,140 So you have to figure out if the arrays are overlapping, 455 00:21:29,140 --> 00:21:30,270 and whether they're aligned. 456 00:21:30,270 --> 00:21:31,660 Actually, there are two types of alignment. 457 00:21:31,660 --> 00:21:37,460 One is self-aligning, so assume we took these arrays 458 00:21:37,460 --> 00:21:41,550 and started copying from byte 3. 459 00:21:41,550 --> 00:21:44,870 So then what you know is basically byte 3 to 16, it's 460 00:21:44,870 --> 00:21:45,600 not aligned. 461 00:21:45,600 --> 00:21:47,950 You can't start copying chunks in there. 462 00:21:47,950 --> 00:21:54,520 So you copy single bytes, and after you run 13 iterations, then 463 00:21:54,520 --> 00:21:57,950 you end up, again, at aligned chunks. 464 00:21:57,950 --> 00:22:01,710 So in that case, you just run some preamble to the first 465 00:22:01,710 --> 00:22:03,850 aligned place, and then you go to aligned chunks. 466 00:22:03,850 --> 00:22:08,000 But if this one is starting at 3 and this one is starting at 8, then 467 00:22:08,000 --> 00:22:09,160 they're never going to be aligned.
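The two alignment situations (byte 3 needing a 13-byte preamble, and offsets 3 versus 8 never lining up) can be made concrete with a small C sketch. The helper names `preamble_len` and `mutually_alignable` are hypothetical, and taking a pointer value modulo 16 through `uintptr_t` is an assumption that holds on common platforms rather than a guarantee of the C standard.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of single-byte copies needed before p reaches a 16-byte boundary. */
size_t preamble_len(const void *p)
{
    return (size_t)((16 - ((uintptr_t)p % 16)) % 16);
}

/* Source and destination can reach a 16-byte boundary at the same time
   only if they sit at the same offset within a 16-byte block. */
int mutually_alignable(const void *a, const void *b)
{
    return ((uintptr_t)a % 16) == ((uintptr_t)b % 16);
}
```

If the two pointers are not mutually alignable, a copy routine has to fall back to unaligned accesses or byte-wise copying for at least one side, which is one of the extra cases the compiler has to generate code for.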
468 00:22:09,160 --> 00:22:11,500 The things you are copying, A and B, are not aligned, so then you 469 00:22:11,500 --> 00:22:13,650 have to treat it differently. 470 00:22:13,650 --> 00:22:16,680 So there's a lot of different cases you have to handle, so if 471 00:22:16,680 --> 00:22:19,430 you just give something like this, the problem is the 472 00:22:19,430 --> 00:22:21,500 compiler has to deal with these thousands 473 00:22:21,500 --> 00:22:22,600 of different cases. 474 00:22:22,600 --> 00:22:25,800 And in this one, since this is small, it probably tore 475 00:22:25,800 --> 00:22:28,870 through all the possible cases at the [UNINTELLIGIBLE]. 476 00:22:28,870 --> 00:22:30,730 So it's dealing with all those, and checking over 477 00:22:30,730 --> 00:22:33,690 everything and trying to find the optimal case and do that fast, 478 00:22:33,690 --> 00:22:35,890 hoping that you get the optimal case for it. 479 00:22:35,890 --> 00:22:38,010 But if you have something more complicated, the compiler 480 00:22:38,010 --> 00:22:39,250 won't be able to do all of those things, so 481 00:22:39,250 --> 00:22:40,110 it might give up. 482 00:22:40,110 --> 00:22:43,200 So the interesting thing to note here is more information 483 00:22:43,200 --> 00:22:44,840 going into the compiler is better. 484 00:22:44,840 --> 00:22:49,020 And here, the compiler has to deal with a lot of things that 485 00:22:49,020 --> 00:22:51,020 probably won't happen. 486 00:22:51,020 --> 00:22:54,620 Another interesting thing is now, like the first time where I just 487 00:22:54,620 --> 00:23:00,620 copied A to B, in memcpy4, I just call memcpy3 488 00:23:00,620 --> 00:23:07,080 to do the same thing, A to B, copy 1024 [UNINTELLIGIBLE]. 489 00:23:07,080 --> 00:23:09,250 This is the beauty of inlining. 490 00:23:09,250 --> 00:23:16,100 So what it did was, in memcpy4, it inlined memcpy3 and 491 00:23:16,100 --> 00:23:21,310 substituted A and B for X and Y, and 1,024 for N.
492 00:23:21,310 --> 00:23:23,530 And it realized that it doesn't have to do all these 493 00:23:23,530 --> 00:23:25,450 tests like it did. 494 00:23:25,450 --> 00:23:29,785 What it generated is very close to what we got here 495 00:23:29,785 --> 00:23:31,460 because after inlining, it realizes, wait a minute, 496 00:23:31,460 --> 00:23:33,390 I'm copying A to B. I know where the start is. 497 00:23:33,390 --> 00:23:33,980 I know where the end is. 498 00:23:33,980 --> 00:23:34,570 I know the size. 499 00:23:34,570 --> 00:23:36,880 I know all of these things, and I don't have to do any of 500 00:23:36,880 --> 00:23:37,360 these tests. 501 00:23:37,360 --> 00:23:42,320 I can actually generate this very simple piece of code. 502 00:23:42,320 --> 00:23:46,080 So I think that is a neat thing. 503 00:23:46,080 --> 00:23:50,770 What this shows you is, in fact, you can build these 504 00:23:50,770 --> 00:23:54,700 general functions, and then you can call 505 00:23:54,700 --> 00:23:59,210 them, and if it is done right, the inlining will basically do 506 00:23:59,210 --> 00:23:59,735 all the optimizations. 507 00:23:59,735 --> 00:24:03,500 So you don't have to have 50 different memcpies for all the 508 00:24:03,500 --> 00:24:06,100 different things in your code. 509 00:24:06,100 --> 00:24:09,110 If you write a general function, and you call it in a 510 00:24:09,110 --> 00:24:12,900 way that it can get inlined, you get code that is as efficient 511 00:24:12,900 --> 00:24:14,175 as hand optimization. 512 00:24:17,665 --> 00:24:20,360 I think it's a really interesting thing, and what 513 00:24:20,360 --> 00:24:24,010 it does for you is, when you're doing projects, you don't have 514 00:24:24,010 --> 00:24:27,620 to write all of these very complex small functions, 515 00:24:27,620 --> 00:24:29,260 hand-inlined, stuff like that.
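A minimal C sketch of the memcpy3/memcpy4 pattern follows. The names `memcpy3`, `memcpy4`, `A`, and `B` come from the lecture, but the exact signatures, the `char` element type, and the `static inline` qualifier are assumptions, since the slide code is not reproduced in the transcript.

```c
#include <stddef.h>

#define LEN 1024

static char A[LEN], B[LEN];

/* General routine: copies n bytes from x into y. Compiled on its own, the
   compiler knows nothing about alignment, overlap, or n. */
static inline void memcpy3(const char *x, char *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = x[i];
}

/* Specific caller: once memcpy3 is inlined here, x, y, and n become the
   known arrays A and B and the constant 1024, so the runtime checks for
   alignment, overlap, and leftover iterations can all be dropped. */
void memcpy4(void)
{
    memcpy3(A, B, LEN);
}
```

Whether a particular compiler actually inlines and simplifies this depends on its version and optimization level, so it is worth inspecting the generated assembly, for example with `gcc -O3 -S`.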
516 00:24:29,260 --> 00:24:32,740 But it's always good to check that, in fact, the compiler's 517 00:24:32,740 --> 00:24:33,940 doing that because you don't know. 518 00:24:33,940 --> 00:24:35,880 You assume the compiler's doing that, and there might be 519 00:24:35,880 --> 00:24:36,890 cases it might not be. 520 00:24:36,890 --> 00:24:40,640 And I will show you one example here. 521 00:24:40,640 --> 00:24:43,634 So I want you guys to look at this function a little bit. 522 00:24:43,634 --> 00:24:44,470 OK? 523 00:24:44,470 --> 00:24:46,500 I am doing two memcpies. 524 00:24:46,500 --> 00:24:48,620 I am copying 1,024 elements. 525 00:24:48,620 --> 00:24:53,150 One, I'm doing ai+1, a into-- 526 00:24:53,150 --> 00:24:59,210 this is XY, this is X get copied into Y. ai+1 to A, so 527 00:24:59,210 --> 00:25:07,690 that means I have array like that, array like this. 528 00:25:11,090 --> 00:25:14,090 I am giving ai+1 as the source. 529 00:25:14,090 --> 00:25:17,720 I am giving this as the source, and I'm doing this as 530 00:25:17,720 --> 00:25:18,970 the destination. 531 00:25:21,570 --> 00:25:23,120 OK, what does this copy do? 532 00:25:28,830 --> 00:25:35,960 I'm copying 1,024, yes. 533 00:25:35,960 --> 00:25:39,870 So the first one, this one gets copied to here. 534 00:25:39,870 --> 00:25:41,390 Second iteration, this will get copied. 535 00:25:41,390 --> 00:25:43,250 Third iteration, this will get copied to here. 536 00:25:43,250 --> 00:25:44,672 What does it do? 537 00:25:44,672 --> 00:25:49,600 Yeah, I just do one left-shift of the array. 538 00:25:49,600 --> 00:25:50,850 My second example. 539 00:25:56,710 --> 00:26:00,720 I give this as my first element. 540 00:26:00,720 --> 00:26:02,550 This as my source. 541 00:26:02,550 --> 00:26:05,840 This as my destination. 542 00:26:05,840 --> 00:26:07,161 What happens here? 543 00:26:07,161 --> 00:26:09,570 AUDIENCE: All your copies have the same number. 544 00:26:09,570 --> 00:26:10,470 PROFESSOR: Exactly. 
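The two overlapping copies discussed here can be reproduced with a plain forward byte loop. This is a sketch under assumptions: the function name `copy_forward` is hypothetical, and the argument order (source `x`, destination `y`) follows the "X gets copied into Y" description.

```c
#include <stddef.h>

/* Forward, one-byte-at-a-time copy of n bytes from x to y.
   The pointers are deliberately not marked restrict, so the two
   regions are allowed to overlap. */
void copy_forward(const char *x, char *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = x[i];
}

/* copy_forward(a + 1, a, n): each element is read before it is
   overwritten, so the array shifts left by one position.

   copy_forward(a, a + 1, n): a[1] receives a[0], then a[2] receives the
   new a[1], and so on, so a[0] is replicated through the whole array. */
```

The second call is the case where copying 16 bytes at a time would change the result, so the compiler has to either keep the byte loop or recognize the pattern as a fill with a single value.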
545 00:26:10,470 --> 00:26:14,400 All of them will copy the same number. 546 00:26:14,400 --> 00:26:22,860 OK, so now, if you look at the code that's been produced, so 547 00:26:22,860 --> 00:26:26,110 the interesting thing here is it realizes, in this one, I 548 00:26:26,110 --> 00:26:31,530 can still do mmx because I can still copy, take a chunk, and 549 00:26:31,530 --> 00:26:34,020 copy it one back, take a chunk and copy it one back, take a 550 00:26:34,020 --> 00:26:35,270 chunk and copy it one back. 551 00:26:39,590 --> 00:26:42,300 OK, do you see that? 552 00:26:42,300 --> 00:26:44,060 But what does the next one do? 553 00:26:48,460 --> 00:26:49,540 [UNINTELLIGIBLE] 554 00:26:49,540 --> 00:26:51,905 mmx is, and what does this one do? 555 00:26:57,020 --> 00:26:58,490 Copying something from dl-- 556 00:27:18,510 --> 00:27:19,390 [? movzdl ?] 557 00:27:19,390 --> 00:27:21,050 array expressed dl. 558 00:27:25,730 --> 00:27:28,510 Reverse dl, is this copied, this one? 559 00:27:32,640 --> 00:27:33,640 I hope I copied it properly. 560 00:27:33,640 --> 00:27:35,170 That doesn't look right to me. 561 00:27:45,040 --> 00:27:45,890 So this is interesting. 562 00:27:45,890 --> 00:27:48,030 So I might have missed [UNINTELLIGIBLE]. 563 00:27:48,030 --> 00:27:49,740 I think it takes-- 564 00:27:53,830 --> 00:27:55,462 This doesn't look right, does it? 565 00:27:55,462 --> 00:27:57,310 AUDIENCE: So what's a bound? 566 00:27:57,310 --> 00:27:59,158 It's char, I see. 567 00:27:59,158 --> 00:28:00,082 One byte. 568 00:28:00,082 --> 00:28:01,190 PROFESSOR: Yeah, one byte. 569 00:28:01,190 --> 00:28:03,382 AUDIENCE: So what this is doing is just taking the first 570 00:28:03,382 --> 00:28:04,950 byte, and then just moving it. 571 00:28:04,950 --> 00:28:06,160 PROFESSOR: Into edx? 572 00:28:06,160 --> 00:28:06,930 AUDIENCE: [INAUDIBLE]. 573 00:28:06,930 --> 00:28:08,080 PROFESSOR: Oh, right, this is actually 574 00:28:08,080 --> 00:28:08,850 doing the right thing. 
575 00:28:08,850 --> 00:28:11,130 It's doing the right thing, so what it does is this moves this 576 00:28:11,130 --> 00:28:16,800 one into edx, the entire thing, and then this, which uses dl, 577 00:28:16,800 --> 00:28:19,390 gets the first byte out of edx. 578 00:28:19,390 --> 00:28:21,460 Do you see what's happening here? 579 00:28:21,460 --> 00:28:22,580 So a gets into-- 580 00:28:22,580 --> 00:28:23,780 first [UNINTELLIGIBLE] 581 00:28:23,780 --> 00:28:27,260 the first byte out in here, and then you keep copying that 582 00:28:27,260 --> 00:28:31,900 byte one at a time into this location. 583 00:28:31,900 --> 00:28:34,900 AUDIENCE: So it's not smart enough to [INAUDIBLE]. 584 00:28:34,900 --> 00:28:37,230 PROFESSOR: So this is where we find something interesting in 585 00:28:37,230 --> 00:28:39,390 the compiler. 586 00:28:39,390 --> 00:28:40,640 AUDIENCE: [INAUDIBLE]? 587 00:28:42,870 --> 00:28:54,450 PROFESSOR: So what it's doing is, in here, you are copying this 588 00:28:54,450 --> 00:28:57,130 byte into edx. 589 00:28:57,130 --> 00:29:00,120 So dl is the first byte out of edx. 590 00:29:00,120 --> 00:29:01,450 That's a byte, because what-- 591 00:29:01,450 --> 00:29:02,260 AUDIENCE: The first byte. 592 00:29:02,260 --> 00:29:07,000 PROFESSOR: Yes, because in the x86 register naming scheme, you 593 00:29:07,000 --> 00:29:13,100 do r for 64, e for 32, and dl is just the first byte 594 00:29:13,100 --> 00:29:14,950 [UNINTELLIGIBLE], but [UNINTELLIGIBLE] 595 00:29:14,950 --> 00:29:16,160 low byte of that. 596 00:29:16,160 --> 00:29:18,190 dh is just the higher byte. 597 00:29:18,190 --> 00:29:21,320 AUDIENCE: So why does it just take the first byte? 598 00:29:21,320 --> 00:29:24,260 PROFESSOR: Because this byte gets copied 599 00:29:24,260 --> 00:29:26,660 into everything here. 600 00:29:26,660 --> 00:29:28,510 So do you see that?
601 00:29:28,510 --> 00:29:30,790 This byte is the one that-- because that's what happens 602 00:29:30,790 --> 00:29:33,990 with this copy: basically everything 603 00:29:33,990 --> 00:29:34,320 [UNINTELLIGIBLE] 604 00:29:34,320 --> 00:29:36,195 got replaced by this first byte in here. 605 00:29:36,195 --> 00:29:41,880 And you incorporate in there, and then it just goes around 606 00:29:41,880 --> 00:29:42,900 copying it in here. 607 00:29:42,900 --> 00:29:47,420 So I tried it this way: what I did was basically 608 00:29:47,420 --> 00:29:49,630 go from 1 to 1,025. 609 00:29:49,630 --> 00:29:50,610 [UNINTELLIGIBLE] 610 00:29:50,610 --> 00:29:50,890 0. 611 00:29:50,890 --> 00:29:52,980 This is basically what happened. 612 00:29:52,980 --> 00:29:58,820 So from 1 to 1,025, I get A[0], and then basically that's what 613 00:29:58,820 --> 00:30:01,950 happens in here, same thing here. 614 00:30:01,950 --> 00:30:03,200 OK? 615 00:30:04,930 --> 00:30:07,780 Do you see what's going on? 616 00:30:07,780 --> 00:30:12,410 But I just did something else, so assume [UNINTELLIGIBLE] 617 00:30:12,410 --> 00:30:15,700 is doing right here. 618 00:30:15,700 --> 00:30:17,850 What happens if I just use something else? 619 00:30:17,850 --> 00:30:22,310 I use B[0], so it should be the same, shouldn't it? 620 00:30:22,310 --> 00:30:23,220 If I use B[0] 621 00:30:23,220 --> 00:30:29,935 here, instead of doing A[i], this should be B[i], isn't it? 622 00:30:29,935 --> 00:30:31,330 AUDIENCE: I'm sorry, where is [INAUDIBLE]? 623 00:30:31,330 --> 00:30:32,090 PROFESSOR: Yeah. 624 00:30:32,090 --> 00:30:34,120 It's a different array. 625 00:30:34,120 --> 00:30:35,190 Sorry, I didn't put it here. 626 00:30:35,190 --> 00:30:35,400 It's a different array. 627 00:30:35,400 --> 00:30:36,460 So instead of A[0] 628 00:30:36,460 --> 00:30:40,122 here, I put another array, B[0], here.
629 00:30:43,080 --> 00:30:44,270 Does it matter whether that's A[0] 630 00:30:44,270 --> 00:30:45,140 or B[0]? 631 00:30:45,140 --> 00:30:45,480 B[0] 632 00:30:45,480 --> 00:30:46,730 is a different array. 633 00:30:48,920 --> 00:30:51,830 It shouldn't matter because a different array, different 634 00:30:51,830 --> 00:30:54,400 element that you don't do that, but the interesting 635 00:30:54,400 --> 00:30:56,280 thing is if you do that, it managed to 636 00:30:56,280 --> 00:30:58,100 convert it into mmx. 637 00:30:58,100 --> 00:31:02,113 So this is where the compiler is basically falling short a 638 00:31:02,113 --> 00:31:07,404 little bit because it could have done this for these two. 639 00:31:07,404 --> 00:31:09,700 OK, it's the same thing because what it does is it 640 00:31:09,700 --> 00:31:12,190 takes this one element from me and copy it everywhere. 641 00:31:12,190 --> 00:31:16,810 I could have done it, but for some reason, the compiler 642 00:31:16,810 --> 00:31:19,200 decided if using a different array, B[0]. 643 00:31:19,200 --> 00:31:20,780 I can do it, but I'm doing A[0] 644 00:31:20,780 --> 00:31:24,650 [UNINTELLIGIBLE], even though these two are basically 645 00:31:24,650 --> 00:31:25,970 identical except these are different. 646 00:31:25,970 --> 00:31:28,110 I'm not doing any kind of [UNINTELLIGIBLE], anything. 647 00:31:28,110 --> 00:31:29,046 Question? 648 00:31:29,046 --> 00:31:30,296 AUDIENCE: [INAUDIBLE]? 649 00:31:32,308 --> 00:31:35,120 PROFESSOR: Yes, so what does is when it goes somewhere, 650 00:31:35,120 --> 00:31:37,050 it's doing pac. 651 00:31:37,050 --> 00:31:38,990 [UNINTELLIGIBLE] pac instructions in here. 652 00:31:38,990 --> 00:31:42,570 What it does is it takes a byte and kind of makes 653 00:31:42,570 --> 00:31:45,040 multiple copies of the byte and create a larger copy than 654 00:31:45,040 --> 00:31:48,270 the 16 copies in there, and then it can kind of stamp it 655 00:31:48,270 --> 00:31:48,760 everywhere. 
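The "make 16 copies, then stamp it everywhere" idea can be sketched in portable C as below. This is an illustration only: the compiler keeps the replicated byte in a vector register rather than in a small array, and the function name `fill_bytes` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Fill n bytes at dst with value by building a 16-byte template once
   and stamping it chunk by chunk. */
void fill_bytes(char *dst, char value, size_t n)
{
    char stamp[16];
    size_t i;

    for (i = 0; i < 16; i++)        /* build the template of 16 copies */
        stamp[i] = value;

    for (i = 0; i + 16 <= n; i += 16)
        memcpy(dst + i, stamp, 16); /* stamp one 16-byte chunk */

    for (; i < n; i++)              /* leftover bytes, one at a time */
        dst[i] = value;
}
```

Reading the value once up front is what makes this safe even when the value comes from the array being filled, which is the reasoning the compiler would need to apply to the A[0] case.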
656 00:31:48,760 --> 00:31:51,700 AUDIENCE: [INAUDIBLE] 657 00:31:51,700 --> 00:31:54,640 because the A[0] would go off of A[1]. 658 00:31:54,640 --> 00:31:56,380 PROFESSOR: So what it's doing is copying A[0] 659 00:31:56,380 --> 00:31:58,980 multiple times, one at a time, slowly. 660 00:31:58,980 --> 00:32:03,850 So what this does is it takes B[0], makes 16 copies in there, 661 00:32:03,850 --> 00:32:05,720 in registers, not in memory. 662 00:32:05,720 --> 00:32:08,230 I created a template of 16, and I kind of 663 00:32:08,230 --> 00:32:09,660 stamp it as I go. 664 00:32:09,660 --> 00:32:11,730 AUDIENCE: That's probably a failure of its alias analysis. 665 00:32:11,730 --> 00:32:12,900 PROFESSOR: Yeah, it's a failure [UNINTELLIGIBLE]. 666 00:32:12,900 --> 00:32:15,720 So this is where the compiler does some magic, kind of very 667 00:32:15,720 --> 00:32:17,060 complex things in here. 668 00:32:17,060 --> 00:32:20,810 But somewhere in the compiler it failed. 669 00:32:20,810 --> 00:32:26,840 So what you want here is code looking like this, but it 670 00:32:26,840 --> 00:32:27,570 produces this one. 671 00:32:27,570 --> 00:32:30,080 So this is where the compilers are great. 672 00:32:30,080 --> 00:32:36,620 They do some amazing things, but they're not infallible. 673 00:32:36,620 --> 00:32:39,270 There might be corner cases where the 674 00:32:39,270 --> 00:32:41,120 analysis fails. 675 00:32:41,120 --> 00:32:44,390 So that's why it's always good, even though you can take 676 00:32:44,390 --> 00:32:47,010 advantage of the compiler, to look at what it's generating. 677 00:32:47,010 --> 00:32:49,350 And sometimes when you tweak around it, you suddenly 678 00:32:49,350 --> 00:32:50,140 realize, wait a minute. 679 00:32:50,140 --> 00:32:53,090 I can get the compiler to do something better, and then you 680 00:32:53,090 --> 00:32:55,740 can say wait a minute, now how do I work myself back?
681 00:32:55,740 --> 00:33:00,520 Sometimes, you end up changing your C code a little bit, like the 682 00:33:00,520 --> 00:33:07,670 examples the TAs showed when they were doing their demo, 683 00:33:07,670 --> 00:33:10,200 that if you tweak it a little, you can actually get the 684 00:33:10,200 --> 00:33:11,740 compiler to do that, instead of trying to do 685 00:33:11,740 --> 00:33:12,990 these things by hand. 686 00:33:15,050 --> 00:33:17,010 So the compilers are powerful, but you have to be careful. 687 00:33:20,070 --> 00:33:23,460 Another interesting thing is this factorial here. 688 00:33:23,460 --> 00:33:26,080 So normal factorial is basically you make a function 689 00:33:26,080 --> 00:33:28,890 call with x-1 and multiply by x. 690 00:33:28,890 --> 00:33:34,060 But you know function calls are very expensive, and in 691 00:33:34,060 --> 00:33:36,750 fact, GCC knows that, too. 692 00:33:36,750 --> 00:33:40,580 So what GCC did is it basically eliminated the function call 693 00:33:40,580 --> 00:33:42,660 and converted it into its [UNINTELLIGIBLE] 694 00:33:42,660 --> 00:33:43,620 function here. 695 00:33:43,620 --> 00:33:48,660 So EDI has x in here, and then it goes to a loop, so it 696 00:33:48,660 --> 00:33:50,250 first checks with [UNINTELLIGIBLE]. 697 00:33:50,250 --> 00:33:51,900 If it is one, you go to the end and return it. 698 00:33:51,900 --> 00:33:56,150 You are done if x is 1 or less than 1. 699 00:33:56,150 --> 00:33:57,890 And otherwise, it goes through a loop. 700 00:33:57,890 --> 00:33:59,330 It doesn't do any function call. 701 00:33:59,330 --> 00:34:04,260 It just basically calculates this fact value inside EAX and 702 00:34:04,260 --> 00:34:07,100 keeps multiplying [UNINTELLIGIBLE]. 703 00:34:07,100 --> 00:34:13,219 So it can take these simple recursive functions, and also 704 00:34:13,219 --> 00:34:15,030 convert them into [UNINTELLIGIBLE]. 705 00:34:15,030 --> 00:34:18,080 So it does some very, very fancy stuff in the compiler.
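The factorial transformation can be written out by hand as a before-and-after pair in C. Both functions are sketches with hypothetical names; the loop version mirrors the described behavior (return immediately when x is 1 or less, otherwise keep multiplying into an accumulator) and is not a transcription of GCC's output.

```c
/* Recursive form, as a programmer would normally write it. */
unsigned long fact_rec(unsigned long x)
{
    if (x <= 1)
        return 1;
    return x * fact_rec(x - 1);
}

/* Loop form, roughly what the compiler turns the recursion into:
   no calls, just an accumulator that keeps getting multiplied. */
unsigned long fact_loop(unsigned long x)
{
    unsigned long result = 1;

    while (x > 1) {
        result *= x;
        x--;
    }
    return result;
}
```

The loop multiplies in the opposite order from the recursion, which is safe here because unsigned integer multiplication is associative and commutative; the same rewrite would not be exact for floating-point values.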
706 00:34:18,080 --> 00:34:20,320 So the compilers are fun when they work. 707 00:34:20,320 --> 00:34:24,469 But the key thing is there are many cases where it doesn't work. 708 00:34:24,469 --> 00:34:26,179 So next, I want to switch gears. 709 00:34:26,179 --> 00:34:28,920 Any questions so far? 710 00:34:28,920 --> 00:34:31,380 Sometimes it's fun to find breaking 711 00:34:31,380 --> 00:34:32,089 points in the compiler. 712 00:34:32,089 --> 00:34:33,339 AUDIENCE: [INAUDIBLE]? 713 00:34:39,234 --> 00:34:41,389 PROFESSOR: If I use, in some-- 714 00:34:41,389 --> 00:34:42,659 this is not static? 715 00:34:42,659 --> 00:34:44,250 AUDIENCE: Yeah. 716 00:34:44,250 --> 00:34:45,920 PROFESSOR: No, it won't make a difference because what it 717 00:34:45,920 --> 00:34:48,440 says is if it is not static, it's visible to the outside 718 00:34:48,440 --> 00:34:52,360 world, but within this function, it's the same. 719 00:34:52,360 --> 00:34:57,200 So what it does is kind of like limiting namespace pollution 720 00:34:57,200 --> 00:34:59,800 because otherwise what happens is everybody outside will see 721 00:34:59,800 --> 00:35:02,130 these names. 722 00:35:02,130 --> 00:35:04,150 So if you use that name again somewhere, it might 723 00:35:04,150 --> 00:35:05,415 just use this one. 724 00:35:05,415 --> 00:35:06,850 [UNINTELLIGIBLE PHRASE] 725 00:35:06,850 --> 00:35:09,630 this is within the file, nobody else should see this. 726 00:35:09,630 --> 00:35:11,700 It's creating a local copy. 727 00:35:11,700 --> 00:35:15,950 It's kind of a poor man's class hierarchy. 728 00:35:15,950 --> 00:35:21,170 In Java, basically, each file is a single class, and you 729 00:35:21,170 --> 00:35:24,420 make sure that things inside the class are 730 00:35:24,420 --> 00:35:26,080 not visible to the outside. 731 00:35:26,080 --> 00:35:29,660 When you make it static, you make it only visible within that 732 00:35:29,660 --> 00:35:31,380 file, so you kind of make [UNINTELLIGIBLE].
733 00:35:31,380 --> 00:35:35,310 You can think about your file as your class, and so you 734 00:35:35,310 --> 00:35:38,710 limit the scope of the variable doing that. 735 00:35:38,710 --> 00:35:42,156 So some benefits of object-orientedness can be 736 00:35:42,156 --> 00:35:44,570 [UNINTELLIGIBLE], I guess. 737 00:35:44,570 --> 00:35:45,410 [UNINTELLIGIBLE] static variable, 738 00:35:45,410 --> 00:35:47,130 it's not a class variable. 739 00:35:52,260 --> 00:35:56,910 OK, so next, before I get into doing compilers and say what 740 00:35:56,910 --> 00:35:59,580 compilers do, I want to give you a [UNINTELLIGIBLE]. 741 00:35:59,580 --> 00:36:01,970 There are many different places where you can do 742 00:36:01,970 --> 00:36:02,955 optimization. 743 00:36:02,955 --> 00:36:06,270 So if you look at what happens in the program, program first 744 00:36:06,270 --> 00:36:09,390 goes through compile time, compiles each file, then it 745 00:36:09,390 --> 00:36:11,200 links all the files together. 746 00:36:11,200 --> 00:36:14,790 At some point, the files will get loaded into your machine, 747 00:36:14,790 --> 00:36:16,930 and then it'll be running. 748 00:36:16,930 --> 00:36:20,480 So if you load things in a compiler, you have full access 749 00:36:20,480 --> 00:36:25,440 to source code, it's very easy to kind of look at the 750 00:36:25,440 --> 00:36:28,380 high-level transformation, low-level transformation, you 751 00:36:28,380 --> 00:36:32,400 can look at the entire gamut of things to do. 752 00:36:32,400 --> 00:36:33,940 And the nice thing about compilers, 753 00:36:33,940 --> 00:36:35,370 compilers can be slow. 754 00:36:35,370 --> 00:36:36,480 Nobody's going to complain. 755 00:36:36,480 --> 00:36:39,470 It's not going to be part of your run-time, so you just 756 00:36:39,470 --> 00:36:42,940 would wait, but it's you, not the customer. 757 00:36:42,940 --> 00:36:45,050 But the problem with compilers, it doesn't see the 758 00:36:45,050 --> 00:36:45,660 whole programs. 
759 00:36:45,660 --> 00:36:48,270 You see a file at a time, so all these inline things and 760 00:36:48,270 --> 00:36:49,340 stuff have to be in the file. 761 00:36:49,340 --> 00:36:51,130 You can't put it in a different file and get [UNINTELLIGIBLE]. 762 00:36:54,760 --> 00:36:56,940 And it also doesn't know the run-time conditions because 763 00:36:56,940 --> 00:36:57,760 that's run-time. 764 00:36:57,760 --> 00:37:04,390 It might be having different inputs, different size of load 765 00:37:04,390 --> 00:37:06,100 and stuff, and I don't know any of those things. 766 00:37:06,100 --> 00:37:08,200 And also, I don't know about the architecture, so my 767 00:37:08,200 --> 00:37:12,750 compiler has to make sure that it works on AMD machines, 768 00:37:12,750 --> 00:37:14,850 Intel machines, stuff like that. Of course, you can use 769 00:37:14,850 --> 00:37:18,010 special flags and try to compile for one machine, and 770 00:37:18,010 --> 00:37:19,760 it breaks on the other one. 771 00:37:19,760 --> 00:37:21,470 But you don't want to do that, so the compiler has to be a 772 00:37:21,470 --> 00:37:24,870 lot more general, and this can be sometimes problematic. 773 00:37:24,870 --> 00:37:28,070 So when you're going to link, the nice thing is that this is 774 00:37:28,070 --> 00:37:30,370 a place you have the entire program available. 775 00:37:30,370 --> 00:37:32,510 Sometimes, people try to do things like inlining at 776 00:37:32,510 --> 00:37:34,940 link time, because that means you know everything in there, 777 00:37:34,940 --> 00:37:37,010 so even with a couple of different files, I can inline 778 00:37:37,010 --> 00:37:38,650 because I have access to them here. 779 00:37:38,650 --> 00:37:43,430 And still, there might be things that are not available, 780 00:37:43,430 --> 00:37:45,930 like dynamically-loaded classes and dynamically-loaded 781 00:37:45,930 --> 00:37:48,960 data, so things in something like Java might not be available.
782 00:37:48,960 --> 00:37:50,150 And of course, you don't have access to 783 00:37:50,150 --> 00:37:51,636 source most of the time. 784 00:37:51,636 --> 00:37:53,124 AUDIENCE: Sorry, sir. 785 00:37:53,124 --> 00:37:54,612 What do you mean [INAUDIBLE]? 786 00:37:54,612 --> 00:37:58,580 Do you have the full program [INAUDIBLE]? 787 00:37:58,580 --> 00:38:02,560 But so how do you say that [INAUDIBLE]? 788 00:38:02,560 --> 00:38:04,760 PROFESSOR: So dynamic links, if you have something like 789 00:38:04,760 --> 00:38:07,280 Java, there might be some data that kind of get dynamically 790 00:38:07,280 --> 00:38:09,780 generated or dynamically linked. 791 00:38:09,780 --> 00:38:11,980 So when you're running, if you're running right 792 00:38:11,980 --> 00:38:16,060 [UNINTELLIGIBLE] your browser, all those Javascript classes 793 00:38:16,060 --> 00:38:17,910 and stuff like that, you don't have access to because those 794 00:38:17,910 --> 00:38:18,990 are coming in here. 795 00:38:18,990 --> 00:38:21,350 So there might be places, things that 796 00:38:21,350 --> 00:38:23,030 it gets as it runs. 797 00:38:23,030 --> 00:38:26,370 Not in C, but in other languages. 798 00:38:26,370 --> 00:38:29,070 And the load is interesting time. 799 00:38:29,070 --> 00:38:32,090 Here, load time is important because when you double-click, 800 00:38:32,090 --> 00:38:33,710 you want your program to appear fast. 801 00:38:33,710 --> 00:38:36,120 You don't want it to take a long time. 802 00:38:36,120 --> 00:38:42,270 But you have kind of access to all that code in here, and you 803 00:38:42,270 --> 00:38:44,700 have some idea about the run-time, also, the 804 00:38:44,700 --> 00:38:47,720 architecture, and stuff like that, what you have, not the 805 00:38:47,720 --> 00:38:49,500 run-time, but the architecture, exactly what 806 00:38:49,500 --> 00:38:50,580 machines you are running. 807 00:38:50,580 --> 00:38:52,750 And then, of course, you can do it run-time. 
808 00:38:52,750 --> 00:38:54,470 The thing about run-time is you have full knowledge of 809 00:38:54,470 --> 00:38:58,560 everything, it's great, but every clock cycle you spend 810 00:38:58,560 --> 00:39:00,380 optimizing is one clock cycle you take 811 00:39:00,380 --> 00:39:01,360 away from the program. 812 00:39:01,360 --> 00:39:04,455 So things like Java JIT compilers, they try to do 813 00:39:04,455 --> 00:39:06,270 minimal things, so very fast things. 814 00:39:06,270 --> 00:39:08,280 It can't do a lot of complicated things because 815 00:39:08,280 --> 00:39:09,950 it's too expensive. 816 00:39:09,950 --> 00:39:14,110 OK, so we're not talking about any of these things any more, 817 00:39:14,110 --> 00:39:17,730 but it's always good to know, as you go about using Python 818 00:39:17,730 --> 00:39:21,210 or Java or JavaScript and stuff like this where is this 819 00:39:21,210 --> 00:39:22,860 thing happening to my code? 820 00:39:22,860 --> 00:39:24,720 Because it might not be all compile-time stuff. 821 00:39:24,720 --> 00:39:26,200 It might be happening at different stages. 822 00:39:26,200 --> 00:39:28,700 So you need to know who's actually mucking with your 823 00:39:28,700 --> 00:39:30,370 code, and know that there are other people who can 824 00:39:30,370 --> 00:39:31,620 muck with your code. 825 00:39:33,470 --> 00:39:37,390 So next, I want to switch into dataflow analysis. 826 00:39:37,390 --> 00:39:40,855 So this is what compilers are good at, and compilers try to 827 00:39:40,855 --> 00:39:41,830 do all the time. 828 00:39:41,830 --> 00:39:46,770 So it's basically compile-time reasoning about run-time 829 00:39:46,770 --> 00:39:48,990 values and variables, or expressions, within the 830 00:39:48,990 --> 00:39:52,310 program at different program points. 831 00:39:52,310 --> 00:39:54,075 OK, so that means compile-time, I need to know I 832 00:39:54,075 --> 00:39:58,400 have this program point, what could it be. 
833 00:39:58,400 --> 00:40:00,700 So things like which assignment statement produced 834 00:40:00,700 --> 00:40:03,190 a value or variable that I am using? 835 00:40:03,190 --> 00:40:05,750 OK, if I use a value, who actually created that value? 836 00:40:05,750 --> 00:40:12,090 Or which variable contain values that are no longer 837 00:40:12,090 --> 00:40:13,320 being used by somebody here? 838 00:40:13,320 --> 00:40:16,390 So that means I am trying to analyze the program and watch 839 00:40:16,390 --> 00:40:20,100 the range of values that each variable can have. 840 00:40:20,100 --> 00:40:23,580 So the key thing here is this has to be true for every 841 00:40:23,580 --> 00:40:26,920 possible input at every possible execution. 842 00:40:26,920 --> 00:40:29,340 Normally, [UNINTELLIGIBLE], and this time, I know why my 843 00:40:29,340 --> 00:40:32,270 variable [UNINTELLIGIBLE], but every possible time, this has 844 00:40:32,270 --> 00:40:33,360 to be true. 845 00:40:33,360 --> 00:40:36,940 OK, if there's a condition that something can happen, you 846 00:40:36,940 --> 00:40:39,670 have to make sure that condition is not going to 847 00:40:39,670 --> 00:40:41,520 break your program. 848 00:40:41,520 --> 00:40:44,500 Last thing you want from optimizer is to basically 849 00:40:44,500 --> 00:40:46,870 start producing different results. 850 00:40:46,870 --> 00:40:49,350 Even [UNINTELLIGIBLE], it's not good, so you want a 851 00:40:49,350 --> 00:40:51,990 compile optimizer to kind of produce the same result that 852 00:40:51,990 --> 00:40:53,770 you got without optimizing. 853 00:40:53,770 --> 00:40:57,070 And this is why this has to be [UNINTELLIGIBLE]. 854 00:40:57,070 --> 00:41:00,350 So first, I want to go through a little bit of example, what 855 00:41:00,350 --> 00:41:01,990 kind of things the compiler do. 856 00:41:01,990 --> 00:41:05,290 You probably have seen this in one of the earlier lectures. 
857 00:41:05,290 --> 00:41:08,170 We talked about some of this as hand optimizations, but I'm 858 00:41:08,170 --> 00:41:10,695 going to go through some of them by using this program. 859 00:41:16,380 --> 00:41:18,130 It doesn't mean anything what I'm doing here. 860 00:41:18,130 --> 00:41:19,490 I have a loop here. 861 00:41:19,490 --> 00:41:21,880 I'm calculating some function in here. 862 00:41:21,880 --> 00:41:26,080 And then I am adding something else to this x here, and I 863 00:41:26,080 --> 00:41:29,640 have some initializations in here, just something that I 864 00:41:29,640 --> 00:41:30,770 can demonstrate what it does. 865 00:41:30,770 --> 00:41:34,020 So it has no meaning for this one. 866 00:41:34,020 --> 00:41:36,050 And here's the assembly instructions. 867 00:41:36,050 --> 00:41:39,490 I'm not going to go through assembly, but [UNINTELLIGIBLE] 868 00:41:39,490 --> 00:41:41,720 you can actually create and understand why this is 869 00:41:41,720 --> 00:41:43,360 happening in [INAUDIBLE]. 870 00:41:43,360 --> 00:41:47,710 So I [UNINTELLIGIBLE] into two slides. 871 00:41:47,710 --> 00:41:51,780 The first thing you can do is think of constant propagation. 872 00:41:51,780 --> 00:41:57,180 So what it says is for all possible executions, if a 873 00:41:57,180 --> 00:42:00,700 value that has in a variable is the same, and we know that 874 00:42:00,700 --> 00:42:03,220 value, that's a constant. 875 00:42:03,220 --> 00:42:05,060 And I don't have to keep that value in that variable. 876 00:42:05,060 --> 00:42:06,740 I can replace that with a constant. 877 00:42:12,590 --> 00:42:14,600 Sometimes, when you look at dataflow optimization, you can 878 00:42:14,600 --> 00:42:15,880 say this is done. 879 00:42:15,880 --> 00:42:18,330 As a programmer, I will never do that. 880 00:42:18,330 --> 00:42:21,280 This is something you should be doing, for example, have 881 00:42:21,280 --> 00:42:24,100 things like constant variables that constant values or lower. 
882 00:42:24,100 --> 00:42:26,500 But sometimes, something looks dumb, but what happens is 883 00:42:26,500 --> 00:42:29,190 sometimes when you run optimizations, one 884 00:42:29,190 --> 00:42:32,390 optimization might lead to code that looks like that. 885 00:42:32,390 --> 00:42:33,230 That can lead to it. 886 00:42:33,230 --> 00:42:36,290 I will show you that sometimes you might not 887 00:42:36,290 --> 00:42:39,430 write code that looks dumb, but a previous optimization will 888 00:42:39,430 --> 00:42:42,820 leave, or change, the code in a way that this optimization can 889 00:42:42,820 --> 00:42:44,920 take advantage of. 890 00:42:44,920 --> 00:42:49,150 So the nice thing about this is you don't need to keep values 891 00:42:49,150 --> 00:42:51,800 in the variables, so you can free some variables, and that 892 00:42:51,800 --> 00:42:53,582 means you free up registers. 893 00:42:53,582 --> 00:42:56,456 Also, most of the time when you do constant propagation it 894 00:42:56,456 --> 00:42:59,130 leads to [UNINTELLIGIBLE] optimizations. 895 00:42:59,130 --> 00:43:02,320 So in this program what are the things that can be 896 00:43:02,320 --> 00:43:03,650 constant propagated? 897 00:43:03,650 --> 00:43:09,700 So we know x equals 0, x is constant up to this point. 898 00:43:09,700 --> 00:43:16,760 But since x gets modified here, my dataflow says wait a minute, 899 00:43:16,760 --> 00:43:20,860 I am going through this loop, and x is constant 900 00:43:20,860 --> 00:43:21,670 from here to here. 901 00:43:21,670 --> 00:43:24,230 But after this point, x is not constant because it gets 902 00:43:24,230 --> 00:43:25,780 modified in here. 903 00:43:25,780 --> 00:43:28,130 So that's what dataflow is going to say, and so I 904 00:43:28,130 --> 00:43:29,540 can't do that x. 905 00:43:29,540 --> 00:43:31,310 But [UNINTELLIGIBLE] 906 00:43:31,310 --> 00:43:32,870 why does it become constant here?
907 00:43:32,870 --> 00:43:35,760 All input that goes into this loop has to go through here, so it 908 00:43:35,760 --> 00:43:37,940 becomes constant in every path. 909 00:43:37,940 --> 00:43:41,900 And then it doesn't get modified in this loop at all. 910 00:43:41,900 --> 00:43:45,770 OK, so then I can actually, through constant propagation, 911 00:43:45,770 --> 00:43:47,650 get to the file. 912 00:43:47,650 --> 00:43:50,360 OK, so now I have a program like that. 913 00:43:50,360 --> 00:43:52,880 So normal compiler optimization is done 914 00:43:52,880 --> 00:43:53,860 pass by pass. 915 00:43:53,860 --> 00:43:55,790 A lot of passes get repeated multiple times, so I 916 00:43:55,790 --> 00:43:56,590 leave it like this. 917 00:43:56,590 --> 00:44:01,360 So even though this is just a simple thing, we leave it 918 00:44:01,360 --> 00:44:04,070 to somebody else to optimize that, which is what we call 919 00:44:04,070 --> 00:44:06,020 algebraic simplification. 920 00:44:06,020 --> 00:44:09,460 Basically, it says you go to your, whatever, fourth grade, 921 00:44:09,460 --> 00:44:12,120 fifth grade, sixth grade algebra book-- 922 00:44:12,120 --> 00:44:13,790 I don't know where you learn, somewhere you learned 923 00:44:13,790 --> 00:44:18,180 algebra-- and they have all these very simple rules, like 924 00:44:18,180 --> 00:44:20,980 something multiplied by 0 is 0, multiply 1 by that, and all 925 00:44:20,980 --> 00:44:24,030 of those rules, and then you can just basically code them up and 926 00:44:24,030 --> 00:44:26,460 look for these patterns and replace. 927 00:44:26,460 --> 00:44:28,230 And that's what the compiler does. 928 00:44:28,230 --> 00:44:30,980 And in fact, we look at something like this, a simple 929 00:44:30,980 --> 00:44:33,490 shape, but you saw before that it does much more 930 00:44:33,490 --> 00:44:35,220 complicated things.
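A rough sketch of these first two passes. The slide's code is not in the transcript, so the function below is a hypothetical stand-in: the name `sumcalc` and the exact expressions are assumptions chosen to match what the lecture describes (a loop, an `x` that is updated, a `y` that is set to 0 and never changed). The second version shows roughly what remains after `y` is replaced by the constant 0 and the rules "times 0 is 0" and "plus 0 is itself" are applied.

```c
/* Hypothetical reconstruction of the example program; names and
   expressions are assumptions, not the slide's actual code. */
int sumcalc_original(int a, int b, int n)
{
    int x = 0;
    int y = 0;
    for (int i = 0; i <= n; i++) {
        x = x + (4 * a / b) * i + (i + 1) * (i + 1);
        x = x + b * y;   /* y is 0 on every path that reaches here */
    }
    return x;
}

/* After constant propagation (y becomes 0) and algebraic simplification
   (b * 0 becomes 0, x + 0 becomes x). */
int sumcalc_simplified(int a, int b, int n)
{
    int x = 0;
    int y = 0;
    (void)y;             /* nothing reads y any more */
    for (int i = 0; i <= n; i++) {
        x = x + (4 * a / b) * i + (i + 1) * (i + 1);
        /* the second statement has been reduced to the self-copy
           "x = x"; it is shown as a comment to keep the sketch
           warning-free */
    }
    return x;
}
```

Both versions return the same value for any input where `b` is nonzero, which is the property every pass has to preserve.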
931 00:44:35,220 --> 00:44:38,730 And it's a lot less work at run-time, and also it leads to 932 00:44:38,730 --> 00:44:41,870 more optimization, so it can simplify things in here. 933 00:44:41,870 --> 00:44:46,550 And the other thing is, sometimes with algebraic 934 00:44:46,550 --> 00:44:50,080 simplification, there are kind of weird things. 935 00:44:50,080 --> 00:44:54,100 If you want exact precision, for example if you're doing 936 00:44:54,100 --> 00:44:58,270 floating point, because in floating point, a plus b plus 937 00:44:58,270 --> 00:45:04,660 c and b plus c plus a are different because you can get 938 00:45:04,660 --> 00:45:09,520 small teeny differences in these kind of-- 939 00:45:09,520 --> 00:45:10,670 [UNINTELLIGIBLE] 940 00:45:10,670 --> 00:45:12,850 and associativity, and some people care. 941 00:45:12,850 --> 00:45:15,500 Most people don't because it's so small, most people, they 942 00:45:15,500 --> 00:45:16,240 don't care. 943 00:45:16,240 --> 00:45:16,930 Others do. 944 00:45:16,930 --> 00:45:20,010 And also sometimes when you do this optimization, things like 945 00:45:20,010 --> 00:45:24,310 overflow and underflow, that happens because if I do x plus 946 00:45:24,310 --> 00:45:30,340 x minus x, and x is very large, x plus x might overflow, and 947 00:45:30,340 --> 00:45:32,960 then you end up doing minus x after it overflows. 948 00:45:32,960 --> 00:45:36,120 But if instead of x plus x minus x, it's just x, you don't 949 00:45:36,120 --> 00:45:37,450 overflow anymore. 950 00:45:37,450 --> 00:45:39,880 So you have changed the behavior of the program, but 951 00:45:39,880 --> 00:45:42,420 most of the time, compilers think that things like that 952 00:45:42,420 --> 00:45:46,020 are special cases. 953 00:45:46,020 --> 00:45:48,700 They are not the normal behavior, so changing them is 954 00:45:48,700 --> 00:45:49,290 probably OK. 955 00:45:49,290 --> 00:45:53,250 Sometimes, you can't do anything.
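The floating-point point can be checked directly. This is a small sketch with a hypothetical helper name, `grouping_matters`; the specific values used to exercise it are illustrative choices, not from the lecture. It compares the two ways of grouping a three-term sum in IEEE double arithmetic.

```c
/* Returns 1 when (a + b) + c and a + (b + c) round to different doubles,
   0 when they agree. Hypothetical helper for illustration only. */
int grouping_matters(double a, double b, double c)
{
    double left  = (a + b) + c;   /* add a and b first */
    double right = a + (b + c);   /* add b and c first */
    return left != right;
}
```

With a = 1e20, b = -1e20, c = 1.0, the first grouping cancels the large terms and then adds 1, while the second loses the 1 when it is added to -1e20, so the two results differ. This is why a compiler will not reassociate floating-point sums unless it is told that such differences are acceptable.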
956 00:45:53,250 --> 00:45:55,400 So now here, what algebraic 957 00:45:55,400 --> 00:45:56,650 simplifications can I do? 958 00:46:05,640 --> 00:46:06,890 What can I do here? 959 00:46:13,320 --> 00:46:15,540 Yeah, I multiply by 0, [UNINTELLIGIBLE] 960 00:46:15,540 --> 00:46:16,250 this, that. 961 00:46:16,250 --> 00:46:18,880 Add 0, I leave it here, and then there's another algebraic 962 00:46:18,880 --> 00:46:22,570 simplification, I can do that, but now, I am leaving it here 963 00:46:22,570 --> 00:46:24,960 because there's no algebraic simplification. 964 00:46:24,960 --> 00:46:26,720 x equals x is-- 965 00:46:26,720 --> 00:46:27,610 there's nothing you can do. 966 00:46:27,610 --> 00:46:30,530 That's called copy propagation. 967 00:46:30,530 --> 00:46:33,370 Copy propagation says you're just making a copy of one 968 00:46:33,370 --> 00:46:35,940 value to another, just get another copy. 969 00:46:35,940 --> 00:46:37,160 You don't need to do a copy. 970 00:46:37,160 --> 00:46:38,820 Very simple thing in here. 971 00:46:38,820 --> 00:46:40,750 Fewer instructions, less memory or fewer registers 972 00:46:40,750 --> 00:46:42,180 because we are not copying. 973 00:46:42,180 --> 00:46:44,300 However, when we [UNINTELLIGIBLE] 974 00:46:44,300 --> 00:46:47,410 register allocation, I will talk, basically. 975 00:46:47,410 --> 00:46:51,710 If I use the same register now, I might have things that 976 00:46:51,710 --> 00:46:55,300 were in two registers, x copied to y, now it's all in x. 977 00:46:55,300 --> 00:46:59,330 So that means I might have some variable in the register 978 00:46:59,330 --> 00:47:02,850 that you call my interference graph. 979 00:47:02,850 --> 00:47:05,440 I'll talk about this in a little while, so I'm just 980 00:47:05,440 --> 00:47:06,710 forward referencing. 981 00:47:06,710 --> 00:47:11,440 That might not be easily register allocatable. 982 00:47:11,440 --> 00:47:12,600 And so in here, x equals x. 983 00:47:12,600 --> 00:47:14,680 I can get rid of that.
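Copy propagation in its general form, as a minimal sketch. The function names and the expression are hypothetical, picked only to show a copy whose uses can read the original value directly.

```c
/* Before: t exists only as a copy of a. */
int scale_before(int a, int b)
{
    int t = a;          /* the copy */
    return t * b + t;   /* both uses read the copy */
}

/* After copy propagation: every use of t reads a instead, so the copy
   and the storage for t are no longer needed. */
int scale_after(int a, int b)
{
    return a * b + a;
}
```

The cost mentioned above shows up here too: `a` now has to stay available for longer than it did when `t` carried the value.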
984 00:47:14,680 --> 00:47:16,840 And another interesting thing is common subexpression 985 00:47:16,840 --> 00:47:17,620 elimination. 986 00:47:17,620 --> 00:47:19,460 If you do the same thing multiple times, you calculate 987 00:47:19,460 --> 00:47:23,550 it once, less computation. The con is you need to keep this 988 00:47:23,550 --> 00:47:26,220 result somewhere between the two uses. 989 00:47:26,220 --> 00:47:28,700 So if I have too many of these things, I might just run out 990 00:47:28,700 --> 00:47:31,470 of registers to keep these values calculated. 991 00:47:31,470 --> 00:47:34,340 And also an interesting thing is, this can hinder things like 992 00:47:34,340 --> 00:47:36,500 parallelization. 993 00:47:36,500 --> 00:47:38,790 When we get there, we can see that by adding additional 994 00:47:38,790 --> 00:47:40,050 dependencies in there. 995 00:47:40,050 --> 00:47:42,200 So in here, what are the common subexpressions? 996 00:47:50,813 --> 00:47:55,560 Either you guys are bored, or this slide is way too hard. 997 00:47:55,560 --> 00:47:56,340 You're bored? 998 00:47:56,340 --> 00:47:57,640 AUDIENCE: [INAUDIBLE]. 999 00:47:57,640 --> 00:47:59,000 PROFESSOR: i plus 1, OK, good. 1000 00:47:59,000 --> 00:48:02,020 So there's i plus 1 in here, and I can calculate it once, 1001 00:48:02,020 --> 00:48:05,600 and then I can just do the multiplication of that, and do 1002 00:48:05,600 --> 00:48:06,360 that, and voila. 1003 00:48:06,360 --> 00:48:11,000 It went from two additions and one multiplication to one 1004 00:48:11,000 --> 00:48:13,530 addition and one multiplication. 1005 00:48:13,530 --> 00:48:16,290 OK, next thing is dead code elimination. 1006 00:48:16,290 --> 00:48:19,150 So if you're doing something that nobody's using the value, 1007 00:48:19,150 --> 00:48:20,560 why do you do it?
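Going back to common subexpression elimination for a moment, the operation count quoted above can be seen in a two-line sketch. The function names are hypothetical, and the expression is an assumption about what the slide contained.

```c
/* Before: the same sum is computed twice. */
int square_succ_before(int i)
{
    return (i + 1) * (i + 1);   /* two additions, one multiplication */
}

/* After common subexpression elimination: the sum is computed once and
   kept in a temporary until its second use. */
int square_succ_after(int i)
{
    int t = i + 1;              /* one addition */
    return t * t;               /* one multiplication */
}
```

The temporary `t` is the price: it has to live somewhere between its definition and its last use.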
1008 00:48:20,560 --> 00:48:24,860 And less computation, and maybe you release storage 1009 00:48:24,860 --> 00:48:27,330 because you're not storing these values you're computing, 1010 00:48:27,330 --> 00:48:29,510 and that's really nice. 1011 00:48:29,510 --> 00:48:32,740 And there's not much bad about dead code elimination. 1012 00:48:32,740 --> 00:48:33,860 Dead code is pretty dead. 1013 00:48:33,860 --> 00:48:35,290 You can get rid of it. 1014 00:48:35,290 --> 00:48:36,540 So here, what is the dead code you have? 1015 00:48:41,310 --> 00:48:45,450 I want to keep you at least somewhat engaged, so see if 1016 00:48:45,450 --> 00:48:47,116 you can find my dead code. 1017 00:48:47,116 --> 00:48:47,850 AUDIENCE: y. 1018 00:48:47,850 --> 00:48:48,950 PROFESSOR: y, yeah. 1019 00:48:48,950 --> 00:48:50,120 I got rid of [UNINTELLIGIBLE]. 1020 00:48:50,120 --> 00:48:52,340 Now, I don't need it, I can just get rid of that, and then 1021 00:48:52,340 --> 00:48:54,420 I can even get rid of allocating y. 1022 00:48:54,420 --> 00:48:57,190 So I got rid of both an instruction and some 1023 00:48:57,190 --> 00:49:01,830 memory or allocated register that used to keep that value there. 1024 00:49:01,830 --> 00:49:04,080 Another interesting thing you can do is loop invariant code 1025 00:49:04,080 --> 00:49:06,830 motion because loops are very important. 1026 00:49:06,830 --> 00:49:09,640 Most of the execution time is inside loops, so if you 1027 00:49:09,640 --> 00:49:11,580 can get something out of a loop, that's really good. 1028 00:49:11,580 --> 00:49:15,360 We talked about that previously. 1029 00:49:15,360 --> 00:49:20,140 But you have to worry about, basically, two things. 1030 00:49:20,140 --> 00:49:22,870 One thing is that when you move too many things out of 1031 00:49:22,870 --> 00:49:25,930 the loops, you have to keep all those values in registers, 1032 00:49:25,930 --> 00:49:28,420 so that means you need more registers inside the loop.
1033 00:49:28,420 --> 00:49:31,370 Second thing is when you execute that, you have to make 1034 00:49:31,370 --> 00:49:36,270 sure that it has the same behavior as 1035 00:49:36,270 --> 00:49:37,750 when you run the program. 1036 00:49:37,750 --> 00:49:40,780 How about special cases, where the loop never gets executed? 1037 00:49:45,380 --> 00:49:47,090 First let's look at this. 1038 00:49:47,090 --> 00:49:50,360 What are the loop invariant expressions in here? 1039 00:49:50,360 --> 00:49:51,650 AUDIENCE: [INAUDIBLE]. 1040 00:49:51,650 --> 00:49:52,080 PROFESSOR: Hm? 1041 00:49:52,080 --> 00:49:53,120 AUDIENCE: 4 times [INAUDIBLE]-- 1042 00:49:53,120 --> 00:49:55,190 PROFESSOR: 4 times a divided by b. 1043 00:49:55,190 --> 00:49:58,370 OK, good, I just moved it up there. 1044 00:49:58,370 --> 00:50:03,210 So I did that, but why am I really wrong? 1045 00:50:03,210 --> 00:50:05,040 Why won't the compiler do this? 1046 00:50:07,720 --> 00:50:11,630 Give me a case where this would change the program behavior. 1047 00:50:26,450 --> 00:50:27,700 AUDIENCE: [INAUDIBLE]. 1048 00:50:31,290 --> 00:50:33,800 PROFESSOR: 4 times overflow, yeah, that can happen. 1049 00:50:33,800 --> 00:50:37,890 That's one case, but there's something that can happen-- 1050 00:50:37,890 --> 00:50:40,220 overflow happens in very large numbers, people don't care 1051 00:50:40,220 --> 00:50:41,700 that much, but there's something that can 1052 00:50:41,700 --> 00:50:43,374 happen a lot more. 1053 00:50:43,374 --> 00:50:46,090 AUDIENCE: If b is 0, and then N is less than 0? 1054 00:50:46,090 --> 00:50:47,980 PROFESSOR: Exactly, when b is 0 1055 00:50:47,980 --> 00:50:49,770 and N is less than 0. 1056 00:50:49,770 --> 00:50:54,860 I am going to have a divide by 0 error in here because I am 1057 00:50:54,860 --> 00:50:56,620 going here, dividing by 0. 1058 00:50:56,620 --> 00:50:59,190 That would have never happened because the loop wouldn't have 1059 00:50:59,190 --> 00:51:00,970 gone and executed it.
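The hazard can be made concrete with a sketch. The function names and the loop shape are assumptions modeled on the expression named above. The first version divides only if the loop body runs; the second hoists the invariant but puts a guard in front of the loop so the division still happens only when at least one iteration will execute.

```c
/* Original: 4 * a / b is evaluated inside the loop, so it never runs
   when n < 0, even if b is 0. */
int sum_in_loop(int a, int b, int n)
{
    int x = 0;
    for (int i = 0; i <= n; i++) {
        x += (4 * a / b) * i;
    }
    return x;
}

/* Hoisted with a guard: the invariant is computed once before the loop,
   but only on the path where the loop is known to execute. Hoisting it
   without the guard would divide by zero for b == 0, n < 0. */
int sum_hoisted_guarded(int a, int b, int n)
{
    int x = 0;
    if (n >= 0) {
        int u = 4 * a / b;      /* loop invariant, computed once */
        for (int i = 0; i <= n; i++) {
            x += u * i;
        }
    }
    return x;
}
```

Calling either function with `b == 0` and `n == -1` returns 0 without dividing, which is the behavior an unguarded hoist would break.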
1060 00:51:00,970 --> 00:51:05,090 So normally, when you do things like that in a loop, 1061 00:51:05,090 --> 00:51:08,660 the compiler generates a place called a landing pad, which 1062 00:51:08,660 --> 00:51:13,480 basically is, before you enter the loop, you check whether 1063 00:51:13,480 --> 00:51:15,340 the loop will ever get executed. 1064 00:51:15,340 --> 00:51:18,230 And then go to the landing pad, and then go to the loop. 1065 00:51:18,230 --> 00:51:21,430 So the landing pad will be run only when the loop has at least 1066 00:51:21,430 --> 00:51:24,380 one iteration running, and so you can move all those 1067 00:51:24,380 --> 00:51:25,230 things into the landing pad. 1068 00:51:25,230 --> 00:51:26,960 So here, you can see there's no landing pad. 1069 00:51:26,960 --> 00:51:29,540 The code generated probably would have, and so I did 1070 00:51:29,540 --> 00:51:33,800 something that is, you would see in fact, the optimized 1071 00:51:33,800 --> 00:51:35,570 code I did didn't do that. 1072 00:51:35,570 --> 00:51:38,030 So GCC minus [UNINTELLIGIBLE] is smart 1073 00:51:38,030 --> 00:51:40,320 enough not to do this. 1074 00:51:40,320 --> 00:51:43,360 So then there's another type, strength reduction, which 1075 00:51:43,360 --> 00:51:50,170 is saying if I have something like a times i, what I can do 1076 00:51:50,170 --> 00:51:55,290 is just, instead of doing a times i, I can basically make 1077 00:51:55,290 --> 00:51:58,090 the first iteration initialize it, and every time you can 1078 00:51:58,090 --> 00:52:00,280 update the previous value. 1079 00:52:00,280 --> 00:52:03,330 OK, so a times i, the first time it's 0 and next time 1080 00:52:03,330 --> 00:52:06,670 it'll be t plus a, so I can keep updating that. 1081 00:52:06,670 --> 00:52:08,000 OK, so this is really good. 1082 00:52:08,000 --> 00:52:10,710 I have less computation because now, instead of a 1083 00:52:10,710 --> 00:52:12,670 multiply, I just made it add.
1084 00:52:12,670 --> 00:52:15,160 But I have a lot of problems that can happen here. 1085 00:52:15,160 --> 00:52:18,200 First of all, I have, now, this one. 1086 00:52:18,200 --> 00:52:20,220 I didn't have to keep this value anywhere, only 1087 00:52:20,220 --> 00:52:21,450 when I needed it. 1088 00:52:21,450 --> 00:52:24,730 In here, this value has to be live throughout the entire 1089 00:52:24,730 --> 00:52:27,900 loop because I keep updating that value, so I created 1090 00:52:27,900 --> 00:52:30,070 another need for a register. 1091 00:52:30,070 --> 00:52:31,610 Before now, I only needed it at that point. 1092 00:52:31,610 --> 00:52:33,650 I could've [UNINTELLIGIBLE], rarely used it, but now it 1093 00:52:33,650 --> 00:52:35,930 has to be there throughout the program 1094 00:52:35,930 --> 00:52:38,290 I created in there. 1095 00:52:38,290 --> 00:52:40,660 Also what I created is what they call a loop-carried 1096 00:52:40,660 --> 00:52:41,260 dependence. 1097 00:52:41,260 --> 00:52:43,490 Every time you run an iteration, you use the previous 1098 00:52:43,490 --> 00:52:44,710 iteration's value. 1099 00:52:44,710 --> 00:52:47,840 When we go into parallelizing loops, you 1100 00:52:47,840 --> 00:52:50,200 suddenly realize that means I can't run them in parallel, so 1101 00:52:50,200 --> 00:52:52,760 this creates a huge problem in parallelization. 1102 00:52:52,760 --> 00:52:56,770 So you do [UNINTELLIGIBLE], strength increase, when you go 1103 00:52:56,770 --> 00:52:57,280 to parallelize. 1104 00:52:57,280 --> 00:52:59,450 And you can undo these things.
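Both the gain and the two costs are visible in a short sketch. The names `u`, `v` and the function names are hypothetical; the loop shape is an assumption. The first version multiplies on every iteration; the second keeps a running value that is updated by addition.

```c
/* Before: one multiplication per iteration, and no value carried from
   one iteration to the next except the sum itself. */
long sum_multiples_mul(int u, int n)
{
    long x = 0;
    for (int i = 0; i <= n; i++) {
        x += (long)u * i;
    }
    return x;
}

/* After strength reduction: v always equals u * i, maintained by
   addition. v must stay live for the whole loop (one more register),
   and each iteration needs the v produced by the previous one
   (a loop-carried dependence). */
long sum_multiples_add(int u, int n)
{
    long x = 0;
    long v = 0;                 /* u * 0 */
    for (int i = 0; i <= n; i++) {
        x += v;
        v += u;                 /* now u * (i + 1) */
    }
    return x;
}
```

In the first version any iteration could be computed independently from `i` alone; in the second, iteration `i` cannot start until iteration `i - 1` has updated `v`.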
1105 00:52:59,450 --> 00:53:02,890 So in here, one thing you can do is you look at something 1106 00:53:02,890 --> 00:53:04,780 like u times i and say wait a minute, I don't have to 1107 00:53:04,780 --> 00:53:07,170 multiply by i because it [UNINTELLIGIBLE] 1108 00:53:07,170 --> 00:53:12,220 0 like this, and I can just allocate a value v and keep 1109 00:53:12,220 --> 00:53:15,810 updating it by u, and then I did that, allocated a variable in 1110 00:53:15,810 --> 00:53:17,210 here, allocated 0. 1111 00:53:17,210 --> 00:53:20,960 And [UNINTELLIGIBLE] this, I just basically put v times 0. 1112 00:53:20,960 --> 00:53:22,240 You see that? 1113 00:53:22,240 --> 00:53:24,835 I just basically got rid of a multiplication and converted it 1114 00:53:24,835 --> 00:53:28,990 into addition, but I paid some cost by, I need now this 1115 00:53:28,990 --> 00:53:31,415 additional register that is live all 1116 00:53:31,415 --> 00:53:32,390 throughout the entire thing. 1117 00:53:32,390 --> 00:53:33,320 [UNINTELLIGIBLE] 1118 00:53:33,320 --> 00:53:34,610 I just calculated that expression. 1119 00:53:37,660 --> 00:53:41,650 And the big thing where you get a lot of performance is register 1120 00:53:41,650 --> 00:53:45,030 allocation, so most processors have very few registers. 1121 00:53:45,030 --> 00:53:47,160 In fact, one big change when you went from 1122 00:53:47,160 --> 00:53:48,930 [UNINTELLIGIBLE] 1123 00:53:48,930 --> 00:53:50,190 was to get additional registers. 1124 00:53:50,190 --> 00:53:51,440 Registers are very important. 1125 00:53:55,740 --> 00:53:58,340 I will go through register allocation a little bit. 1126 00:53:58,340 --> 00:54:01,870 So what happens is when you have a program, you have 1127 00:54:01,870 --> 00:54:02,920 control that goes like this. 1128 00:54:02,920 --> 00:54:05,300 So this control is going, executing something that 1129 00:54:05,300 --> 00:54:08,840 defines this variable x, defines variable y.
1130 00:54:08,840 --> 00:54:12,210 And here, you use variable x and variable y, and there are 1131 00:54:12,210 --> 00:54:13,790 different paths the program can go through. 1132 00:54:13,790 --> 00:54:16,110 There are two paths can merge in here, here, you 1133 00:54:16,110 --> 00:54:17,730 can expand in here. 1134 00:54:17,730 --> 00:54:21,220 So this is kind of the flow of the program in a small part. 1135 00:54:21,220 --> 00:54:23,540 So what you can say is this definition [UNINTELLIGIBLE] 1136 00:54:23,540 --> 00:54:27,610 here, so this value, this line in between here-- 1137 00:54:27,610 --> 00:54:28,610 because you can't get rid of it. 1138 00:54:28,610 --> 00:54:30,420 When you decided you had to keep it somewhere because 1139 00:54:30,420 --> 00:54:32,310 somebody's going to [UNINTELLIGIBLE]. 1140 00:54:32,310 --> 00:54:35,220 And this definition is used here, so when you decided you 1141 00:54:35,220 --> 00:54:37,650 had to be [UNINTELLIGIBLE] in here. 1142 00:54:37,650 --> 00:54:39,290 When you [UNINTELLIGIBLE] 1143 00:54:39,290 --> 00:54:40,060 is only used here. 1144 00:54:40,060 --> 00:54:43,670 Nobody uses here, so this has to be [UNINTELLIGIBLE]. 1145 00:54:43,670 --> 00:54:46,670 Interesting thing about x is there are two definitions of x 1146 00:54:46,670 --> 00:54:49,590 that might be used here, and this definition might be used 1147 00:54:49,590 --> 00:54:50,360 here or here. 1148 00:54:50,360 --> 00:54:56,230 So you put all this into what they call a one web because 1149 00:54:56,230 --> 00:54:57,730 these two definitions might-- 1150 00:54:57,730 --> 00:54:59,100 either one of them will be used here. 1151 00:54:59,100 --> 00:55:02,070 This definition will be used, either one, over here, so this 1152 00:55:02,070 --> 00:55:02,980 value has to be [UNINTELLIGIBLE] 1153 00:55:02,980 --> 00:55:05,340 in here, kept somewhere. 
1154 00:55:05,340 --> 00:55:10,430 So then what we say is we give names to these, 1155 00:55:10,430 --> 00:55:12,610 so this is s1, s2. 1156 00:55:12,610 --> 00:55:14,100 Somebody has to keep this value. 1157 00:55:14,100 --> 00:55:15,820 s2 keeps this value, s3 keeps this value, 1158 00:55:15,820 --> 00:55:17,390 s4 keeps this value. 1159 00:55:17,390 --> 00:55:21,040 The interesting thing is how many registers you need to 1160 00:55:21,040 --> 00:55:23,110 keep all those values. 1161 00:55:23,110 --> 00:55:25,700 That's the entire thing of register allocation. 1162 00:55:25,700 --> 00:55:30,430 So what you do is this really cute mapping of this into a nice 1163 00:55:30,430 --> 00:55:31,900 theoretical problem. 1164 00:55:31,900 --> 00:55:36,800 So what you can say is each of these regions, we make it 1165 00:55:36,800 --> 00:55:38,050 a vertex of a graph. 1166 00:55:40,680 --> 00:55:44,450 If these regions overlap, then we get an edge. 1167 00:55:44,450 --> 00:55:47,620 s1 and s2 overlap. 1168 00:55:47,620 --> 00:55:51,250 That means you can't use the same register to keep s1 1169 00:55:51,250 --> 00:55:56,750 and s2 because before s1 is finished being used, s2 has to be 1170 00:55:56,750 --> 00:55:57,905 free of that. 1171 00:55:57,905 --> 00:56:02,400 OK, there's overlap in here, so we've created an edge in here. 1172 00:56:02,400 --> 00:56:04,350 OK, s2 and s3. 1173 00:56:04,350 --> 00:56:07,530 So s2 and s3 overlap here because at this point, both 1174 00:56:07,530 --> 00:56:09,520 values s2 and s3 have to be kept. 1175 00:56:09,520 --> 00:56:12,270 So I create an edge in here. 1176 00:56:12,270 --> 00:56:13,550 OK, so I create an edge. 1177 00:56:13,550 --> 00:56:15,370 Every time, I say those two values 1178 00:56:15,370 --> 00:56:17,220 need separate registers. 1179 00:56:17,220 --> 00:56:19,020 I can't keep the same register. 1180 00:56:19,020 --> 00:56:20,800 And of course, s3 and s4. 1181 00:56:20,800 --> 00:56:22,210 s3 is here.
1182 00:56:22,210 --> 00:56:26,150 s4 can be in the same register. 1183 00:56:26,150 --> 00:56:27,320 So there's no edge here. 1184 00:56:27,320 --> 00:56:29,520 Also, s1 and s4 can be in the same register. 1185 00:56:29,520 --> 00:56:33,260 Also, s2 and s4 can be in the same register because they are not 1186 00:56:33,260 --> 00:56:35,430 live at the same time. 1187 00:56:35,430 --> 00:56:38,950 They are live at a different time of the program execution. 1188 00:56:38,950 --> 00:56:43,900 Now, what you can do is, you have a graph, you have edges, 1189 00:56:43,900 --> 00:56:45,750 and there's this very famous problem called 1190 00:56:45,750 --> 00:56:46,775 the graph coloring problem. 1191 00:56:46,775 --> 00:56:49,520 How many have heard of the graph coloring problem? 1192 00:56:49,520 --> 00:56:50,380 OK, good. 1193 00:56:50,380 --> 00:56:53,420 So what happens is now we can figure out how many colors we 1194 00:56:53,420 --> 00:56:56,360 need to color this graph, and the number of colors 1195 00:56:56,360 --> 00:56:58,220 is the number of registers you need. 1196 00:56:58,220 --> 00:57:04,710 So if you have a graph like this with no edges, you can 1197 00:57:04,710 --> 00:57:07,430 color it with one color. 1198 00:57:07,430 --> 00:57:09,816 How many colors for this one? 1199 00:57:09,816 --> 00:57:11,500 Two colors. 1200 00:57:11,500 --> 00:57:12,750 How many colors for this one? 1201 00:57:15,280 --> 00:57:16,040 People said two colors. 1202 00:57:16,040 --> 00:57:17,440 Yes, you can color it with two colors. 1203 00:57:17,440 --> 00:57:18,690 How many colors for this one? 1204 00:57:21,780 --> 00:57:23,700 AUDIENCE: [INAUDIBLE] 1205 00:57:23,700 --> 00:57:25,720 PROFESSOR: It's three-color [UNINTELLIGIBLE]. 1206 00:57:25,720 --> 00:57:27,940 So there's all these algorithms [UNINTELLIGIBLE] 1207 00:57:27,940 --> 00:57:28,820 and say no. 1208 00:57:28,820 --> 00:57:32,090 You can see by coloring this how many registers I need.
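The mapping from webs to colors can be sketched in a few lines. Two things here are assumptions: the lecture names the edges s1-s2 and s2-s3 and says s4 conflicts with nothing, but does not mention the pair s1-s3, so this sketch treats it as non-interfering; and the coloring method shown is a plain greedy pass, chosen for brevity, not necessarily what any particular compiler uses. Greedy coloring does not always find the minimum number of colors on larger graphs.

```c
#define NUM_WEBS 4

/* Interference graph for webs s1..s4 (indices 0..3). A 1 means the two
   webs are live at the same time and need different registers.
   The s1-s3 entry is an assumption; the lecture does not mention it. */
static const int interferes[NUM_WEBS][NUM_WEBS] = {
    /*        s1 s2 s3 s4 */
    /* s1 */ { 0, 1, 0, 0 },
    /* s2 */ { 1, 0, 1, 0 },
    /* s3 */ { 0, 1, 0, 0 },
    /* s4 */ { 0, 0, 0, 0 },
};

/* Greedy coloring: visit the webs in order and give each one the
   lowest-numbered register not used by an already-colored neighbor.
   Fills reg[] and returns how many distinct registers were used. */
int assign_registers(int reg[NUM_WEBS])
{
    int used_count = 0;
    for (int v = 0; v < NUM_WEBS; v++) {
        int taken[NUM_WEBS] = { 0 };
        for (int u = 0; u < v; u++) {
            if (interferes[v][u]) {
                taken[reg[u]] = 1;
            }
        }
        int r = 0;
        while (taken[r]) {
            r++;
        }
        reg[v] = r;
        if (r + 1 > used_count) {
            used_count = r + 1;
        }
    }
    return used_count;
}
```

On this graph the pass uses two registers: s1, s3 and s4 share one, and s2 gets the other.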
1209 00:57:32,090 --> 00:57:36,220 And the interesting is, if you need more colors than the 1210 00:57:36,220 --> 00:57:39,040 register you have, that means you can't register allocate, 1211 00:57:39,040 --> 00:57:41,380 and at that point, you need too many things to keep. 1212 00:57:41,380 --> 00:57:42,740 You don't have that many registers and that 1213 00:57:42,740 --> 00:57:44,790 [UNINTELLIGIBLE]. 1214 00:57:44,790 --> 00:57:47,790 That means you take edge and say, ah-hah, I can't keep both 1215 00:57:47,790 --> 00:57:48,940 of these guys in the same. 1216 00:57:48,940 --> 00:57:53,355 I will take some vertex out and say this vertex can't be 1217 00:57:53,355 --> 00:57:55,340 in there because I can't put it into register, and 1218 00:57:55,340 --> 00:57:56,500 I spill this out. 1219 00:57:56,500 --> 00:57:57,550 And you can re-color the graphs. 1220 00:57:57,550 --> 00:57:59,960 You can spill it, and of course, spilling is costly 1221 00:57:59,960 --> 00:58:02,250 because now [UNINTELLIGIBLE] value in the register, it's in 1222 00:58:02,250 --> 00:58:04,570 the memory, so every time you need it, you had to bring it 1223 00:58:04,570 --> 00:58:08,110 back, send it back, so it's going to be expensive. 1224 00:58:08,110 --> 00:58:09,890 The nice thing is to see how much you 1225 00:58:09,890 --> 00:58:12,800 can keep in the register. 1226 00:58:12,800 --> 00:58:15,230 So I have enough registers for this program, so I found 1227 00:58:15,230 --> 00:58:16,510 registers for all these things instead of 1228 00:58:16,510 --> 00:58:18,260 putting it in memory. 1229 00:58:18,260 --> 00:58:20,120 And now, this is [UNINTELLIGIBLE] 1230 00:58:20,120 --> 00:58:24,080 register allocation in a pseudo C code, so this is the 1231 00:58:24,080 --> 00:58:28,050 kind of optimized code, and this is the generated-- 1232 00:58:28,050 --> 00:58:31,000 Basically, all four of the original program generated. 1233 00:58:31,000 --> 00:58:34,570 But in here, I move this one up. 
1234 00:58:34,570 --> 00:58:38,420 But in this one, actually, the division didn't get moved up, 1235 00:58:38,420 --> 00:58:41,740 so [UNINTELLIGIBLE] actually inside the loop because it's 1236 00:58:41,740 --> 00:58:43,090 realized you can't do that. 1237 00:58:43,090 --> 00:58:46,190 But interestingly moved the multiplication out, so that it 1238 00:58:46,190 --> 00:58:48,180 didn't care about the overflow. 1239 00:58:48,180 --> 00:58:52,110 It says, hey, overflow, it can have an overflow, but it will 1240 00:58:52,110 --> 00:58:56,360 worry more about divide by 0. 1241 00:58:56,360 --> 00:58:57,590 OK? 1242 00:58:57,590 --> 00:58:58,840 Any questions so far? 1243 00:59:04,210 --> 00:59:07,010 So here's the optimized code, and if you run it, there's 1244 00:59:07,010 --> 00:59:08,580 seconds versus 54 seconds. 1245 00:59:08,580 --> 00:59:10,740 Just GCC [UNINTELLIGIBLE] 1246 00:59:10,740 --> 00:59:18,370 0, GCC os, so it'll produce very compact optimized code. 1247 00:59:18,370 --> 00:59:20,880 So the key thing is what's [UNINTELLIGIBLE] these 1248 00:59:20,880 --> 00:59:22,990 optimizations. 1249 00:59:22,990 --> 00:59:27,060 The key thing is you have to guarantee, when you optimize, 1250 00:59:27,060 --> 00:59:30,460 all that these programs [UNINTELLIGIBLE] 1251 00:59:30,460 --> 00:59:33,802 from unoptimized, optimized, all the valid input, all the 1252 00:59:33,802 --> 00:59:36,310 valid execution, and all valid architecture that you're 1253 00:59:36,310 --> 00:59:37,970 supposed to run, you can't do the same thing. 1254 00:59:37,970 --> 00:59:40,340 Otherwise, it's not a good optimizer if it does different 1255 00:59:40,340 --> 00:59:41,380 things to code. 1256 00:59:41,380 --> 00:59:43,985 So there are a lot of things that means you have to be very 1257 00:59:43,985 --> 00:59:44,960 conservative in [UNINTELLIGIBLE] 1258 00:59:44,960 --> 00:59:45,600 cases. 
1259 00:59:45,600 --> 00:59:49,040 So you have to understand both control-flow and data 1260 00:59:49,040 --> 00:59:52,630 accesses, and make sure that you understand them, and if 1261 00:59:52,630 --> 00:59:55,280 any of them, the compile-time analysis cannot understand, 1262 00:59:55,280 --> 00:59:59,170 the compiler gives up very fast. 1263 00:59:59,170 --> 01:00:03,790 So the thing is, most of the time if that information is 1264 01:00:03,790 --> 01:00:07,300 not available, compilers reduce the scope of the region 1265 01:00:07,300 --> 01:00:07,570 [UNINTELLIGIBLE] 1266 01:00:07,570 --> 01:00:08,260 the transformation. 1267 01:00:08,260 --> 01:00:10,100 So we have this point, I don't know beyond that. 1268 01:00:10,100 --> 01:00:12,820 I can only do a small amount of transformations here. 1269 01:00:12,820 --> 01:00:15,570 Or reduce the aggressiveness of transformations, and 1270 01:00:15,570 --> 01:00:18,610 sometimes just completely leave code alone as it is 1271 01:00:18,610 --> 01:00:22,850 because it couldn't, even the things you know, no sane 1272 01:00:22,850 --> 01:00:24,290 programmer would do, and of course, your 1273 01:00:24,290 --> 01:00:25,560 code will never do. 1274 01:00:25,560 --> 01:00:27,980 The compiler assumes, if it is valid C 1275 01:00:27,980 --> 01:00:30,270 semantics, it might happen. 1276 01:00:30,270 --> 01:00:32,940 Even though some of them look really crazy. 1277 01:00:32,940 --> 01:00:35,450 If it is a valid possible way of doing it, the compiler has to 1278 01:00:35,450 --> 01:00:37,240 worry about it, and not do that. 1279 01:00:37,240 --> 01:00:39,070 So it has to be careful of that. 1280 01:00:39,070 --> 01:00:41,630 So first of all, control-flow. 1281 01:00:41,630 --> 01:00:44,885 That means it doesn't work on possible paths of the program 1282 01:00:44,885 --> 01:00:46,580 when you execute that.
1283 01:00:46,580 --> 01:00:49,720 And the way you look at this, you can have call graphs 1284 01:00:49,720 --> 01:00:51,260 in the high-level [UNINTELLIGIBLE] the call in 1285 01:00:51,260 --> 01:00:54,640 here, and control-flow graphs within a method or function, 1286 01:00:54,640 --> 01:00:57,140 how control goes from. 1287 01:00:57,140 --> 01:01:00,280 And what makes it hard for the compiler to analyze this? 1288 01:01:00,280 --> 01:01:01,120 Bunch of things [UNINTELLIGIBLE] 1289 01:01:01,120 --> 01:01:02,300 function pointers. 1290 01:01:02,300 --> 01:01:04,100 You probably haven't done function pointers, but if you 1291 01:01:04,100 --> 01:01:06,690 have function pointers, the compiler says, I don't 1292 01:01:06,690 --> 01:01:07,380 know where it's going. 1293 01:01:07,380 --> 01:01:08,640 I have to be very careful. 1294 01:01:08,640 --> 01:01:10,270 Indirect branches, 1295 01:01:10,270 --> 01:01:13,480 so I keep addresses somewhere and branch to that, so that I 1296 01:01:13,480 --> 01:01:15,010 don't know where it's going. 1297 01:01:15,010 --> 01:01:17,270 Something like computed goto [UNINTELLIGIBLE]. 1298 01:01:17,270 --> 01:01:19,240 Large switch statement. 1299 01:01:19,240 --> 01:01:19,910 It's just spaghetti code. 1300 01:01:19,910 --> 01:01:22,290 We have no idea where it would end up at compile time, and 1301 01:01:22,290 --> 01:01:24,190 we can't get anywhere in this switch statement. 1302 01:01:24,190 --> 01:01:26,190 Either [UNINTELLIGIBLE] you might know some order of going 1303 01:01:26,190 --> 01:01:28,070 through that, it doesn't work. 1304 01:01:28,070 --> 01:01:29,740 If you have loops with [UNINTELLIGIBLE] 1305 01:01:29,740 --> 01:01:32,410 breaks and very complex things, the compiler sometimes 1306 01:01:32,410 --> 01:01:33,710 will give up. 1307 01:01:33,710 --> 01:01:36,572 When the loop bounds are unknown, you'd assume it could 1308 01:01:36,572 --> 01:01:37,410 be anything.
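The function-pointer case from a moment ago can be sketched as follows. All names are hypothetical. In the direct call the callee is fixed in the source, so the call graph has exactly one edge; in the indirect call the callee is whatever address arrives at run time, so without further information the analysis has to allow for any function with a matching type.

```c
static int add_one(int v) { return v + 1; }
static int twice(int v)   { return v * 2; }

/* Direct call: the callee is known at compile time, so it can be
   analyzed, inlined, and optimized together with the caller. */
int apply_direct(int v)
{
    return add_one(v);
}

/* Indirect call: the callee depends on a run-time value. Seen in
   isolation, this function gives the compiler no way to know which
   body will run. */
int apply_indirect(int (*fn)(int), int v)
{
    return fn(v);
}
```

A compiler that can see every caller of `apply_indirect` may still be able to resolve the pointer; the difficulty is when it cannot.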
1309 01:01:37,410 --> 01:01:40,210 Whereas when loop bounds are known, as you saw in the first 1310 01:01:40,210 --> 01:01:43,900 set of examples, you can take advantage a lot more, and you 1311 01:01:43,900 --> 01:01:47,390 can do a lot more aggressive things, or not care about 1312 01:01:47,390 --> 01:01:48,690 some cases because I know the bounds. 1313 01:01:48,690 --> 01:01:50,390 But with unknown loop bounds, you have to be a lot 1314 01:01:50,390 --> 01:01:53,700 more careful of that. 1315 01:01:53,700 --> 01:01:56,000 And conditions where the branch is not analyzable. 1316 01:01:56,000 --> 01:01:57,770 So if you have a branch condition, if I don't know 1317 01:01:57,770 --> 01:02:00,390 what's happening in the branch, I might not be able to 1318 01:02:00,390 --> 01:02:03,680 take advantage of it or figure out how to do the branch well. 1319 01:02:03,680 --> 01:02:07,270 So those are the things that I have to worry about. 1320 01:02:07,270 --> 01:02:10,850 The other thing is data accesses, so that means who 1321 01:02:10,850 --> 01:02:12,570 else can read and write the data. 1322 01:02:12,570 --> 01:02:17,700 So I am touching the data item, and I need to know that, 1323 01:02:17,700 --> 01:02:20,620 between the two points where I am looking at the data, nobody 1324 01:02:20,620 --> 01:02:23,920 else goes and mucks with my data, or uses my data. 1325 01:02:23,920 --> 01:02:26,780 Because when I look at the data, [UNINTELLIGIBLE] 1326 01:02:26,780 --> 01:02:29,460 something, I want to make sure that's the only way that data 1327 01:02:29,460 --> 01:02:31,930 can be accessed because, as you know, most of the things 1328 01:02:31,930 --> 01:02:32,780 are in memory. 1329 01:02:32,780 --> 01:02:35,400 So normally what the compiler [UNINTELLIGIBLE] is called 1330 01:02:35,400 --> 01:02:39,120 def-use chains, so definition to use, so we say that the thing that is 1331 01:02:39,120 --> 01:02:42,540 defined here is going to get used here, and nothing comes 1332 01:02:42,540 --> 01:02:43,750 in between that.
1333 01:02:43,750 --> 01:02:46,030 And that information is that's how the compiler 1334 01:02:46,030 --> 01:02:46,430 [UNINTELLIGIBLE]. 1335 01:02:46,430 --> 01:02:50,130 That's something we call dependence vectors. 1336 01:02:50,130 --> 01:02:53,680 We might talk a little bit about that when you go into 1337 01:02:53,680 --> 01:02:55,980 parallel execution. 1338 01:02:55,980 --> 01:02:59,980 So what makes it very hard for compiler to analyze this? 1339 01:02:59,980 --> 01:03:04,030 For example, address taken variables, so if you write and 1340 01:03:04,030 --> 01:03:06,260 hack with C, you can say, OK, there's a variable. 1341 01:03:06,260 --> 01:03:08,540 There's a variable here, I'm taking the address of that. 1342 01:03:08,540 --> 01:03:12,250 Suddenly, that means somebody else has the 1343 01:03:12,250 --> 01:03:13,390 address to the variable. 1344 01:03:13,390 --> 01:03:16,180 That means anybody else can suddenly jump in and overwrite 1345 01:03:16,180 --> 01:03:18,250 you, and there's a lot of possibilities of doing that. 1346 01:03:18,250 --> 01:03:20,720 And suddenly compiler says wait a minute, that variable, 1347 01:03:20,720 --> 01:03:24,390 even though I assigned the variable here, I'm using it 1348 01:03:24,390 --> 01:03:25,790 here, in between. 1349 01:03:25,790 --> 01:03:27,810 Somebody else might touch it even though it might not use 1350 01:03:27,810 --> 01:03:31,080 the same name because somebody has that address to that. 1351 01:03:31,080 --> 01:03:33,080 OK, so that's a hard thing. 1352 01:03:33,080 --> 01:03:36,730 Global variables, sometimes, because between function, I 1353 01:03:36,730 --> 01:03:37,080 don't know. 1354 01:03:37,080 --> 01:03:40,446 Some other function might go and change it. 1355 01:03:40,446 --> 01:03:41,790 Parameters are really hard. 
1356 01:03:41,790 --> 01:03:43,890 Like for example, remember when we had a program, and we 1357 01:03:43,890 --> 01:03:46,950 had something like copying same array to the same, even 1358 01:03:46,950 --> 01:03:50,890 though parameters say X and Y. I might send the same or 1359 01:03:50,890 --> 01:03:53,860 overlapping regions into two different parameters even 1360 01:03:53,860 --> 01:03:56,840 though it looks like two different names. 1361 01:03:56,840 --> 01:03:57,890 They're not two different things. 1362 01:03:57,890 --> 01:04:00,000 They're actually overlapping at some point. 1363 01:04:00,000 --> 01:04:02,990 And so you had to assume, even if you have two different 1364 01:04:02,990 --> 01:04:07,340 parameters point into memory, they might be the same thing. 1365 01:04:07,340 --> 01:04:09,100 And that's the worst case, even though a lot of times, 1366 01:04:09,100 --> 01:04:09,840 nobody does that. 1367 01:04:09,840 --> 01:04:11,400 Nobody gives the same things multiple 1368 01:04:11,400 --> 01:04:13,370 names, but it's possible. 1369 01:04:13,370 --> 01:04:15,500 If it is possible, compilers deal with it. 1370 01:04:15,500 --> 01:04:18,700 Either it has to generate code to test all these cases, is it 1371 01:04:18,700 --> 01:04:20,480 overlapping, if not, do something. 1372 01:04:20,480 --> 01:04:23,470 If it is overlapping, do something slower, like the 1373 01:04:23,470 --> 01:04:25,360 code we showed when you are vectorizing. 1374 01:04:25,360 --> 01:04:32,640 You treat it like this huge number of different cases, but 1375 01:04:32,640 --> 01:04:34,930 unless you do something like that, you can't optimize, and 1376 01:04:34,930 --> 01:04:37,425 complex programs, it's very hard to do that. 
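The overlap test described above can be sketched in C. This is only an illustration under stated assumptions, not the code a particular compiler emits: `add_one` and `ranges_overlap` are hypothetical names, and comparing addresses as integers is assumed to be meaningful, as it is on common flat-memory platforms. When the two parameters do not overlap, the loop can run through `restrict`-qualified pointers, which tells the compiler it may reorder or vectorize. When they do overlap, the plain loop keeps the element-by-element order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if the n-element ranges starting at a and b
   share any bytes. Comparing addresses as integers is an assumption that
   holds on common flat-memory platforms. */
static int ranges_overlap(const float *a, const float *b, size_t n)
{
    uintptr_t lo_a = (uintptr_t)a, lo_b = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)(n * sizeof(float));
    return lo_a < lo_b + bytes && lo_b < lo_a + bytes;
}

/* dst[i] = src[i] + 1 for i in [0, n), with the same result as running the
   loop strictly in order, even when dst and src overlap. */
void add_one(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    if (!ranges_overlap(dst, src, n)) {
        /* Fast path: the ranges are disjoint, so restrict is a true promise
           and the compiler is free to vectorize this loop. */
        float *restrict d = dst;
        const float *restrict s = src;
        for (size_t i = 0; i < n; i++)
            d[i] = s[i] + 1.0f;
    } else {
        /* Slow path: a store to dst[i] may change a later src[j], so the
           iterations must happen one at a time, in order. */
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] + 1.0f;
    }
}
```

Calling `add_one(buf + 1, buf, 4)` on a zeroed five-element buffer takes the slow path, and each element ends up depending on the one just written. That is exactly the case the compiler has to assume is possible when it sees two pointer parameters.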
1377 01:04:37,425 --> 01:04:42,750 A lot of times, pointers create issues in here because 1378 01:04:42,750 --> 01:04:45,640 the problem with pointers is what you call pointer 1379 01:04:45,640 --> 01:04:48,150 aliasing, because with pointers, you can add any value to a 1380 01:04:48,150 --> 01:04:51,220 pointer, and you have no idea where it points if you add a very large value. 1381 01:04:51,220 --> 01:04:54,750 It can be anywhere in memory because if you have a pointer, 1382 01:04:54,750 --> 01:04:56,890 you have a point in the memory, and you can add 1383 01:04:56,890 --> 01:04:58,790 anything, subtract anything. 1384 01:04:58,790 --> 01:05:02,010 The world is yours, and C gives you this ability to go 1385 01:05:02,010 --> 01:05:05,850 all over the world and kind of map the world, and some 1386 01:05:05,850 --> 01:05:07,230 programs do that. 1387 01:05:07,230 --> 01:05:08,840 And so the compiler says, oh, it's a pointer. 1388 01:05:08,840 --> 01:05:10,000 I don't know where it is. 1389 01:05:10,000 --> 01:05:13,110 I just have to leave it alone because some guy, probably 1390 01:05:13,110 --> 01:05:16,140 0.001% of the world's programmers, will do something 1391 01:05:16,140 --> 01:05:19,410 crazy, and everybody has to pay the price. 1392 01:05:19,410 --> 01:05:21,730 So this is what makes programming hard. 1393 01:05:21,730 --> 01:05:23,590 And the final thing is there's a thing called 1394 01:05:23,590 --> 01:05:25,570 [UNINTELLIGIBLE] types. 1395 01:05:25,570 --> 01:05:27,920 When you go to parallel programming you realize, 1396 01:05:27,920 --> 01:05:32,110 because normally, normal 1397 01:05:32,110 --> 01:05:33,792 values are in memory. 1398 01:05:33,792 --> 01:05:37,250 The compiler can [UNINTELLIGIBLE] the value into a register and 1399 01:05:37,250 --> 01:05:39,550 keep operating in the register, and at some point, 1400 01:05:39,550 --> 01:05:41,330 put it back to memory.
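The register idea just described, and the way a possibly aliased pointer gets in its way, can be sketched in C. The function names `sum_through_pointer` and `sum_in_local` are hypothetical, and what a given compiler actually emits will vary. The point is only that the two functions agree when `total` points outside `v` and disagree when it points into `v`, so the compiler cannot silently turn the first into the second.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulates directly through the pointer. Because total might point into
   v, each update is a visible store that later loads of v[i] may observe. */
void sum_through_pointer(int *total, const int *v, size_t n)
{
    assert(total != NULL && v != NULL);
    *total = 0;
    for (size_t i = 0; i < n; i++)
        *total += v[i];
}

/* Keeps the running sum in a local, which can live in a register, and
   writes it back to memory once at the end. */
void sum_in_local(int *total, const int *v, size_t n)
{
    assert(total != NULL && v != NULL);
    int acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    *total = acc;
}
```

With `v = {1, 2, 3, 4}` and a separate `total`, both give 10. With `total` pointing at `v[0]`, the first version zeroes `v[0]` before summing and ends with 9, while the second still gives 10. Writing the local-accumulator form by hand is one way for the programmer to tell the compiler that the aliasing case does not matter.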
1401 01:05:41,330 --> 01:05:43,770 But if you're running a parallel program, somebody 1402 01:05:43,770 --> 01:05:47,100 else might want to look at that value, and if it is in a 1403 01:05:47,100 --> 01:05:48,910 register, you don't have that value in the right place. 1404 01:05:48,910 --> 01:05:51,850 It's somewhere else, so they get a stale copy because you 1405 01:05:51,850 --> 01:05:52,420 have moved it. 1406 01:05:52,420 --> 01:05:54,855 What I'm trying to say is, look, you have to always keep 1407 01:05:54,855 --> 01:05:55,320 it in memory. 1408 01:05:55,320 --> 01:05:56,460 You can't take it out. 1409 01:05:56,460 --> 01:05:59,600 You can modify it, but you can't move it somewhere else, 1410 01:05:59,600 --> 01:06:03,970 to a faster place, to do things to it because somebody else 1411 01:06:03,970 --> 01:06:05,250 might be looking at it. 1412 01:06:05,250 --> 01:06:07,320 And so what that means is compilers throw up their hands 1413 01:06:07,320 --> 01:06:08,570 and say, look, I can't do anything. 1414 01:06:11,970 --> 01:06:14,390 So we are a little bit early. 1415 01:06:14,390 --> 01:06:17,580 I have yet another huge session in here at-- 1416 01:06:17,580 --> 01:06:19,360 OK, we have to go through this thing. 1417 01:06:19,360 --> 01:06:21,510 Good. 1418 01:06:21,510 --> 01:06:28,510 I think now we are going to go ahead and see how you guys did 1419 01:06:28,510 --> 01:06:31,080 in the class exam. 1420 01:06:31,080 --> 01:06:32,360 OK. 1421 01:06:32,360 --> 01:06:34,000 And I'm seeing it for the first time, and it looks 1422 01:06:34,000 --> 01:06:34,890 really nice. 1423 01:06:34,890 --> 01:06:36,900 Where do you plug this in? 1424 01:06:36,900 --> 01:06:38,150 Where do you plug this in? 1425 01:06:51,890 --> 01:06:58,440 OK, so here is the distribution in there.
1426 01:06:58,440 --> 01:07:02,190 This was not an easy exam, and in fact, we compared with how the 1427 01:07:02,190 --> 01:07:04,990 class did last year, and you guys have done a lot better 1428 01:07:04,990 --> 01:07:07,260 than, I think, the first exam last year. 1429 01:07:07,260 --> 01:07:15,080 So basically, we have a median of about 70, somewhere here, and 1430 01:07:15,080 --> 01:07:20,060 a nice tight grouping in here, which is really good. 1431 01:07:20,060 --> 01:07:25,840 And so what we have is, we have exams back. 1432 01:07:25,840 --> 01:07:26,946 Take a look. 1433 01:07:26,946 --> 01:07:30,792 And I think-- 1434 01:07:30,792 --> 01:07:32,760 GUEST SPEAKER: I'd like to make one comment about 1435 01:07:32,760 --> 01:07:33,990 [INAUDIBLE]. 1436 01:07:33,990 --> 01:07:35,240 PROFESSOR: OK, sure. 1437 01:07:40,114 --> 01:07:41,364 GUEST SPEAKER: [INAUDIBLE]. 1438 01:07:51,950 --> 01:08:02,860 So not surprisingly, I graded the problem on the cache 1439 01:08:02,860 --> 01:08:07,130 oblivious algorithm doing the recursion tree. 1440 01:08:07,130 --> 01:08:15,530 There is a common mistake that many people made, and I 1441 01:08:15,530 --> 01:08:19,819 wanted to explain why it's wrong because so many people 1442 01:08:19,819 --> 01:08:20,880 made this mistake. 1443 01:08:20,880 --> 01:08:23,740 They got it almost all right, and then 1444 01:08:23,740 --> 01:08:24,880 they made this mistake. 1445 01:08:24,880 --> 01:08:28,600 So it's basically an understanding of recurrences. 1446 01:08:28,600 --> 01:08:33,420 So the recurrence, as I recall, was q of r is equal to square 1447 01:08:33,420 --> 01:08:51,824 root of r over b if square root of r is less than CM for 1448 01:08:51,824 --> 01:08:52,819 C, et cetera. 1449 01:08:52,819 --> 01:08:53,800 OK? 1450 01:08:53,800 --> 01:09:02,279 And then otherwise, it was 2q of r over 2 plus theta 1.
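Written out, the recurrence as spoken is roughly the following. This is a sketch reconstructed from the audio rather than from the board, so the notation is an assumption: Q(r) is the number of cache misses on a subproblem of size r, B is taken to be the cache-line size, M the cache size, and c a constant, with the base-case threshold taken to be cM as stated here.

```latex
Q(r) =
\begin{cases}
  \sqrt{r}/B & \text{if } \sqrt{r} < cM \text{ for a suitable constant } c,\\[4pt]
  2\,Q(r/2) + \Theta(1) & \text{otherwise.}
\end{cases}
```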
1451 01:09:05,500 --> 01:09:09,300 Now, what people did in their recursion tree-- 1452 01:09:09,300 --> 01:09:11,520 first of all, some people didn't recognize that what 1453 01:09:11,520 --> 01:09:14,890 goes in the recursion tree is this value, the number of 1454 01:09:14,890 --> 01:09:15,810 cache misses. 1455 01:09:15,810 --> 01:09:19,120 So the recursion tree is going to look like theta 1, or you 1456 01:09:19,120 --> 01:09:20,680 can leave out the thetas if you want to put 1457 01:09:20,680 --> 01:09:22,560 them in at the end. 1458 01:09:22,560 --> 01:09:25,170 Theta 1, theta 1, et cetera. 1459 01:09:25,170 --> 01:09:28,149 So many people got this, and then the question is what 1460 01:09:28,149 --> 01:09:31,529 happens when it hits the leaf. 1461 01:09:31,529 --> 01:09:32,649 OK? 1462 01:09:32,649 --> 01:09:37,010 So when it hits a leaf, many people correctly got that you 1463 01:09:37,010 --> 01:09:39,670 can't mess around with constants. 1464 01:09:39,670 --> 01:09:41,694 You have to be very careful of constants if they're in an 1465 01:09:41,694 --> 01:09:51,689 exponent, that you hit the leaf when square root of r 1466 01:09:51,689 --> 01:09:55,260 becomes less than c over m, in which case the cost is going 1467 01:09:55,260 --> 01:09:57,060 to be square root of r over b. 1468 01:09:57,060 --> 01:09:59,920 So what they did was the incorrect thing, was they put 1469 01:09:59,920 --> 01:10:03,470 square root of r over b here. 1470 01:10:03,470 --> 01:10:04,720 Why is that wrong? 1471 01:10:07,090 --> 01:10:09,500 [INTERPOSING VOICES] 1472 01:10:09,500 --> 01:10:11,890 GUEST SPEAKER: It's the wrong r. 1473 01:10:11,890 --> 01:10:13,140 Right? 1474 01:10:15,750 --> 01:10:17,660 OK, it's the wrong r. 1475 01:10:17,660 --> 01:10:20,150 This r is the r here on the right-hand side. 1476 01:10:20,150 --> 01:10:20,950 It's not the one here. 
1477 01:10:20,950 --> 01:10:24,920 It's the r if r is sufficiently small, that's the 1478 01:10:24,920 --> 01:10:25,640 value you're taking. 1479 01:10:25,640 --> 01:10:29,380 But we're expanding an r from the top here. 1480 01:10:29,380 --> 01:10:33,250 So what's the value that should go here? 1481 01:10:33,250 --> 01:10:37,310 OK, cm is the value that should go here. 1482 01:10:37,310 --> 01:10:38,680 OK? 1483 01:10:38,680 --> 01:10:40,090 The value that should go here is cm. 1484 01:10:42,590 --> 01:10:43,070 Yes? 1485 01:10:43,070 --> 01:10:45,474 AUDIENCE: [INAUDIBLE] 1486 01:10:45,474 --> 01:10:49,322 two r's can actually follow the right-side? 1487 01:10:49,322 --> 01:10:51,246 And then it's very close to where they're written the same 1488 01:10:51,246 --> 01:10:53,180 but are spoken differently. 1489 01:10:53,180 --> 01:10:55,640 GUEST SPEAKER: Well, when you say-- what do you mean? 1490 01:10:55,640 --> 01:10:57,030 AUDIENCE: [INAUDIBLE]. 1491 01:10:57,030 --> 01:10:58,410 GUEST SPEAKER: There's an r here. 1492 01:10:58,410 --> 01:11:00,354 AUDIENCE: And it's different from the other r-- 1493 01:11:00,354 --> 01:11:01,700 GUEST SPEAKER: No, it's the same r. 1494 01:11:01,700 --> 01:11:04,900 The question is there, r is a variable. 1495 01:11:04,900 --> 01:11:07,600 So it'd be nice if the r were constant, but it's not. 1496 01:11:07,600 --> 01:11:09,100 It's a variable. 1497 01:11:09,100 --> 01:11:12,850 And so the point is the point where you plug it in here, 1498 01:11:12,850 --> 01:11:15,215 you've got to plug in, not the variable, you've got to plug 1499 01:11:15,215 --> 01:11:16,470 in the value. 1500 01:11:16,470 --> 01:11:20,935 AUDIENCE: You just said for r [UNINTELLIGIBLE]. 1501 01:11:20,935 --> 01:11:21,575 GUEST SPEAKER: It's a variable. 1502 01:11:21,575 --> 01:11:25,530 You have to plug in the value of the variable at this point 1503 01:11:25,530 --> 01:11:27,260 if you're going to solve the recurrence. 
1504 01:11:27,260 --> 01:11:29,880 Putting an r here, we're trying to say, this whole 1505 01:11:29,880 --> 01:11:32,490 thing is q of r. 1506 01:11:32,490 --> 01:11:35,150 And we started out, if we did the development of the tree, 1507 01:11:35,150 --> 01:11:38,510 which is the safest thing to do, you get theta 1 plus q of 1508 01:11:38,510 --> 01:11:44,340 r over 2, and you keep going down until your value for r 1509 01:11:44,340 --> 01:11:46,960 satisfies this condition. 1510 01:11:46,960 --> 01:11:50,830 At that point, what's the value for r? 1511 01:11:50,830 --> 01:11:51,520 OK? 1512 01:11:51,520 --> 01:11:55,140 You can't then say it's the same r that you started with. 1513 01:11:55,140 --> 01:11:59,160 It's not this r, and that's because r is a variable, not 1514 01:11:59,160 --> 01:12:01,510 because of anything else. 1515 01:12:01,510 --> 01:12:04,050 r is a variable, and we're using the r. 1516 01:12:04,050 --> 01:12:05,670 This is a question of understanding of the 1517 01:12:05,670 --> 01:12:07,865 recurrence. 1518 01:12:07,865 --> 01:12:09,660 So in any case, that was a common 1519 01:12:09,660 --> 01:12:11,650 mistake that people made. 1520 01:12:11,650 --> 01:12:15,360 The other minor error that people made on that problem, 1521 01:12:15,360 --> 01:12:20,110 that most people made, was in describing where you get 1522 01:12:20,110 --> 01:12:25,840 this recurrence: they left out why it is going to 1523 01:12:25,840 --> 01:12:28,120 be square root of r over b. 1524 01:12:28,120 --> 01:12:32,420 It's really because na is approximately nb because the 1525 01:12:32,420 --> 01:12:35,800 way that the code works, we're keeping na and nb to within a 1526 01:12:35,800 --> 01:12:37,820 factor of two of each other. 1527 01:12:37,820 --> 01:12:38,510 OK? 1528 01:12:38,510 --> 01:12:40,480 And so if you didn't mention that, you lost a point.
1529 01:12:40,480 --> 01:12:44,640 It wasn't a big deal, but many people did neglect that 1530 01:12:44,640 --> 01:12:47,560 very important statement. 1531 01:12:47,560 --> 01:12:50,400 Overall, people did very well on this problem. 1532 01:12:50,400 --> 01:12:54,510 Overall, you'll see people got a lot of partial credit on it.
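The guest speaker's point about the leaves can be made concrete by expanding the recurrence as transcribed. This is a hedged worked sketch rather than the official quiz solution: it takes the base-case threshold to be cM, takes B and M to be the cache-line and cache sizes, and assumes B = O(M). The step to notice is the fourth line, where the leaf cost is written with the value r has shrunk to, not the r at the root.

```latex
\begin{align*}
Q(r) &= 2^{k}\, Q\!\left(r/2^{k}\right) + \Theta\!\left(2^{k}\right)
      && \text{after $k$ levels of expansion},\\
\sqrt{r/2^{k}} &\approx cM
      && \text{at the first level where the base case applies},\\
2^{k} &= \Theta\!\left(r/M^{2}\right)
      && \text{number of leaves},\\
Q\!\left(r/2^{k}\right) &= \sqrt{r/2^{k}}\,/\,B = \Theta(M/B)
      && \text{cost of one leaf, with no top-level $r$ in it},\\
Q(r) &= \Theta\!\left(r/M^{2}\right)\cdot\Theta(M/B) + \Theta\!\left(r/M^{2}\right)
      = \Theta\!\left(\frac{r}{MB}\right)
      && \text{assuming } B = O(M).
\end{align*}
```

Putting the root-level value of the square root of r over b at each leaf instead would make the leaf cost grow with the original problem size, which is the mistake described above.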