1 00:00:02,190 --> 00:00:04,640 The following content is provided under a Creative 2 00:00:04,640 --> 00:00:06,050 Commons license. 3 00:00:06,050 --> 00:00:09,090 Your support will help MIT OpenCourseWare continue to 4 00:00:09,090 --> 00:00:12,740 offer high quality educational resources for free. 5 00:00:12,740 --> 00:00:15,630 To make a donation or view additional materials from 6 00:00:15,630 --> 00:00:19,560 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:19,560 --> 00:00:20,810 ocw.mit.edu. 8 00:00:31,380 --> 00:00:36,900 PROFESSOR: OK, so by this time, most of you should have 9 00:00:36,900 --> 00:00:40,790 had your meetings with your masters. 10 00:00:40,790 --> 00:00:44,650 How many of you had meetings with your masters? 11 00:00:44,650 --> 00:00:47,130 Who didn't yet? 12 00:00:47,130 --> 00:00:47,610 AUDIENCE: Today. 13 00:00:47,610 --> 00:00:48,370 PROFESSOR: Today? 14 00:00:48,370 --> 00:00:49,450 And Friday. 15 00:00:49,450 --> 00:00:53,060 Who hasn't scheduled a meeting? 16 00:00:53,060 --> 00:00:55,846 So are you talking to them, and do you know-- 17 00:00:55,846 --> 00:00:57,100 AUDIENCE: [INAUDIBLE] 18 00:00:57,100 --> 00:01:01,230 PROFESSOR: OK, make sure you get masters meetings scheduled 19 00:01:01,230 --> 00:01:03,130 because this is a very important part. 20 00:01:03,130 --> 00:01:04,890 We are getting feedback from the masters. 21 00:01:04,890 --> 00:01:08,820 I think they are giving you guys a lot of good advice, and 22 00:01:08,820 --> 00:01:14,900 so make sure you schedule, you go, you meet, get that result, 23 00:01:14,900 --> 00:01:15,590 use that result. 24 00:01:15,590 --> 00:01:19,870 It's very hard to arrange that kind of a senior person 25 00:01:19,870 --> 00:01:22,280 spending a lot of time with only two of 26 00:01:22,280 --> 00:01:23,520 you at a given time. 27 00:01:23,520 --> 00:01:25,940 That's a really good ratio to have, so take 28 00:01:25,940 --> 00:01:28,330 advantage of that. 29 00:01:28,330 --> 00:01:34,630 OK, so also, I think at this point, most of the beta issues 30 00:01:34,630 --> 00:01:40,275 for project one, we have been probably worked with, and they 31 00:01:40,275 --> 00:01:44,390 have some issues some people looked at that they are 32 00:01:44,390 --> 00:01:45,210 performance code. 33 00:01:45,210 --> 00:01:48,850 They were not happy because the thing is this. 34 00:01:48,850 --> 00:01:53,920 You can always run everything instantaneously if it doesn't 35 00:01:53,920 --> 00:01:56,480 have to produce a correct answer. 36 00:01:56,480 --> 00:02:00,580 You can just return, and so part of performance is also it 37 00:02:00,580 --> 00:02:03,590 has to have some correctness. 38 00:02:03,590 --> 00:02:05,760 So certain amount of correctness has to be passed 39 00:02:05,760 --> 00:02:07,270 before we can give performer grade. 40 00:02:07,270 --> 00:02:10,490 So some people have this question. 41 00:02:10,490 --> 00:02:12,850 Why are you giving me a correctness grade because it 42 00:02:12,850 --> 00:02:14,240 seemed to run? 43 00:02:14,240 --> 00:02:16,970 But if you haven't done the right checks and stuff like 44 00:02:16,970 --> 00:02:18,940 that, then it's unfair for other people because there's 45 00:02:18,940 --> 00:02:21,600 no way people who do the correct implementation can 46 00:02:21,600 --> 00:02:23,390 match that performance because we're using 47 00:02:23,390 --> 00:02:24,550 that as upper bound. 48 00:02:24,550 --> 00:02:26,110 So make sure you do that. 49 00:02:26,110 --> 00:02:30,660 And another issue people had was when you go to masters, 50 00:02:30,660 --> 00:02:34,730 you're going to work with one or two other students. 51 00:02:34,730 --> 00:02:38,470 And in your group, you're going to share all the 52 00:02:38,470 --> 00:02:42,260 material with your group member and nobody else. 53 00:02:42,260 --> 00:02:47,680 But between the beta and final, you get opportunity to 54 00:02:47,680 --> 00:02:52,460 basically get input from your masters, and also see what the 55 00:02:52,460 --> 00:02:55,720 other people who are coming to a project in your 56 00:02:55,720 --> 00:02:57,330 discussion have done. 57 00:02:57,330 --> 00:02:59,100 So it's OK for you to see those code. 58 00:02:59,100 --> 00:03:00,370 You can't take the code home. 59 00:03:00,370 --> 00:03:02,000 You can't just say give me a print-out, I am going to take 60 00:03:02,000 --> 00:03:02,860 home and re-implement. 61 00:03:02,860 --> 00:03:05,360 That doesn't count, but if you see some clever things they 62 00:03:05,360 --> 00:03:07,560 have done, it's all here to learn. 63 00:03:07,560 --> 00:03:10,240 Class is all the mostly about learning. 64 00:03:10,240 --> 00:03:12,650 However we have to grade you, so it's not all about grading, 65 00:03:12,650 --> 00:03:13,670 it's about learning. 66 00:03:13,670 --> 00:03:16,650 So we are trying to get a balance where we opportunity 67 00:03:16,650 --> 00:03:18,890 for you to learn from your peers. 68 00:03:18,890 --> 00:03:21,400 So if you go look at other people's code, if you find 69 00:03:21,400 --> 00:03:23,400 something cool they have done, learn. 70 00:03:23,400 --> 00:03:24,950 And also if you [UNINTELLIGIBLE] 71 00:03:24,950 --> 00:03:26,130 for you [UNINTELLIGIBLE] 72 00:03:26,130 --> 00:03:28,150 say, look, I did something cool. 73 00:03:28,150 --> 00:03:31,240 It's a good learning experience in there, so that's 74 00:03:31,240 --> 00:03:34,910 one opportunity you have to talk to other people on other 75 00:03:34,910 --> 00:03:37,850 groups, to basically learn something or find some 76 00:03:37,850 --> 00:03:40,510 interesting tidbits of how to get good performance. 77 00:03:40,510 --> 00:03:43,500 And hopefully, you learn some interesting things and able to 78 00:03:43,500 --> 00:03:45,770 implement that in your final. 79 00:03:45,770 --> 00:03:46,790 OK? 80 00:03:46,790 --> 00:03:51,370 So today we are going to talk about memory systems and 81 00:03:51,370 --> 00:03:53,900 performance engineering. 82 00:03:53,900 --> 00:04:00,640 So if you look at basic idea of what memory systems, you 83 00:04:00,640 --> 00:04:03,870 want to build a computer that seemed to have 84 00:04:03,870 --> 00:04:05,860 a really fast memory. 85 00:04:05,860 --> 00:04:07,750 You don't want things to be slow. 86 00:04:07,750 --> 00:04:10,110 For example, a long time ago, people build these computers 87 00:04:10,110 --> 00:04:13,590 where memory's always slow, everything [UNINTELLIGIBLE] 88 00:04:13,590 --> 00:04:14,350 further away. 89 00:04:14,350 --> 00:04:15,872 It doesn't help anybody. 90 00:04:15,872 --> 00:04:20,190 The way we know how to do that is to build small amount of 91 00:04:20,190 --> 00:04:23,460 memory very close to the processor. 92 00:04:23,460 --> 00:04:25,860 You can't build a huge amount of things all close to you. 93 00:04:25,860 --> 00:04:29,890 That doesn't work, doesn't scale like that. 94 00:04:29,890 --> 00:04:32,090 And we made a cache just like that, and you build larger and 95 00:04:32,090 --> 00:04:35,060 larger memory [UNINTELLIGIBLE] going down, and the illusion 96 00:04:35,060 --> 00:04:37,550 is you want to give the illusion of you have huge 97 00:04:37,550 --> 00:04:39,080 amount of memory. 98 00:04:39,080 --> 00:04:41,540 Everybody's very close to you. 99 00:04:41,540 --> 00:04:44,370 OK, so it looks like you have millions of friends, that you 100 00:04:44,370 --> 00:04:46,870 talk to everybody, and that doesn't happen. 101 00:04:46,870 --> 00:04:49,040 But how do you give that illusion? 102 00:04:49,040 --> 00:04:52,670 And the way that you give that illusion is when you use 103 00:04:52,670 --> 00:04:56,510 normal programming practice, when you normally using data, 104 00:04:56,510 --> 00:04:58,065 people have found there are two properties. 105 00:05:01,820 --> 00:05:04,010 The first thing is called temporal locality. 106 00:05:04,010 --> 00:05:08,290 That means if I use some data item, there's a good chance I 107 00:05:08,290 --> 00:05:11,400 use that data item again very soon. 108 00:05:11,400 --> 00:05:16,150 There is something I haven't used for a long time, probably 109 00:05:16,150 --> 00:05:18,360 has a good chance I will not use it for a 110 00:05:18,360 --> 00:05:19,520 very, very long time. 111 00:05:19,520 --> 00:05:21,590 So can I take advantage of temporal locality? 112 00:05:21,590 --> 00:05:24,505 So that mean some data will be get for use a lot of time, in 113 00:05:24,505 --> 00:05:27,270 a very quick [UNINTELLIGIBLE], so those things should be very 114 00:05:27,270 --> 00:05:29,380 near to you because you might. 115 00:05:29,380 --> 00:05:31,060 Other one is spatial locality. 116 00:05:31,060 --> 00:05:33,640 That means if I use some data item, there's a good chance 117 00:05:33,640 --> 00:05:36,760 you'll be using data items next to that, closer to that. 118 00:05:36,760 --> 00:05:38,700 So can we take advantage of that? 119 00:05:38,700 --> 00:05:43,500 So those two properties help you help the compiler to fool 120 00:05:43,500 --> 00:05:46,940 the system, fool everybody, thinking that it has this huge 121 00:05:46,940 --> 00:05:50,310 amount of memory very close to you, but internally doesn't. 122 00:05:50,310 --> 00:05:53,690 Unfortunately, since it's trying to fool you, it doesn't 123 00:05:53,690 --> 00:05:56,110 work all the time, and when it doesn't work, it's good to 124 00:05:56,110 --> 00:06:01,000 recognize and able to fix those things. 125 00:06:01,000 --> 00:06:05,280 So I showed this picture sometime ago, too. 126 00:06:05,280 --> 00:06:08,440 Memories in a big hierarchy. 127 00:06:08,440 --> 00:06:12,630 L1, L2, in fact, we have L3 cache and a memory, disk, 128 00:06:12,630 --> 00:06:14,130 tapes, it can go up and down. 129 00:06:14,130 --> 00:06:16,662 And the key thing is when you go here, you're starting from 130 00:06:16,662 --> 00:06:18,630 registers, you have very small amount of 131 00:06:18,630 --> 00:06:20,570 things, very fast access. 132 00:06:20,570 --> 00:06:22,440 Here, we have almost infinite amount of 133 00:06:22,440 --> 00:06:24,290 things, very slow access. 134 00:06:24,290 --> 00:06:28,500 And the two reasons why this has to be smart because first 135 00:06:28,500 --> 00:06:30,510 of all, those things are very expensive. 136 00:06:30,510 --> 00:06:32,140 And second of course, you can't [UNINTELLIGIBLE] too 137 00:06:32,140 --> 00:06:34,680 much, and when you go down, things get somewhat cheaper. 138 00:06:34,680 --> 00:06:37,750 So that was one of the kind of economic incentive here. 139 00:06:40,900 --> 00:06:43,690 So I talked about cache issues, and I'm going to go 140 00:06:43,690 --> 00:06:46,450 through this again because we are really going to dig deep 141 00:06:46,450 --> 00:06:47,730 into these things. 142 00:06:47,730 --> 00:06:50,430 So when you have a cache, that means you have a small memory 143 00:06:50,430 --> 00:06:53,790 close to you that's basically shadowing something sitting 144 00:06:53,790 --> 00:06:55,070 behind you. 145 00:06:55,070 --> 00:06:58,490 So when you ask for the data in the cache, there are many 146 00:06:58,490 --> 00:07:01,100 reasons why it's not there. 147 00:07:01,100 --> 00:07:03,090 First thing is cold miss. 148 00:07:03,090 --> 00:07:05,760 What that means is you're asking for something that you 149 00:07:05,760 --> 00:07:06,645 have never asked before. 150 00:07:06,645 --> 00:07:09,500 It's the first time you're asking, so the data is 151 00:07:09,500 --> 00:07:11,890 probably sitting way behind in your memory where, 152 00:07:11,890 --> 00:07:13,180 [UNINTELLIGIBLE] 153 00:07:13,180 --> 00:07:13,580 in a disk. 154 00:07:13,580 --> 00:07:15,330 We haven't even loaded it in there. 155 00:07:15,330 --> 00:07:21,780 And the first time you have to get it, so that get pulled in, 156 00:07:21,780 --> 00:07:26,020 and so this seems to be a place where there's nothing 157 00:07:26,020 --> 00:07:27,080 you can do. 158 00:07:27,080 --> 00:07:32,460 But another part of locality is a thing called prefetching. 159 00:07:32,460 --> 00:07:37,350 These modern processors have a crystal ball. 160 00:07:37,350 --> 00:07:39,780 They carry most of the things that branch prediction and 161 00:07:39,780 --> 00:07:43,980 stuff is trying to predict what you want to do. 162 00:07:43,980 --> 00:07:46,320 You can do the same thing in memory. 163 00:07:46,320 --> 00:07:48,550 You can look at what you have done. 164 00:07:48,550 --> 00:07:51,470 And Intel has this huge amount of circuitry that they don't 165 00:07:51,470 --> 00:07:54,380 tell anybody what they do, but they're trying to predict what 166 00:07:54,380 --> 00:07:57,060 you might fetch next, what data you might look for next. 167 00:07:57,060 --> 00:08:01,610 And if it is working well, something you had never, ever 168 00:08:01,610 --> 00:08:04,590 looked before, it might deduce that you might need it and go 169 00:08:04,590 --> 00:08:05,560 get it for you. 170 00:08:05,560 --> 00:08:10,130 So if that works, great. 171 00:08:10,130 --> 00:08:12,790 So then there's thing called capacity miss. 172 00:08:12,790 --> 00:08:18,240 What that means is as I said, I am trying to keep my best 173 00:08:18,240 --> 00:08:22,330 friends, or the people I will be working with a lot-- 174 00:08:22,330 --> 00:08:23,000 Oh, nice. 175 00:08:23,000 --> 00:08:26,240 What's going on here? 176 00:08:26,240 --> 00:08:28,040 Please start later. 177 00:08:28,040 --> 00:08:29,290 OK. 178 00:08:32,169 --> 00:08:35,820 --the people I'm going to work a lot with, close to me. 179 00:08:35,820 --> 00:08:38,880 The problem is I can have only certain number of friends, and 180 00:08:38,880 --> 00:08:41,890 if you have more than that, I can't keep everybody. 181 00:08:41,890 --> 00:08:46,430 So hopefully, your program is going to use a lot what we 182 00:08:46,430 --> 00:08:49,400 call the working set that can fit in the cache. 183 00:08:49,400 --> 00:08:52,030 And at that point, you get all of those people, like I said, 184 00:08:52,030 --> 00:08:56,600 get everybody into one room, and the problem is if your 185 00:08:56,600 --> 00:08:58,150 party spills over from the room, then 186 00:08:58,150 --> 00:08:59,340 there's a lot of issues. 187 00:08:59,340 --> 00:09:01,570 But if everybody fits in the room, things are very nice. 188 00:09:01,570 --> 00:09:03,390 You can have a good interaction, continuous 189 00:09:03,390 --> 00:09:04,850 interaction, with them. 190 00:09:04,850 --> 00:09:10,900 And what happens is, if not, the worst case is the caches 191 00:09:10,900 --> 00:09:15,370 have eviction policy, normally is called least recently used. 192 00:09:15,370 --> 00:09:18,480 That means if I don't have room in the cache, I figure 193 00:09:18,480 --> 00:09:21,500 out the cache line that I haven't touched for longest, 194 00:09:21,500 --> 00:09:24,710 and I get rid of that, bring the next one. 195 00:09:24,710 --> 00:09:27,250 So that's a good policy. 196 00:09:27,250 --> 00:09:29,740 It fits in your locality thing, but when 197 00:09:29,740 --> 00:09:30,990 will did not work? 198 00:09:33,620 --> 00:09:34,220 [UNINTELLIGIBLE] 199 00:09:34,220 --> 00:09:39,880 going to figure out when least recently used will create a 200 00:09:39,880 --> 00:09:42,257 really bad problem in capacity. 201 00:09:42,257 --> 00:09:43,507 AUDIENCE: [INAUDIBLE] 202 00:09:45,410 --> 00:09:46,911 just one. 203 00:09:46,911 --> 00:09:51,060 Say you're [INAUDIBLE] set, and that set is, by one, 204 00:09:51,060 --> 00:09:52,890 bigger than the [INAUDIBLE] 205 00:09:52,890 --> 00:09:53,630 missing. 206 00:09:53,630 --> 00:09:54,205 PROFESSOR: Very good. 207 00:09:54,205 --> 00:09:56,900 So what you're going through, you're going round some data 208 00:09:56,900 --> 00:10:01,540 again and again, and that amount of data is one bigger 209 00:10:01,540 --> 00:10:03,210 than your cache line. 210 00:10:03,210 --> 00:10:04,670 You are going to have no locality because 211 00:10:04,670 --> 00:10:07,830 [UNINTELLIGIBLE], and when you go to, just before you use 212 00:10:07,830 --> 00:10:10,360 that data, they said, oops, I need to bring one more data. 213 00:10:10,360 --> 00:10:13,950 I need to evict something, and who am I going to evict? 214 00:10:13,950 --> 00:10:17,630 The guy I'm going to use next because it was the least 215 00:10:17,630 --> 00:10:18,340 recently used thing. 216 00:10:18,340 --> 00:10:21,100 So it goes, and then you go there, and you bring that. 217 00:10:21,100 --> 00:10:21,660 Who's it going to evict? 218 00:10:21,660 --> 00:10:23,580 You're going to evict the one I just brought. 219 00:10:23,580 --> 00:10:26,630 You're going to evict the next one that I am just about to 220 00:10:26,630 --> 00:10:28,020 use and bring that one. 221 00:10:28,020 --> 00:10:30,450 And you go use that, and then to use the next one, you're 222 00:10:30,450 --> 00:10:32,690 going evict the one you're just about to use. 223 00:10:32,690 --> 00:10:35,480 So by going and doing that, you basically get no locality. 224 00:10:35,480 --> 00:10:37,314 So that's a really bad problem in here. 225 00:10:39,920 --> 00:10:43,370 So another interesting thing is called a conflict misses. 226 00:10:43,370 --> 00:10:46,920 Conflict misses says normally, when you have a cache-- 227 00:10:46,920 --> 00:10:50,360 so what we have is you have this large storage, which 228 00:10:50,360 --> 00:10:51,050 [UNINTELLIGIBLE] 229 00:10:51,050 --> 00:10:55,100 the memory, you're going to map into smaller storage in 230 00:10:55,100 --> 00:10:57,420 here, which is cache. 231 00:10:57,420 --> 00:11:01,620 So one way to say is any of this value line can be 232 00:11:01,620 --> 00:11:03,550 anywhere here. 233 00:11:03,550 --> 00:11:05,720 That's called fully associative cache. 234 00:11:08,670 --> 00:11:12,000 Implementing that is somewhat complicated, so the other 235 00:11:12,000 --> 00:11:13,940 extreme is called direct map cache. 236 00:11:13,940 --> 00:11:17,080 What that means is this segment, the same size here, 237 00:11:17,080 --> 00:11:20,570 can map into here, and then the next segment also 238 00:11:20,570 --> 00:11:21,370 can map into here. 239 00:11:21,370 --> 00:11:25,490 That means if you get this cache line here, it can be 240 00:11:25,490 --> 00:11:30,200 only mapped into one place in this here. 241 00:11:30,200 --> 00:11:33,800 This cache line can only be here, and also this cache line 242 00:11:33,800 --> 00:11:37,620 can also only be here if it is the same offset. 243 00:11:37,620 --> 00:11:40,000 So that means for every cache line, there's only one 244 00:11:40,000 --> 00:11:41,980 location in the cache. 245 00:11:41,980 --> 00:11:45,550 So here, what would be really sad scenario? 246 00:11:51,214 --> 00:11:53,580 AUDIENCE: [INAUDIBLE] 247 00:11:53,580 --> 00:11:56,006 PROFESSOR: Yeah, I mean I assume I have a 248 00:11:56,006 --> 00:11:57,010 lot of these slots. 249 00:11:57,010 --> 00:11:59,860 I read something here, and the next time I read something 250 00:11:59,860 --> 00:12:01,750 here, next time I read something here, and I go round 251 00:12:01,750 --> 00:12:03,350 and round this way. 252 00:12:03,350 --> 00:12:07,000 I might be touching only very few number of data items, but 253 00:12:07,000 --> 00:12:09,630 they also mapping into one cache line even though my 254 00:12:09,630 --> 00:12:13,320 working set is very small than the cache. 255 00:12:13,320 --> 00:12:18,340 I still don't have any cache usage 256 00:12:18,340 --> 00:12:20,840 because I am having conflicts. 257 00:12:20,840 --> 00:12:23,430 And then there are two other misses when you're going to 258 00:12:23,430 --> 00:12:24,220 multi-process. 259 00:12:24,220 --> 00:12:26,800 We'll go through them in later lectures. 260 00:12:26,800 --> 00:12:28,270 One is called true sharing. 261 00:12:28,270 --> 00:12:33,560 That means there are caches private to each core, and what 262 00:12:33,560 --> 00:12:35,780 happens is if one core uses some data, you've have to get 263 00:12:35,780 --> 00:12:37,330 the cache line there. 264 00:12:37,330 --> 00:12:39,100 If the other core wants to use that data, you have to bring 265 00:12:39,100 --> 00:12:42,220 the cache line back there, and the cache lines can ping-pong 266 00:12:42,220 --> 00:12:43,940 [UNINTELLIGIBLE] that. 267 00:12:43,940 --> 00:12:46,090 False sharing is even a worst way. 268 00:12:46,090 --> 00:12:50,700 So what happens is if the first processor here touches 269 00:12:50,700 --> 00:12:53,910 this value of the cache line, second processor touches this 270 00:12:53,910 --> 00:12:55,841 value of the cache line. 271 00:12:55,841 --> 00:12:59,320 We're not sharing anything, but unfortunately, the two 272 00:12:59,320 --> 00:13:01,920 things I'm using sits next to each other, so when I ride 273 00:13:01,920 --> 00:13:04,500 this thing, I get the cache line, he doesn't. 274 00:13:04,500 --> 00:13:05,660 When he need to ride this thing, he has to 275 00:13:05,660 --> 00:13:06,510 get the cache line. 276 00:13:06,510 --> 00:13:10,860 So I am seemingly using independent data, but I am 277 00:13:10,860 --> 00:13:13,720 bouncing cache line in between, and if that happened, 278 00:13:13,720 --> 00:13:16,880 that going to have a huge big performance impact. 279 00:13:16,880 --> 00:13:17,610 So these other things. 280 00:13:17,610 --> 00:13:20,850 The last two, we'll get to a little bit later. 281 00:13:20,850 --> 00:13:26,010 So today I am going to start with modeling how caches work 282 00:13:26,010 --> 00:13:28,180 with a very simplistic cache. 283 00:13:28,180 --> 00:13:30,840 So here's, assume, my cache. 284 00:13:30,840 --> 00:13:32,730 I have 32 kilobytes in here. 285 00:13:35,300 --> 00:13:36,640 This is a direct map cache. 286 00:13:36,640 --> 00:13:41,600 That means that in my memory, 32 kilobyte chucks get mapped 287 00:13:41,600 --> 00:13:42,900 to direct [UNINTELLIGIBLE]. 288 00:13:42,900 --> 00:13:45,060 This 32 kilobyte get mapped here. 289 00:13:45,060 --> 00:13:47,970 That 32 kilobyte get mapped here. 290 00:13:47,970 --> 00:13:57,890 And the cache line inside is 64 bytes, 64, and that means I 291 00:13:57,890 --> 00:14:00,780 have basically [UNINTELLIGIBLE PHRASE]. 292 00:14:03,280 --> 00:14:04,590 32 kilobyte in here. 293 00:14:04,590 --> 00:14:05,700 OK? 294 00:14:05,700 --> 00:14:12,770 So just remember this as we go on doing this, we will use 295 00:14:12,770 --> 00:14:13,530 this as a formula. 296 00:14:13,530 --> 00:14:14,130 OK? 297 00:14:14,130 --> 00:14:16,340 And we assume if you getting the cache, [UNINTELLIGIBLE] 298 00:14:16,340 --> 00:14:19,400 single cycle, if you miss 100 cycles. 299 00:14:19,400 --> 00:14:22,210 OK, so it's nice numbers to do that. 300 00:14:22,210 --> 00:14:28,630 And so the first thing is we have a reference like this. 301 00:14:28,630 --> 00:14:29,210 OK? 302 00:14:29,210 --> 00:14:35,820 You go from [i] equals 0 to [i] less than very large 303 00:14:35,820 --> 00:14:38,640 number, A[i]. 304 00:14:38,640 --> 00:14:41,600 I just go to accessing memory like one after another after 305 00:14:41,600 --> 00:14:42,850 another after another. 306 00:14:46,610 --> 00:14:49,070 So you see how this is going on in here? 307 00:14:49,070 --> 00:14:51,580 So also I am accessing integers, so that means four 308 00:14:51,580 --> 00:14:54,460 bytes at a time in cache line. 309 00:14:54,460 --> 00:14:55,200 So what happens here? 310 00:14:55,200 --> 00:14:58,800 So I assume size of int is four, so you're getting four 311 00:14:58,800 --> 00:15:01,490 bytes in here. 312 00:15:01,490 --> 00:15:03,610 So yeah, you're doing s reads to A. I'm accessing 313 00:15:03,610 --> 00:15:06,240 s elements in here. 314 00:15:06,240 --> 00:15:10,490 And you have 16 elements of a per cache line. 315 00:15:10,490 --> 00:15:12,220 OK, you see why 16 if there? 316 00:15:12,220 --> 00:15:19,010 Because I have 64 in here, 64 bytes in here. 317 00:15:19,010 --> 00:15:22,340 Each time axises four bytes, so I have 16 of integers in 318 00:15:22,340 --> 00:15:24,748 cache line. 319 00:15:24,748 --> 00:15:26,134 AUDIENCE: [INAUDIBLE] 320 00:15:26,134 --> 00:15:27,520 bits or bytes? 321 00:15:27,520 --> 00:15:28,330 PROFESSOR: Bytes. 322 00:15:28,330 --> 00:15:29,010 Byte. 323 00:15:29,010 --> 00:15:29,775 Did I say bits? 324 00:15:29,775 --> 00:15:32,690 No, bytes. 325 00:15:32,690 --> 00:15:37,322 So what happens is if it axises the same element, 15 of 326 00:15:37,322 --> 00:15:40,680 every 16 is in the cache because when we are going one 327 00:15:40,680 --> 00:15:44,230 after the first guy, cache miss, I had to go bring it, 328 00:15:44,230 --> 00:15:46,240 and the next 15, I have brought to it. 329 00:15:46,240 --> 00:15:47,360 It's in the cache. 330 00:15:47,360 --> 00:15:51,170 So what should be my cost? 331 00:15:51,170 --> 00:15:51,880 [UNINTELLIGIBLE] 332 00:15:51,880 --> 00:15:53,880 cost of memory access. 333 00:15:57,840 --> 00:16:03,715 I am accessing s data, and 1/16th of that is a cache 334 00:16:03,715 --> 00:16:07,320 miss, 15/16th is a cache hit. 335 00:16:07,320 --> 00:16:09,320 So basically total [UNINTELLIGIBLE] 336 00:16:09,320 --> 00:16:15,970 is a 15/16th of s is a cache hit, and other 1/16th I have 337 00:16:15,970 --> 00:16:21,970 100 times cycles I need because it's a cache miss. 338 00:16:21,970 --> 00:16:24,210 Everybody good so far? 339 00:16:24,210 --> 00:16:27,110 OK, so what type of locality do we have here? 340 00:16:29,760 --> 00:16:31,090 First of all, do we have locality? 341 00:16:33,870 --> 00:16:36,231 Who thinks we have locality? 342 00:16:36,231 --> 00:16:38,480 OK, lot of people thinks we have locality. 343 00:16:38,480 --> 00:16:40,400 What type? 344 00:16:40,400 --> 00:16:41,370 AUDIENCE: [INAUDIBLE] 345 00:16:41,370 --> 00:16:44,090 PROFESSOR: Spatial locality because what we are doing is 346 00:16:44,090 --> 00:16:47,030 we are accessing the nearby elements even though we are 347 00:16:47,030 --> 00:16:49,870 not getting back any of the data yet, so we get some 348 00:16:49,870 --> 00:16:51,810 spacial locality axis here. 349 00:16:51,810 --> 00:16:55,150 That's good, so this is our very -- most simple 350 00:16:55,150 --> 00:16:57,220 thing we can do. 351 00:16:57,220 --> 00:16:59,300 So what kind of misses are we going to have in the cache? 352 00:17:06,119 --> 00:17:07,869 Cold misses. 353 00:17:07,869 --> 00:17:09,859 I'm missing because I have never seen that data before, 354 00:17:09,859 --> 00:17:13,280 so it's a cold miss in here. 355 00:17:13,280 --> 00:17:15,550 OK, so that's that. 356 00:17:15,550 --> 00:17:17,700 Let's look at this one. 357 00:17:17,700 --> 00:17:18,950 I'm accessing A[0] 358 00:17:21,780 --> 00:17:23,030 s times. 359 00:17:25,690 --> 00:17:27,400 OK? 360 00:17:27,400 --> 00:17:28,852 What's the total access time? 361 00:17:40,950 --> 00:17:44,360 How many cache misses am I going to get? 362 00:17:44,360 --> 00:17:50,780 One, so I basically have 100 time for the first cache miss, 363 00:17:50,780 --> 00:17:53,740 and s minus 1 for the rest basically it's a hit. 364 00:17:53,740 --> 00:17:55,080 OK? 365 00:17:55,080 --> 00:17:56,760 That's good, so what kind of locality's this? 366 00:18:02,070 --> 00:18:03,030 Spatial or temporal. 367 00:18:03,030 --> 00:18:05,490 There's not too many choices. 368 00:18:05,490 --> 00:18:06,740 Or none. 369 00:18:09,140 --> 00:18:10,390 How many people think spatial? 370 00:18:12,850 --> 00:18:15,770 How many people think temporal? 371 00:18:15,770 --> 00:18:19,230 How many people said there's none locality? 372 00:18:19,230 --> 00:18:20,990 OK, that's a lot of temporal. 373 00:18:20,990 --> 00:18:24,210 OK, you want to get [UNINTELLIGIBLE]. 374 00:18:24,210 --> 00:18:26,060 That's temporal locality here because you're accessing to 375 00:18:26,060 --> 00:18:28,650 the same thing again and again and again, same data. 376 00:18:28,650 --> 00:18:28,910 So this is-- 377 00:18:28,910 --> 00:18:30,056 Oh, OK. 378 00:18:30,056 --> 00:18:32,350 Want to restart, OK. 379 00:18:32,350 --> 00:18:34,826 Wait a little. 380 00:18:34,826 --> 00:18:36,810 So this is only time we notice in a hurry. 381 00:18:36,810 --> 00:18:38,840 Every time, I go to do something, I get hourglass. 382 00:18:41,630 --> 00:18:45,190 So what kind of misses are we getting? 383 00:18:45,190 --> 00:18:46,440 Trick question. 384 00:18:54,150 --> 00:18:57,510 I got one cold miss, and that's about it. 385 00:18:57,510 --> 00:18:58,840 And the rest I don't have any misses. 386 00:18:58,840 --> 00:19:01,410 So if I have a miss, it's a cold miss. 387 00:19:01,410 --> 00:19:05,260 OK, so here I am doing something interesting. 388 00:19:05,260 --> 00:19:08,720 So what the heck is in this list? 389 00:19:08,720 --> 00:19:10,840 So I'm accessing A[i] 390 00:19:10,840 --> 00:19:11,840 [UNINTELLIGIBLE] 391 00:19:11,840 --> 00:19:14,860 this 2 to the power number, one shifted to the entire 392 00:19:14,860 --> 00:19:21,470 miss, 2 to the power N. I shifted between 4 and 13. 393 00:19:21,470 --> 00:19:22,500 What this 13? 394 00:19:22,500 --> 00:19:24,080 Why is 13? 395 00:19:24,080 --> 00:19:25,330 Less than 13. 396 00:19:29,590 --> 00:19:31,060 What's 2 the power of 13? 397 00:19:31,060 --> 00:19:32,310 I think I got this right. 398 00:19:37,625 --> 00:19:43,070 AUDIENCE: 8,192. 399 00:19:43,070 --> 00:19:44,370 PROFESSOR: What's 2 to the power of 13? 400 00:19:44,370 --> 00:19:45,270 AUDIENCE: 8,192. 401 00:19:45,270 --> 00:19:46,890 PROFESSOR: Yeah, 8K. 402 00:19:46,890 --> 00:19:50,330 And I have 32K. 403 00:19:50,330 --> 00:19:54,290 So 8K of 32 [UNINTELLIGIBLE]. 404 00:19:54,290 --> 00:19:59,680 8K of integers, which is each as 4 bytes is how much? 405 00:19:59,680 --> 00:20:00,140 32K. 406 00:20:00,140 --> 00:20:05,410 So what this says is everything I access should fit 407 00:20:05,410 --> 00:20:05,900 into the cache. 408 00:20:05,900 --> 00:20:07,080 I am accessing data like that. 409 00:20:07,080 --> 00:20:08,680 I'm going back and accessing data like that. 410 00:20:08,680 --> 00:20:11,520 I'm going back and accessing data like that, and it should 411 00:20:11,520 --> 00:20:12,700 all fit into the cache. 412 00:20:12,700 --> 00:20:14,270 Do you see what I do? 413 00:20:14,270 --> 00:20:18,370 I access up to 2 to the power of 18, and I go back and 414 00:20:18,370 --> 00:20:19,060 access that again. 415 00:20:19,060 --> 00:20:20,640 I'm just going back and back because 416 00:20:20,640 --> 00:20:23,150 of the model operation. 417 00:20:23,150 --> 00:20:25,670 Everybody see what's going on? 418 00:20:25,670 --> 00:20:27,740 So how many cache misses should I have? 419 00:20:40,916 --> 00:20:44,870 AUDIENCE: There are four cache [INAUDIBLE]. 420 00:20:44,870 --> 00:20:48,090 PROFESSOR: OK, how many cache lines I would be accessing if 421 00:20:48,090 --> 00:20:52,120 I am doing i1 2 to the power N? 422 00:20:54,875 --> 00:20:57,070 [i] 423 00:20:57,070 --> 00:21:00,974 mod 2 to the power N. 424 00:21:00,974 --> 00:21:02,224 AUDIENCE: [INAUDIBLE] 425 00:21:17,720 --> 00:21:20,240 PROFESSOR: So one miss for each axis line 426 00:21:20,240 --> 00:21:21,070 the first time around. 427 00:21:21,070 --> 00:21:23,290 Afterward, it's only in the cache. 428 00:21:23,290 --> 00:21:28,430 So how many axis lines? 429 00:21:28,430 --> 00:21:30,000 2 to the power N. 430 00:21:30,000 --> 00:21:30,770 Is that right? 431 00:21:30,770 --> 00:21:32,020 I think this is wrong. 432 00:21:36,087 --> 00:21:40,510 Oh, yeah, this is right because every axis-- 433 00:21:40,510 --> 00:21:44,676 So in the cache line, how many-- 434 00:21:44,676 --> 00:21:45,926 AUDIENCE: [INAUDIBLE] 435 00:21:48,150 --> 00:21:49,400 PROFESSOR: This is four. 436 00:21:51,730 --> 00:21:54,440 So there are 16 entries here. 437 00:21:57,760 --> 00:21:59,780 16 entries in here. 438 00:21:59,780 --> 00:22:05,470 OK, only one of them basically misses that, so that means if 439 00:22:05,470 --> 00:22:09,224 2 to the power N axises-- 440 00:22:09,224 --> 00:22:11,090 AUDIENCE: [INAUDIBLE] 441 00:22:11,090 --> 00:22:17,680 PROFESSOR: You are doing 2 to the power N axises before you 442 00:22:17,680 --> 00:22:18,720 go back in here. 443 00:22:18,720 --> 00:22:22,260 So what you're doing is you are-- 444 00:22:22,260 --> 00:22:27,320 you make this many axises before you go back again. 445 00:22:27,320 --> 00:22:28,910 OK? 446 00:22:28,910 --> 00:22:34,130 And how many axises assigned to be in the same cache line? 447 00:22:34,130 --> 00:22:37,490 You have 64 bytes in the cache. 448 00:22:37,490 --> 00:22:41,730 OK each axis is four. 449 00:22:41,730 --> 00:22:43,134 16. 450 00:22:43,134 --> 00:22:46,040 AUDIENCE: [INAUDIBLE] 451 00:22:46,040 --> 00:22:48,210 PROFESSOR: 16. 452 00:22:48,210 --> 00:22:49,430 So that's true [UNINTELLIGIBLE]. 453 00:22:49,430 --> 00:22:50,680 AUDIENCE: [INAUDIBLE] 454 00:22:53,410 --> 00:22:56,510 PROFESSOR: Yeah, but what I'm saying is I might not access 455 00:22:56,510 --> 00:22:57,250 the [UNINTELLIGIBLE] 456 00:22:57,250 --> 00:22:58,890 because if any small. 457 00:22:58,890 --> 00:23:00,760 I only access a certain amount of cache line, so I'm 458 00:23:00,760 --> 00:23:02,760 accessing a part of the cache. 459 00:23:02,760 --> 00:23:04,520 I might not go through the-- can everybody see this, how 460 00:23:04,520 --> 00:23:06,160 this is working? 461 00:23:06,160 --> 00:23:12,170 Because if I'm only accessing, we'll say, accessing one, if 462 00:23:12,170 --> 00:23:16,472 this is, we'll say, 2 to the power of 5. 463 00:23:16,472 --> 00:23:18,570 OK, I am not going to access the entire cache line. 464 00:23:18,570 --> 00:23:18,990 [UNINTELLIGIBLE] 465 00:23:18,990 --> 00:23:22,100 cache is I'm going to [UNINTELLIGIBLE] 466 00:23:22,100 --> 00:23:23,220 go through the cache. 467 00:23:23,220 --> 00:23:27,670 So this many cache lines I'm accessing. 468 00:23:27,670 --> 00:23:31,665 OK, and the first time I access that, I get a cache 469 00:23:31,665 --> 00:23:34,680 miss, and after that, it say everything is in the cache. 470 00:23:38,820 --> 00:23:39,460 Everybody's following me? 471 00:23:39,460 --> 00:23:41,650 Or are we like lost in here? 472 00:23:41,650 --> 00:23:44,540 How many people are lost? 473 00:23:44,540 --> 00:23:46,260 OK, let's do this. 474 00:23:49,430 --> 00:23:57,120 So what happens is if you have a modular 2 to the power N. 475 00:23:57,120 --> 00:24:03,390 What that means is I'm going to keep accessing one to 476 00:24:03,390 --> 00:24:07,900 somewhere 2 to the power N, and the next one, I am going 477 00:24:07,900 --> 00:24:09,150 back here again. 478 00:24:12,800 --> 00:24:14,580 That's my axises, basically, because it goes 479 00:24:14,580 --> 00:24:16,620 [UNINTELLIGIBLE] 480 00:24:16,620 --> 00:24:18,380 accessing to the power of N. 481 00:24:18,380 --> 00:24:23,840 I nicely said n is between 4 and 13. 482 00:24:23,840 --> 00:24:38,820 So n is 13 means the maximum I can do is 8K. 483 00:24:38,820 --> 00:24:40,070 8K increase. 484 00:24:42,340 --> 00:24:51,470 8K increase equals 32K kilobytes of memory. 485 00:24:54,540 --> 00:24:56,580 So that means you don't have any option of 486 00:24:56,580 --> 00:24:59,660 overwriting the cache. 487 00:24:59,660 --> 00:25:02,790 So you go to cache, [UNINTELLIGIBLE]. 488 00:25:02,790 --> 00:25:04,880 You don't wrap around in the cache. 489 00:25:04,880 --> 00:25:06,325 Do you see that? 490 00:25:06,325 --> 00:25:09,450 OK, you know wrap arounding is so bad, that means all these 491 00:25:09,450 --> 00:25:12,110 things is going to fit in the cache, and then when you get 492 00:25:12,110 --> 00:25:14,860 [UNINTELLIGIBLE], this is going to be in the cache 493 00:25:14,860 --> 00:25:22,630 because the data you access is smaller than the entire cache. 494 00:25:22,630 --> 00:25:25,760 OK, so what that means is only the first line has to have 495 00:25:25,760 --> 00:25:28,410 some cache misses. 496 00:25:28,410 --> 00:25:29,965 So how many cache misses you are going to have 497 00:25:29,965 --> 00:25:30,470 in the first line? 498 00:25:30,470 --> 00:25:32,790 Or any of the lines going to have cache misses because the 499 00:25:32,790 --> 00:25:34,430 first line still fit in the cache. 500 00:25:34,430 --> 00:25:34,970 [UNINTELLIGIBLE] 501 00:25:34,970 --> 00:25:35,280 anything. 502 00:25:35,280 --> 00:25:36,540 We don't need everything. 503 00:25:36,540 --> 00:25:38,410 How many cache misses have in here? 504 00:25:38,410 --> 00:25:45,410 The interesting thing is because these are four bytes, 505 00:25:45,410 --> 00:25:55,830 we have a 64-byte cache line, and each is four bytes. 506 00:25:58,390 --> 00:26:03,980 So that means I can do 16 axises every time I have a 507 00:26:03,980 --> 00:26:06,550 cache miss. 508 00:26:06,550 --> 00:26:13,280 OK, so up to here to here is 2 to the N. My cache misses I do 509 00:26:13,280 --> 00:26:13,960 [UNINTELLIGIBLE] 510 00:26:13,960 --> 00:26:14,690 basically. 511 00:26:14,690 --> 00:26:18,110 My cache misses is basically this much. 512 00:26:21,460 --> 00:26:23,280 Everything else is in the cache, and after the first 513 00:26:23,280 --> 00:26:24,810 line, everything else is in the cache again, 514 00:26:24,810 --> 00:26:27,030 [UNINTELLIGIBLE] nicely, got my working set 515 00:26:27,030 --> 00:26:28,140 to fit in the cache. 516 00:26:28,140 --> 00:26:30,180 I'm really happy. 517 00:26:30,180 --> 00:26:31,580 Everybody see this? 518 00:26:31,580 --> 00:26:34,530 How many people don't see it now? 519 00:26:34,530 --> 00:26:36,010 OK, good. 520 00:26:36,010 --> 00:26:39,380 So let's move to something a little bit more complicated. 521 00:26:39,380 --> 00:26:40,630 What kind of locality do we have? 522 00:26:45,698 --> 00:26:47,570 AUDIENCE: Temporal and spatial. 523 00:26:47,570 --> 00:26:47,830 PROFESSOR: Good. 524 00:26:47,830 --> 00:26:51,560 We have temporal and spatial. 525 00:26:51,560 --> 00:26:55,090 You have spatial because every time you go, you only get the 526 00:26:55,090 --> 00:26:56,060 first line of the cache. 527 00:26:56,060 --> 00:26:58,850 Rest is in the cache. 528 00:26:58,850 --> 00:27:01,720 From that, I got a 16x improvement because I have 529 00:27:01,720 --> 00:27:02,990 temporal locality. 530 00:27:02,990 --> 00:27:06,690 And since I only access when I go back to the data, after 531 00:27:06,690 --> 00:27:12,660 accessing this, since it's already in the cache, I get 532 00:27:12,660 --> 00:27:14,800 spatial locality because of the 16. 533 00:27:14,800 --> 00:27:16,600 16 things are in the cache, I am going through that. 534 00:27:16,600 --> 00:27:18,050 I get spatial locality. 535 00:27:18,050 --> 00:27:22,530 I get temporal locality because next time I access, 536 00:27:22,530 --> 00:27:23,580 it's still in the cache. 537 00:27:23,580 --> 00:27:26,380 I haven't taken anything out of the cache. 538 00:27:26,380 --> 00:27:27,930 OK, what kind of misses? 539 00:27:30,960 --> 00:27:33,950 Should be cold misses again. 540 00:27:33,950 --> 00:27:35,650 Cold misses because the first time I [UNINTELLIGIBLE] 541 00:27:35,650 --> 00:27:35,890 things. 542 00:27:35,890 --> 00:27:36,860 I haven't seen that data. 543 00:27:36,860 --> 00:27:41,460 After I get it, everything is in the cache. 544 00:27:41,460 --> 00:27:44,700 So now, here's interesting case. 545 00:27:44,700 --> 00:27:49,010 Now, I am doing 2 to the power N, where N is great than 14. 546 00:27:52,800 --> 00:27:55,120 Now, what happens? 547 00:27:55,120 --> 00:28:00,920 Now, what happens in our picture here is I am going-- 548 00:28:05,180 --> 00:28:08,220 So as you might cache is somewhere here. 549 00:28:08,220 --> 00:28:13,470 At this point, I fill out the cache. 550 00:28:13,470 --> 00:28:18,010 I still keep going, and minute I access something 551 00:28:18,010 --> 00:28:19,300 [UNINTELLIGIBLE] 552 00:28:19,300 --> 00:28:20,550 cache, what's going to happen? 553 00:28:24,630 --> 00:28:28,650 So I went through accessing 34 kilobyte. 554 00:28:28,650 --> 00:28:30,570 If I access the next one, what happens? 555 00:28:34,420 --> 00:28:38,130 OK, no, no, no shutting down, please. 556 00:28:38,130 --> 00:28:39,380 OK, what happened? 557 00:28:42,730 --> 00:28:45,940 So I am accessing more than I can fit into the cache. 558 00:28:45,940 --> 00:28:48,840 Next time access, what's going to happen? 559 00:28:48,840 --> 00:28:51,638 Anyone want to take a wild guess? 560 00:28:51,638 --> 00:28:52,888 AUDIENCE: [INAUDIBLE] 561 00:28:55,530 --> 00:28:57,450 PROFESSOR: [UNINTELLIGIBLE PHRASE] 562 00:28:57,450 --> 00:28:57,840 before that. 563 00:28:57,840 --> 00:28:58,440 [UNINTELLIGIBLE] 564 00:28:58,440 --> 00:28:59,260 in there. 565 00:28:59,260 --> 00:29:04,410 So what happens is then am I ever going to have any 566 00:29:04,410 --> 00:29:07,120 temporal locality? 567 00:29:07,120 --> 00:29:08,330 No. 568 00:29:08,330 --> 00:29:14,590 OK, so now, first access to each line basically misses 569 00:29:14,590 --> 00:29:18,720 forever because by the time I get back to the 570 00:29:18,720 --> 00:29:19,880 [UNINTELLIGIBLE], it's not there. 571 00:29:19,880 --> 00:29:21,720 So basically, it's gone out of the cache. 572 00:29:21,720 --> 00:29:26,060 So what that means is my total access time is basically every 573 00:29:26,060 --> 00:29:30,720 16th element, I am getting cache miss forever. 574 00:29:30,720 --> 00:29:33,890 So what that means, every 15th of 16, I have a hit. 575 00:29:33,890 --> 00:29:40,070 So what I have is 15 out of 16, I have a cache hit. 576 00:29:40,070 --> 00:29:44,440 1/16th of the time, I have a cache miss forever. 577 00:29:44,440 --> 00:29:45,930 So what kind of locality do I have now? 578 00:29:50,367 --> 00:29:52,250 I have spatial locality. 579 00:29:52,250 --> 00:29:53,520 I don't have any temporal locality. 580 00:29:53,520 --> 00:29:55,910 I don't ever go back to something in the cache again. 581 00:29:55,910 --> 00:29:58,280 So what I have is spatial locality, so what type of 582 00:29:58,280 --> 00:29:59,530 misses now do I have? 583 00:30:03,679 --> 00:30:06,090 AUDIENCE: [INAUDIBLE] 584 00:30:06,090 --> 00:30:08,112 PROFESSOR: Cold, right. 585 00:30:08,112 --> 00:30:09,438 AUDIENCE: Is it like shared misses? 586 00:30:09,438 --> 00:30:09,880 [INAUDIBLE] 587 00:30:09,880 --> 00:30:10,770 PROFESSOR: It's not shared. 588 00:30:10,770 --> 00:30:12,930 When you fill out the cache, you go back to 589 00:30:12,930 --> 00:30:14,080 [UNINTELLIGIBLE]. 590 00:30:14,080 --> 00:30:15,690 You fill the cache. 591 00:30:15,690 --> 00:30:17,828 So what type of miss is that? 592 00:30:17,828 --> 00:30:18,716 AUDIENCE: Capacity? 593 00:30:18,716 --> 00:30:19,770 PROFESSOR: Capacity miss. 594 00:30:19,770 --> 00:30:22,910 OK, so you have basically cold and capacity 595 00:30:22,910 --> 00:30:25,360 misses happening now. 596 00:30:25,360 --> 00:30:25,695 OK? 597 00:30:25,695 --> 00:30:26,518 AUDIENCE: One question. 598 00:30:26,518 --> 00:30:27,982 [INAUDIBLE] 599 00:30:27,982 --> 00:30:31,642 that you multiplied whenever you're trying to load from 600 00:30:31,642 --> 00:30:35,302 memory, is that like an arbitrary number, or is that 601 00:30:35,302 --> 00:30:36,766 the actual cost of going-- 602 00:30:36,766 --> 00:30:39,520 PROFESSOR: Every machine, you can get this beautiful table. 603 00:30:39,520 --> 00:30:42,860 I will go off what your machines have. 604 00:30:42,860 --> 00:30:45,020 We still have a number saying this is your miss 605 00:30:45,020 --> 00:30:46,130 [UNINTELLIGIBLE], this is the thing-- 606 00:30:46,130 --> 00:30:50,120 I mean, so some of these things you realize because of 607 00:30:50,120 --> 00:30:50,780 all of the complexity. 608 00:30:50,780 --> 00:30:53,590 It's not that nice and simple, but this, I just pull out of 609 00:30:53,590 --> 00:30:55,570 hat, and it's kind of normal. 610 00:30:59,860 --> 00:31:02,110 So what do I do, now? 611 00:31:02,110 --> 00:31:06,765 I am doing the mod, but I am multiplying [i] by 16. 612 00:31:10,150 --> 00:31:14,450 So now, I am accessing this value, this value, this value, 613 00:31:14,450 --> 00:31:15,550 this value in here. 614 00:31:15,550 --> 00:31:17,590 I'm accessing this value, this value, this value, this value. 615 00:31:17,590 --> 00:31:22,490 One value in each cache line. 616 00:31:22,490 --> 00:31:23,440 OK? 617 00:31:23,440 --> 00:31:25,530 How much cache misses do you think I'm going to get? 618 00:31:29,895 --> 00:31:31,145 AUDIENCE: [INAUDIBLE] 619 00:31:33,300 --> 00:31:37,250 PROFESSOR: 100 for every access is basically beginning 620 00:31:37,250 --> 00:31:41,980 a cache line, you're taking one value and if I go back to 621 00:31:41,980 --> 00:31:45,230 that value, I have already filled up the cache. 622 00:31:45,230 --> 00:31:49,370 So basically, I have first access in the cache. 623 00:31:49,370 --> 00:31:53,700 And what's your total access time? 624 00:31:53,700 --> 00:31:54,790 Anyone will take a guess? 625 00:31:54,790 --> 00:31:55,996 AUDIENCE: [INAUDIBLE] 626 00:31:55,996 --> 00:31:58,650 PROFESSOR: Yep, 100 times x because I have no-- 627 00:31:58,650 --> 00:32:00,470 OK, so this is a [UNINTELLIGIBLE] 628 00:32:00,470 --> 00:32:01,090 locality. 629 00:32:01,090 --> 00:32:03,150 That was clear. 630 00:32:03,150 --> 00:32:05,143 What kind of misses am I getting now? 631 00:32:05,143 --> 00:32:06,130 AUDIENCE: [INAUDIBLE] 632 00:32:06,130 --> 00:32:07,390 PROFESSOR: Now, I'm getting conflict misses. 633 00:32:07,390 --> 00:32:09,420 I get a cold miss at the beginning, first time around, 634 00:32:09,420 --> 00:32:11,040 and then every time I do something, I'm getting 635 00:32:11,040 --> 00:32:15,870 conflict miss because the thing is, now, this N, I am 636 00:32:15,870 --> 00:32:18,340 only accessing small amount of data. 637 00:32:18,340 --> 00:32:23,090 It doesn't fit in the cache, but I'm not still getting any 638 00:32:23,090 --> 00:32:26,055 kind of sharing, so I'm having conflict misses. 639 00:32:30,910 --> 00:32:32,540 [UNINTELLIGIBLE] second time, I guess I will 640 00:32:32,540 --> 00:32:33,500 probably jump over this. 641 00:32:33,500 --> 00:32:35,980 I just did OK, if I go random access, what happens? 642 00:32:35,980 --> 00:32:37,120 So let me jump over that. 643 00:32:37,120 --> 00:32:39,600 Actually can do a calculation to figure out how many, 644 00:32:39,600 --> 00:32:41,230 probabilistically, how much you might have 645 00:32:41,230 --> 00:32:41,860 and stuff like that. 646 00:32:41,860 --> 00:32:43,990 OK. 647 00:32:43,990 --> 00:32:48,790 So now, if you look at what's going on, when you have no 648 00:32:48,790 --> 00:32:51,280 locality, you'd have no locality. 649 00:32:51,280 --> 00:32:53,410 That's a pretty obvious statement. 650 00:32:53,410 --> 00:32:57,470 And then if you have spatial locality and no temporal 651 00:32:57,470 --> 00:32:59,560 locality, we are streaming data. 652 00:32:59,560 --> 00:33:02,150 That means you are going through the data, but you're 653 00:33:02,150 --> 00:33:04,910 not getting back to it fast enough, so I 654 00:33:04,910 --> 00:33:05,810 have temporal locality. 655 00:33:05,810 --> 00:33:09,060 So I stream through data, so that means I go through data 656 00:33:09,060 --> 00:33:11,170 in a streaming fashion. 657 00:33:11,170 --> 00:33:12,420 OK? 658 00:33:16,820 --> 00:33:21,530 So what we have is if working set fits in cache, you have 659 00:33:21,530 --> 00:33:22,640 InCache mode. 660 00:33:22,640 --> 00:33:25,920 If working set is too big for cache, you can get some 661 00:33:25,920 --> 00:33:28,550 streaming mode and still get some locality because we are 662 00:33:28,550 --> 00:33:30,060 bringing the lines in here. 663 00:33:30,060 --> 00:33:34,800 And you can do other things like optimizers like 664 00:33:34,800 --> 00:33:37,230 prefetching we'll get to. 665 00:33:37,230 --> 00:33:40,060 And other issues [UNINTELLIGIBLE] 666 00:33:40,060 --> 00:33:46,130 this last one, but to deal with cache axises. 667 00:33:46,130 --> 00:33:50,820 So if you have more than one axis-- so here what I have is 668 00:33:50,820 --> 00:33:53,040 too nice arrays. 669 00:33:53,040 --> 00:33:57,860 [UNINTELLIGIBLE] is 2 to the power size array 2 to the 19 670 00:33:57,860 --> 00:33:59,430 power arrays. 671 00:33:59,430 --> 00:34:02,940 OK, so now what happens when I put this in the cache? 672 00:34:02,940 --> 00:34:04,370 Look what happens in here. 673 00:34:04,370 --> 00:34:10,080 So what happens is this array get mapped in this one. 674 00:34:10,080 --> 00:34:13,969 This array get maps this nicely in here. 675 00:34:13,969 --> 00:34:17,580 So what happens is this line can only be in this cache 676 00:34:17,580 --> 00:34:20,820 line, and this line also only can be in 677 00:34:20,820 --> 00:34:24,310 this line of the cache. 678 00:34:24,310 --> 00:34:25,969 Do you see what's going on, now? 679 00:34:25,969 --> 00:34:30,739 So if you do this one, basically every time you 680 00:34:30,739 --> 00:34:32,194 access something-- 681 00:34:32,194 --> 00:34:36,130 See, assume I'm accessing this data item, so I'm doing A, B. 682 00:34:36,130 --> 00:34:38,370 I access A, I get this line. 683 00:34:38,370 --> 00:34:40,317 I access B, what happens? 684 00:34:40,317 --> 00:34:41,510 AUDIENCE: [INAUDIBLE] 685 00:34:41,510 --> 00:34:42,550 PROFESSOR: So this [UNINTELLIGIBLE] is gone. 686 00:34:42,550 --> 00:34:42,920 It is gone. 687 00:34:42,920 --> 00:34:44,600 Now, I am accessing the next one in that 688 00:34:44,600 --> 00:34:47,829 cache line we made. 689 00:34:47,829 --> 00:34:50,757 AUDIENCE: How do you know what cache line-- or how 690 00:34:50,757 --> 00:34:51,733 [INAUDIBLE] 691 00:34:51,733 --> 00:34:52,709 cache? 692 00:34:52,709 --> 00:34:53,699 [INAUDIBLE] 693 00:34:53,699 --> 00:34:57,200 PROFESSOR: Normally, in language like C, if people two 694 00:34:57,200 --> 00:35:00,080 arrays next to each other, they should be mapped next to 695 00:35:00,080 --> 00:35:01,830 each other in memory. 696 00:35:01,830 --> 00:35:04,290 And if you want to be more adventurous, you can go to the 697 00:35:04,290 --> 00:35:08,070 assembly and then see what their locations is. 698 00:35:08,070 --> 00:35:10,650 So if you put A, A next to each other, you normally kind 699 00:35:10,650 --> 00:35:13,000 of get it next to each other. 700 00:35:13,000 --> 00:35:15,540 So what this one is this is pretty bad. 701 00:35:15,540 --> 00:35:19,450 I should have gotten at least some spatial locality. 702 00:35:19,450 --> 00:35:20,585 I'm getting nothing. 703 00:35:20,585 --> 00:35:22,000 Do you see what's going on here? 704 00:35:26,140 --> 00:35:28,035 Do you know what's going on here? 705 00:35:28,035 --> 00:35:30,770 Know why I'm not getting any spatial locality? 706 00:35:30,770 --> 00:35:31,940 Because I am bouncing. 707 00:35:31,940 --> 00:35:34,410 When I access this one, this one is gone even though I have 708 00:35:34,410 --> 00:35:36,960 more data in the same cache line, so I'm kind of bouncing 709 00:35:36,960 --> 00:35:38,390 two cache lines because of that. 710 00:35:38,390 --> 00:35:40,600 What's a good solution for this? 711 00:35:40,600 --> 00:35:41,550 Back there. 712 00:35:41,550 --> 00:35:42,500 AUDIENCE: [INAUDIBLE] 713 00:35:42,500 --> 00:35:47,120 A and B could be mapped to the same cache line. 714 00:35:47,120 --> 00:35:49,330 PROFESSOR: The thing A is a nice 2 to the power. 715 00:35:49,330 --> 00:35:54,422 A size is a multiple of the cache size. 716 00:35:54,422 --> 00:35:54,860 AUDIENCE: [INAUDIBLE] 717 00:35:54,860 --> 00:35:59,140 PROFESSOR: The size of A is a multiple of the cache size. 718 00:35:59,140 --> 00:36:04,480 So what happens is if you have memory in here, you map 719 00:36:04,480 --> 00:36:06,775 something like A, C here. 720 00:36:06,775 --> 00:36:13,960 If this is a multiple of cache line, so assume this is 32K 721 00:36:13,960 --> 00:36:18,450 times some number, and you map B here, and then this is 722 00:36:18,450 --> 00:36:22,590 normally what happens in C. You put all adjacent map 723 00:36:22,590 --> 00:36:25,150 allocations next to each other in memory. 724 00:36:25,150 --> 00:36:27,830 Do you got A here, and you start having B in 725 00:36:27,830 --> 00:36:30,130 here, next to it. 726 00:36:30,130 --> 00:36:34,840 OK, so what happens is now if A starts to assume address, 727 00:36:34,840 --> 00:36:47,710 we'll say 100, this is 100 plus 32K times N, basically. 728 00:36:47,710 --> 00:36:55,680 And you do this mod 32K, this mod 32K, this same number. 729 00:36:55,680 --> 00:36:58,280 It maps into the same line. 730 00:36:58,280 --> 00:37:01,740 OK, so I kind of gave you the problem. 731 00:37:01,740 --> 00:37:03,752 What can be the solution? 732 00:37:03,752 --> 00:37:05,135 AUDIENCE: [INAUDIBLE] 733 00:37:05,135 --> 00:37:06,980 allocate one line. 734 00:37:06,980 --> 00:37:08,280 PROFESSOR: Yes, it's called padding. 735 00:37:12,530 --> 00:37:14,950 OK, so I have no locality in here, let me go to the next 736 00:37:14,950 --> 00:37:15,920 slide, what kind of misses? 737 00:37:15,920 --> 00:37:17,500 I have cold and conflict. 738 00:37:17,500 --> 00:37:19,710 What I can do is just basically add a 739 00:37:19,710 --> 00:37:21,180 little bit of padding. 740 00:37:21,180 --> 00:37:22,320 I did 16 here. 741 00:37:22,320 --> 00:37:26,400 Normally, you do a prime number so it will not clash 742 00:37:26,400 --> 00:37:26,910 with anything. 743 00:37:26,910 --> 00:37:31,350 Add a little bit of padding at the end of array, and that 744 00:37:31,350 --> 00:37:33,230 means the next array will not be 745 00:37:33,230 --> 00:37:35,310 conflicted most of the time. 746 00:37:35,310 --> 00:37:37,710 So a lot of times when you declare things next to each 747 00:37:37,710 --> 00:37:39,630 other, it's always good to add a little bit of 748 00:37:39,630 --> 00:37:41,450 a padding in between. 749 00:37:41,450 --> 00:37:46,010 Normally, a smaller prime number would be a good thing 750 00:37:46,010 --> 00:37:47,260 to add a padding. 751 00:37:50,620 --> 00:37:54,180 Now, I start getting back my nice locality because the two 752 00:37:54,180 --> 00:37:57,460 things are mapping the two cache lines. 753 00:37:57,460 --> 00:37:58,920 What type of locality do I get here? 754 00:38:05,960 --> 00:38:07,470 Anybody want to take any guess what type of 755 00:38:07,470 --> 00:38:08,720 locality I have here? 756 00:38:11,816 --> 00:38:13,220 AUDIENCE: Question, sorry. 757 00:38:13,220 --> 00:38:14,954 [INAUDIBLE] 758 00:38:14,954 --> 00:38:16,942 you actually added just an extra 16. 759 00:38:16,942 --> 00:38:17,439 PROFESSOR: Yes. 760 00:38:17,439 --> 00:38:21,420 AUDIENCE: Why would that just make the A and B [INAUDIBLE]? 761 00:38:21,420 --> 00:38:24,640 PROFESSOR: Because what happens is normally it doesn't 762 00:38:24,640 --> 00:38:29,830 make A and B into leaving the cache, but it makes A[0] 763 00:38:29,830 --> 00:38:33,800 not exactly matching to B[0]. 764 00:38:33,800 --> 00:38:35,200 A[0] 765 00:38:35,200 --> 00:38:39,670 will interleave with something like B[16] 766 00:38:39,670 --> 00:38:40,670 or something. 767 00:38:40,670 --> 00:38:43,232 This thing might map into the same place. 768 00:38:43,232 --> 00:38:47,480 AUDIENCE: I mean, but I [INAUDIBLE]. 769 00:38:47,480 --> 00:38:49,220 PROFESSOR: Yes, normally in a computation, 770 00:38:49,220 --> 00:38:50,120 people do that, A[i] 771 00:38:50,120 --> 00:38:50,670 equals B[i] 772 00:38:50,670 --> 00:38:52,246 plus something. 773 00:38:52,246 --> 00:38:53,725 AUDIENCE: Oh, OK, so [INAUDIBLE]. 774 00:38:57,669 --> 00:38:59,312 Like I understand what you're saying, but I don't understand 775 00:38:59,312 --> 00:39:00,627 how that [INAUDIBLE] 776 00:39:00,627 --> 00:39:03,585 because they're both accessing [i] at the same time. 777 00:39:03,585 --> 00:39:04,835 Why would they [INAUDIBLE]? 778 00:39:09,980 --> 00:39:13,290 PROFESSOR: So here's the problem. 779 00:39:13,290 --> 00:39:14,840 Assume I have [UNINTELLIGIBLE] 780 00:39:14,840 --> 00:39:16,710 32K. 781 00:39:16,710 --> 00:39:18,180 I'm assuming [UNINTELLIGIBLE]. 782 00:39:18,180 --> 00:39:25,600 OK, so first one is, we'll say, you start with 100, 100 783 00:39:25,600 --> 00:39:36,400 plus [i], and the other one is 100 plus 32K plus [i]. 784 00:39:36,400 --> 00:39:46,250 OK, then you take a mod 32K, and this end up 785 00:39:46,250 --> 00:39:50,740 being 100 plus [i] 786 00:39:50,740 --> 00:39:53,110 with another 100 plus [i]. 787 00:39:53,110 --> 00:39:55,760 So this is basically [UNINTELLIGIBLE]. 788 00:39:55,760 --> 00:39:58,950 You still have one or one element in the cache it's all 789 00:39:58,950 --> 00:40:01,890 going to map into. 790 00:40:01,890 --> 00:40:03,390 You see that? 791 00:40:03,390 --> 00:40:11,556 But now, if I add 16 more to this one, then this map into 792 00:40:11,556 --> 00:40:15,816 116 plus [i]. 793 00:40:15,816 --> 00:40:18,286 AUDIENCE: I don't see that reflected in the code. 794 00:40:18,286 --> 00:40:21,250 You just [INAUDIBLE], so how is B different 795 00:40:21,250 --> 00:40:22,240 from A in your code? 796 00:40:22,240 --> 00:40:24,450 PROFESSOR: Because [UNINTELLIGIBLE] my S. I added 797 00:40:24,450 --> 00:40:28,488 16 to S, so my A is a little bit bigger, now. 798 00:40:28,488 --> 00:40:30,280 AUDIENCE: But isn't B also [INAUDIBLE]? 799 00:40:30,280 --> 00:40:32,240 PROFESSOR: Yeah, B's also bigger here. 800 00:40:32,240 --> 00:40:32,870 I don't care. 801 00:40:32,870 --> 00:40:34,110 B goes down here. 802 00:40:34,110 --> 00:40:37,780 B, I add a little bit more to the B. It doesn't matter. 803 00:40:37,780 --> 00:40:40,010 I padded B, too. 804 00:40:40,010 --> 00:40:41,790 So what that means is the first A[i] 805 00:40:41,790 --> 00:40:45,282 and B[i] are not in the same cache line. 806 00:40:45,282 --> 00:40:47,505 AUDIENCE: How do we allocate A and B right 807 00:40:47,505 --> 00:40:48,740 next to each other? 808 00:40:48,740 --> 00:40:57,270 PROFESSOR: So normally in C, if you declare something like 809 00:40:57,270 --> 00:41:08,010 int A [100], int B [100], more or less, they will be 810 00:41:08,010 --> 00:41:09,700 allocated next to each other. 811 00:41:09,700 --> 00:41:12,620 If you look at the assembly listing, it'll say code 812 00:41:12,620 --> 00:41:15,640 allocated next to each other in memory. 813 00:41:15,640 --> 00:41:19,480 Even in the stack, if you do that, if you allocate things, 814 00:41:19,480 --> 00:41:24,265 compiler has no incentive to go and move things around. 815 00:41:24,265 --> 00:41:27,850 AUDIENCE: This is only for global and [INAUDIBLE]? 816 00:41:27,850 --> 00:41:30,172 PROFESSOR: Global variable, local variables. 817 00:41:30,172 --> 00:41:32,860 AUDIENCE: Does any of it still apply if you use malloc? 818 00:41:38,520 --> 00:41:39,770 PROFESSOR: The thing about malloc is-- 819 00:41:43,164 --> 00:41:44,150 AUDIENCE: [INAUDIBLE] 820 00:41:44,150 --> 00:41:45,612 PROFESSOR: It might do something [UNINTELLIGIBLE], 821 00:41:45,612 --> 00:41:48,760 but on the other hand, if it is fitting on the same page, 822 00:41:48,760 --> 00:41:50,460 it might also do something like that, too. 823 00:41:50,460 --> 00:41:53,120 So it depends on how we do that. 824 00:41:53,120 --> 00:41:56,400 At the beginning, if you keep mallocing things, it might be 825 00:41:56,400 --> 00:41:56,870 [UNINTELLIGIBLE]. 826 00:41:56,870 --> 00:42:00,660 They might be a little bit off because malloc put some 827 00:42:00,660 --> 00:42:02,920 metadata with each status. 828 00:42:02,920 --> 00:42:04,290 If you ask for [UNINTELLIGIBLE], you're not 829 00:42:04,290 --> 00:42:05,260 getting exactly 100. 830 00:42:05,260 --> 00:42:06,740 You're getting a little bit of a bigger size. 831 00:42:06,740 --> 00:42:09,930 But you might have some configuring to do, but after 832 00:42:09,930 --> 00:42:12,710 some time, when you have done a lot of [UNINTELLIGIBLE], we 833 00:42:12,710 --> 00:42:13,810 have no idea where it's going to go. 834 00:42:13,810 --> 00:42:15,290 It's random. 835 00:42:15,290 --> 00:42:17,530 Anymore questions? 836 00:42:17,530 --> 00:42:20,940 OK, good, that is a good question. 837 00:42:20,940 --> 00:42:22,790 So what kind of locality do I have here? 838 00:42:29,030 --> 00:42:29,860 [UNINTELLIGIBLE] 839 00:42:29,860 --> 00:42:37,130 two lines, I am accessing each, and I got 16 times 15 840 00:42:37,130 --> 00:42:38,380 out of 16 hits. 841 00:42:41,570 --> 00:42:42,650 What's that? 842 00:42:42,650 --> 00:42:43,540 AUDIENCE: Spatial? 843 00:42:43,540 --> 00:42:45,400 PROFESSOR: Spatial locality. 844 00:42:45,400 --> 00:42:47,990 And of course, if you have spatial locality, you get cold 845 00:42:47,990 --> 00:42:48,850 misses, basically. 846 00:42:48,850 --> 00:42:51,170 And I think [UNINTELLIGIBLE], you get what 847 00:42:51,170 --> 00:42:54,605 other type of misses? 848 00:42:54,605 --> 00:42:56,569 AUDIENCE: [INAUDIBLE] 849 00:42:56,569 --> 00:42:57,551 PROFESSOR: [UNINTELLIGIBLE] conflict here. 850 00:42:57,551 --> 00:42:58,540 Basically, capacity. 851 00:42:58,540 --> 00:43:01,260 Capacity miss, exactly. 852 00:43:01,260 --> 00:43:08,260 So in order to avoid this, what we have done is we make 853 00:43:08,260 --> 00:43:10,940 cache-- instead of mapping into one place before, and 854 00:43:10,940 --> 00:43:14,800 what we said was we had this memory, and 855 00:43:14,800 --> 00:43:16,160 assume we have 32 sets. 856 00:43:16,160 --> 00:43:20,270 And if you have a cache like this, we make this entire map 857 00:43:20,270 --> 00:43:24,100 into this one, and this map into this one. 858 00:43:24,100 --> 00:43:27,250 What people have done is instead of doing that, what 859 00:43:27,250 --> 00:43:30,500 you can make is make the cache, instead of one big 860 00:43:30,500 --> 00:43:34,436 block, two small blocks. 861 00:43:34,436 --> 00:43:40,890 So what that means is this can happen to this one. 862 00:43:43,670 --> 00:43:46,150 My drawings are not that great, but I 863 00:43:46,150 --> 00:43:47,000 hope you get the picture. 864 00:43:47,000 --> 00:43:50,310 This map into this one, so what that means is each 865 00:43:50,310 --> 00:43:55,630 reference here, so maps here, and these 866 00:43:55,630 --> 00:43:56,940 two basically conflicts. 867 00:43:56,940 --> 00:43:58,690 But now with that [UNINTELLIGIBLE] 868 00:43:58,690 --> 00:44:01,610 too, I can put this one here or here. 869 00:44:01,610 --> 00:44:05,030 I have two places to put the value, basically, and that 870 00:44:05,030 --> 00:44:07,480 [UNINTELLIGIBLE]. 871 00:44:07,480 --> 00:44:11,260 OK, so that means if I access something conflicting, 872 00:44:11,260 --> 00:44:14,960 [UNINTELLIGIBLE], you cannot just two things. 873 00:44:14,960 --> 00:44:17,860 Still, there are places in the cache for them to go. 874 00:44:17,860 --> 00:44:20,510 And one way to do people is fully associative cache. 875 00:44:20,510 --> 00:44:23,550 It can go anywhere, but that's very hard to do in hardware, 876 00:44:23,550 --> 00:44:26,560 but you can do some small number of ways. 877 00:44:26,560 --> 00:44:30,480 So in here, if you do two basic associative cache, in 878 00:44:30,480 --> 00:44:35,610 the first axis in here, even though these are conflicted, 879 00:44:35,610 --> 00:44:37,090 on can go to one [UNINTELLIGIBLE] set, another 880 00:44:37,090 --> 00:44:38,340 one can go to the other set. 881 00:44:41,070 --> 00:44:44,640 OK, so two associative cache [UNINTELLIGIBLE]. 882 00:44:44,640 --> 00:44:48,780 So total access time, again, I have nice locality in here, 883 00:44:48,780 --> 00:44:53,570 spatial locality, and I have both cold and, I guess, 884 00:44:53,570 --> 00:44:56,490 capacity misses. 885 00:44:56,490 --> 00:44:58,740 But if you have two associative cache, what's the 886 00:44:58,740 --> 00:44:59,890 next problem? 887 00:44:59,890 --> 00:45:01,310 How about if you have three different things? 888 00:45:04,280 --> 00:45:06,070 So we have two different things like that. 889 00:45:06,070 --> 00:45:07,730 OK, now that I only have two associated [UNINTELLIGIBLE], 890 00:45:07,730 --> 00:45:10,630 it doesn't work because, in a [UNINTELLIGIBLE], I will 891 00:45:10,630 --> 00:45:13,290 replace the last thing, and you might still end up with 892 00:45:13,290 --> 00:45:15,830 nothing in here because you are replacing nothing, and the 893 00:45:15,830 --> 00:45:17,290 next thing replaced the other thing. 894 00:45:17,290 --> 00:45:20,450 And basically after you gave B, C, then you go 895 00:45:20,450 --> 00:45:21,450 to A. A Is not there. 896 00:45:21,450 --> 00:45:23,040 When you go to B, B is not there, C is not there. 897 00:45:23,040 --> 00:45:26,560 You kind of go back to the same problem in here, and then 898 00:45:26,560 --> 00:45:28,285 basically have to access everything again. 899 00:45:33,570 --> 00:45:35,750 There is some spatial locality but cache can't use it. 900 00:45:35,750 --> 00:45:39,190 Basically, you have conflict misses. 901 00:45:39,190 --> 00:45:40,640 OK, so you see that? 902 00:45:40,640 --> 00:45:44,410 So associative is good because normally people assume you 903 00:45:44,410 --> 00:45:46,080 don't need that much 904 00:45:46,080 --> 00:45:49,590 associativity, but it's limited. 905 00:45:49,590 --> 00:45:51,430 So if you have too many things, you might run into 906 00:45:51,430 --> 00:45:54,416 this problem again. 907 00:45:54,416 --> 00:45:55,666 AUDIENCE: [INAUDIBLE] 908 00:46:02,320 --> 00:46:03,308 PROFESSOR: You mean capacity? 909 00:46:03,308 --> 00:46:04,790 AUDIENCE: Yeah, [INAUDIBLE] capacity. 910 00:46:04,790 --> 00:46:05,778 PROFESSOR: Less. 911 00:46:05,778 --> 00:46:07,754 AUDIENCE: [INAUDIBLE] 912 00:46:07,754 --> 00:46:10,718 elements in a cache before you pick [INAUDIBLE] because it 913 00:46:10,718 --> 00:46:14,670 looks as though the matrix is is using half 914 00:46:14,670 --> 00:46:15,658 the cache by itself. 915 00:46:15,658 --> 00:46:17,510 PROFESSOR: Because it's associated, you mean? 916 00:46:17,510 --> 00:46:18,480 You have multiple there? 917 00:46:18,480 --> 00:46:21,030 I mean, the thing is since it's associated, no, because 918 00:46:21,030 --> 00:46:22,050 you have enough room. 919 00:46:22,050 --> 00:46:23,260 I mean, if you're going [UNINTELLIGIBLE] 920 00:46:23,260 --> 00:46:25,360 everything, you have two places to put the 921 00:46:25,360 --> 00:46:28,730 name because it maps. 922 00:46:28,730 --> 00:46:29,950 That's a good question. 923 00:46:29,950 --> 00:46:35,600 So what happens is now, assume this many things fit into the 924 00:46:35,600 --> 00:46:37,310 cache before. 925 00:46:37,310 --> 00:46:40,610 Now, cache has two banks, and in each bank, only this many 926 00:46:40,610 --> 00:46:42,095 things fit in, but the next thing can go 927 00:46:42,095 --> 00:46:43,080 to the other bank. 928 00:46:43,080 --> 00:46:44,550 So if you go through [UNINTELLIGIBLE], you have the 929 00:46:44,550 --> 00:46:47,620 same behavior even if you have associativity. 930 00:46:47,620 --> 00:46:48,890 OK, you can go and-- 931 00:46:48,890 --> 00:46:52,240 because if the cache is the same size, the size is what 932 00:46:52,240 --> 00:46:54,767 matters for if you are going through [UNINTELLIGIBLE] axis. 933 00:46:54,767 --> 00:46:56,552 AUDIENCE: So far, [INAUDIBLE] machines that we're working 934 00:46:56,552 --> 00:47:00,773 on, do they have a fixed kind of cache, or 935 00:47:00,773 --> 00:47:01,585 can we decide like-- 936 00:47:01,585 --> 00:47:02,940 PROFESSOR: The fixed kind of cache. 937 00:47:02,940 --> 00:47:04,800 It's built into the hardware. 938 00:47:04,800 --> 00:47:06,650 I will show the table to say what type of 939 00:47:06,650 --> 00:47:09,346 associativity is in there. 940 00:47:09,346 --> 00:47:12,818 AUDIENCE: Is there not a way to say [INAUDIBLE] 941 00:47:12,818 --> 00:47:14,306 three-way associative? 942 00:47:14,306 --> 00:47:16,786 Can you have something on top of it so that it looks like 943 00:47:16,786 --> 00:47:18,770 it's two-way associative or something? 944 00:47:18,770 --> 00:47:22,790 PROFESSOR: I mean, you can do things like that in software, 945 00:47:22,790 --> 00:47:23,810 but they're very expensive. 946 00:47:23,810 --> 00:47:26,960 I mean, think about cache is trying to fool the cache in 947 00:47:26,960 --> 00:47:30,710 that direction cannot be that easy because they are like 948 00:47:30,710 --> 00:47:32,470 really hard-baked into hardware. that's a good 949 00:47:32,470 --> 00:47:35,650 question, but there are things you can do, not exactly 950 00:47:35,650 --> 00:47:38,690 modeling something, but kind of putting offsets and stuff 951 00:47:38,690 --> 00:47:41,800 like that so it doesn't get into these kind of conflicts. 952 00:47:46,140 --> 00:47:47,520 So yeah, I will answer this thing. 953 00:47:47,520 --> 00:47:48,340 Why don't you have more? 954 00:47:48,340 --> 00:47:51,950 Because if you go back to your [UNINTELLIGIBLE] 955 00:47:51,950 --> 00:47:52,530 lectures. 956 00:47:52,530 --> 00:47:54,720 Perhaps it might have explanation in it's hardware 957 00:47:54,720 --> 00:47:56,590 why it's much harder to build. 958 00:47:56,590 --> 00:48:00,730 And if you have the linked lists now-- 959 00:48:00,730 --> 00:48:06,674 so assume I have linked lists we are going through in here. 960 00:48:06,674 --> 00:48:12,450 And what's my [UNINTELLIGIBLE]? 961 00:48:12,450 --> 00:48:15,220 Basically, depending on allocation and using some 962 00:48:15,220 --> 00:48:17,980 things, so best case is everything is in cache, so I 963 00:48:17,980 --> 00:48:19,400 have a small list. 964 00:48:19,400 --> 00:48:22,760 And I go through again and again, round and around my 965 00:48:22,760 --> 00:48:26,280 list, everything's in cache and [UNINTELLIGIBLE] 966 00:48:26,280 --> 00:48:29,350 S. That's really great. 967 00:48:29,350 --> 00:48:32,130 Next best is everything is adjacent. 968 00:48:32,130 --> 00:48:34,830 OK, everything might not be in the cache, but everything at 969 00:48:34,830 --> 00:48:35,760 least in adjacent memory. 970 00:48:35,760 --> 00:48:40,120 So basically, first time I allocate my malloc did a nice 971 00:48:40,120 --> 00:48:45,480 thing for me, and at that time, I basically get a nice 972 00:48:45,480 --> 00:48:49,510 basic spatial locality since my size is 16 in here. 973 00:48:49,510 --> 00:48:51,180 So I get this much. 974 00:48:51,180 --> 00:48:53,390 And the worst case is random. 975 00:48:53,390 --> 00:48:55,320 Malloc puts it all over the place. 976 00:48:55,320 --> 00:48:55,660 Question? 977 00:48:55,660 --> 00:48:57,127 AUDIENCE: [INAUDIBLE] 978 00:48:57,127 --> 00:49:01,283 cache, does the thing behind it [UNINTELLIGIBLE] in front 979 00:49:01,283 --> 00:49:02,506 of it [UNINTELLIGIBLE] also going to? 980 00:49:02,506 --> 00:49:06,390 Do those both have a spatial locality? 981 00:49:06,390 --> 00:49:16,290 PROFESSOR: What happens is normally, we assume 64 element 982 00:49:16,290 --> 00:49:24,873 cache, so it's at that boundary, so basically 2 to 983 00:49:24,873 --> 00:49:26,640 the 8 boundaries is what happens. 984 00:49:26,640 --> 00:49:33,830 So what happens is cache line begins at 0 address, 64 985 00:49:33,830 --> 00:49:34,310 address here. 986 00:49:34,310 --> 00:49:38,570 So if you get 63, that means this value is not in the 987 00:49:38,570 --> 00:49:42,120 cache, so you get this cache line. 988 00:49:42,120 --> 00:49:43,680 So the cache line doesn't shift. 989 00:49:43,680 --> 00:49:48,840 It's fixed chunks in here, so that if you get address 0, you 990 00:49:48,840 --> 00:49:50,630 get this one, and all these things. 991 00:49:50,630 --> 00:49:52,580 If you get something in the middle, you'll get the left 992 00:49:52,580 --> 00:49:54,300 and right [UNINTELLIGIBLE], but if you get here, you only 993 00:49:54,300 --> 00:49:55,790 get the right. 994 00:49:55,790 --> 00:49:58,550 So this basically 0 and 64, they are boundaries 995 00:49:58,550 --> 00:49:58,820 [INAUDIBLE]. 996 00:49:58,820 --> 00:50:00,430 If you like 65, that's it. 997 00:50:00,430 --> 00:50:03,640 But that said, if you're accessing [UNINTELLIGIBLE], 998 00:50:03,640 --> 00:50:05,790 that's this thing called prefetcher that say you are 999 00:50:05,790 --> 00:50:08,070 going through linear pattern, and you might need that next, 1000 00:50:08,070 --> 00:50:10,120 and it'll bring this up for you. 1001 00:50:10,120 --> 00:50:15,470 And what you find, I'll show later, that in fact, the 1002 00:50:15,470 --> 00:50:19,110 machine you're using has a really powerful prefetcher, so 1003 00:50:19,110 --> 00:50:20,270 it can hide a lot of these things. 1004 00:50:20,270 --> 00:50:21,520 So we'll get to that. 1005 00:50:25,420 --> 00:50:27,940 So this is my total access time in here. 1006 00:50:27,940 --> 00:50:28,820 Random case. 1007 00:50:28,820 --> 00:50:31,670 Random case is basically, every time I access something, 1008 00:50:31,670 --> 00:50:32,920 it's not in the cache. 1009 00:50:38,340 --> 00:50:42,720 And then if you have a bigger struct in here-- 1010 00:50:42,720 --> 00:50:45,690 but assume I have this big struct in here, but I am only 1011 00:50:45,690 --> 00:50:50,830 accessing one element of this struct in here. 1012 00:50:50,830 --> 00:50:51,400 I do a lot. 1013 00:50:51,400 --> 00:50:55,910 I created this large struct, and I am [UNINTELLIGIBLE] 1014 00:50:55,910 --> 00:50:58,120 things in the struct, and I just basically 1015 00:50:58,120 --> 00:51:00,750 looking at my data. 1016 00:51:00,750 --> 00:51:02,310 Only one element. 1017 00:51:02,310 --> 00:51:03,560 What's the problem here? 1018 00:51:18,420 --> 00:51:19,670 AUDIENCE: [INAUDIBLE] 1019 00:51:22,340 --> 00:51:25,180 PROFESSOR: Exactly, so I bring the entire cache line, but I'm 1020 00:51:25,180 --> 00:51:27,220 only using a little bit of it. 1021 00:51:27,220 --> 00:51:30,120 Because I bring all these things into the cache, it 1022 00:51:30,120 --> 00:51:33,260 comes with axis and that, but I'm only using data. 1023 00:51:33,260 --> 00:51:36,720 So basically, what happens is, if everything in the cache, I 1024 00:51:36,720 --> 00:51:38,440 have access time S. That's great. 1025 00:51:38,440 --> 00:51:40,990 That's not a problem, [UNINTELLIGIBLE] 1026 00:51:40,990 --> 00:51:42,240 streaming. 1027 00:51:44,990 --> 00:51:47,920 Even though I'm getting spatial locality, I only get 1028 00:51:47,920 --> 00:51:50,810 half of it because even though I'm bringing the cache line, 1029 00:51:50,810 --> 00:51:54,630 only two have datas because this is 32 bytes long. 1030 00:51:54,630 --> 00:51:57,150 It's in my cache. 1031 00:51:57,150 --> 00:52:01,910 Even though I am bringing all this data, I only two of them 1032 00:52:01,910 --> 00:52:04,600 in the cache because rest of this other data is kind of 1033 00:52:04,600 --> 00:52:07,370 hanging in there with me, and so I don't get that. 1034 00:52:07,370 --> 00:52:10,610 And random, of course, you have no locality. 1035 00:52:10,610 --> 00:52:13,320 So how can I basically make this better? 1036 00:52:16,010 --> 00:52:17,010 How can I make this better? 1037 00:52:17,010 --> 00:52:19,520 OK, I just showed it very fast. 1038 00:52:19,520 --> 00:52:20,263 OK. 1039 00:52:20,263 --> 00:52:21,513 AUDIENCE: [INAUDIBLE] 1040 00:52:23,510 --> 00:52:26,350 PROFESSOR: So instead, if I ask you, I have this large 1041 00:52:26,350 --> 00:52:29,560 data structure, but most of time, I'm only accessing a 1042 00:52:29,560 --> 00:52:36,040 small part of it, this is kind of a little bit of a hack. 1043 00:52:36,040 --> 00:52:40,370 I mean, this is kind of going against, I guess 6005 kind of 1044 00:52:40,370 --> 00:52:42,120 building, but this actually works. 1045 00:52:42,120 --> 00:52:45,700 What you can do is, you can go back and instead of doing 1046 00:52:45,700 --> 00:52:48,010 structs, what you can do is you can do arrays 1047 00:52:48,010 --> 00:52:51,040 of each data item. 1048 00:52:51,040 --> 00:52:53,750 And now what happens is that I am basically bringing these, 1049 00:52:53,750 --> 00:52:56,050 and then the data I am accessing is basically right 1050 00:52:56,050 --> 00:52:57,300 now next to each other. 1051 00:52:59,700 --> 00:53:02,660 So you might have this very large data structure 1052 00:53:02,660 --> 00:53:04,880 representing something you have a lot of, then. 1053 00:53:04,880 --> 00:53:07,720 But most of the time, you're only using few of them, so you 1054 00:53:07,720 --> 00:53:09,050 don't have to be structs of arrays. 1055 00:53:09,050 --> 00:53:11,930 You might have couple of different data structures. 1056 00:53:11,930 --> 00:53:15,370 One is the one that you access regularly hopefully you get 1057 00:53:15,370 --> 00:53:16,460 them next to each other. 1058 00:53:16,460 --> 00:53:18,635 And the nice thing about arrays is it guarantees things 1059 00:53:18,635 --> 00:53:19,220 are next to each other. 1060 00:53:19,220 --> 00:53:21,144 Of course, was managing it is harder. 1061 00:53:21,144 --> 00:53:22,394 AUDIENCE: [UNINTELLIGIBLE] 1062 00:53:26,358 --> 00:53:27,275 PROFESSOR: You can do both. 1063 00:53:27,275 --> 00:53:30,260 So that means instead of doing the structs, you use arrays of 1064 00:53:30,260 --> 00:53:32,440 each field. 1065 00:53:32,440 --> 00:53:33,405 AUDIENCE: [INAUDIBLE] 1066 00:53:33,405 --> 00:53:34,495 PROFESSOR: Instead of. 1067 00:53:34,495 --> 00:53:37,249 So instead of doing this one, you just do this one. 1068 00:53:37,249 --> 00:53:38,247 AUDIENCE: [INAUDIBLE] 1069 00:53:38,247 --> 00:53:41,740 instead of being able to pass stuff over. 1070 00:53:41,740 --> 00:53:45,013 Now you have to kind of be able to know [UNINTELLIGIBLE], 1071 00:53:45,013 --> 00:53:48,288 so what that is where because you cannot just pass the 1072 00:53:48,288 --> 00:53:49,370 struct [UNINTELLIGIBLE]. 1073 00:53:49,370 --> 00:53:50,410 PROFESSOR: No, you can't pass the struct. 1074 00:53:50,410 --> 00:53:52,500 You basically have to pad the index in here for this 1075 00:53:52,500 --> 00:53:53,500 structure, basically. 1076 00:53:53,500 --> 00:53:54,660 So you have to know something in there. 1077 00:53:54,660 --> 00:53:58,870 So this is a lot messier, but it can actually give you a 1078 00:53:58,870 --> 00:54:00,990 performance if you do this type of access, so it's not 1079 00:54:00,990 --> 00:54:01,920 something you want everywhere. 1080 00:54:01,920 --> 00:54:06,130 That's why you still want to use structures and nice 1081 00:54:06,130 --> 00:54:06,830 classes and stuff. 1082 00:54:06,830 --> 00:54:11,580 But this, if you're doing very large data, but most of the 1083 00:54:11,580 --> 00:54:13,910 you're accessing only small amount, this kind of 1084 00:54:13,910 --> 00:54:17,750 representation can help. 1085 00:54:17,750 --> 00:54:20,850 So I want to get a little bit into matrix multiplying. 1086 00:54:20,850 --> 00:54:26,420 So what we're looking at is this four by four matrix, and 1087 00:54:26,420 --> 00:54:29,730 you can represent it using this representation. 1088 00:54:29,730 --> 00:54:33,150 But to make it clear, what I'm going to do is I'm just going 1089 00:54:33,150 --> 00:54:35,650 to allocate this entire thing as a single array, and I'm 1090 00:54:35,650 --> 00:54:37,100 going to do that [UNINTELLIGIBLE] calculation 1091 00:54:37,100 --> 00:54:43,080 myself because if you go a little bit down, you realize 1092 00:54:43,080 --> 00:54:46,910 this is allocated in a little bit of a weird way. 1093 00:54:46,910 --> 00:54:52,040 And so this is what we call row-major because all rows are 1094 00:54:52,040 --> 00:54:54,670 next to each other, and you can access allocated 1095 00:54:54,670 --> 00:54:55,870 column-major. 1096 00:54:55,870 --> 00:55:00,790 Basically that means now A001, instead of that, 1, 0 is next 1097 00:55:00,790 --> 00:55:04,180 to 0, 0, 1, 0, 2, 0. 1098 00:55:04,180 --> 00:55:06,860 And then basically saying same thing, [UNINTELLIGIBLE] 1099 00:55:06,860 --> 00:55:10,230 instead of saying 4i plus j, it's just 4j plus i. 1100 00:55:10,230 --> 00:55:12,340 You can do both. 1101 00:55:12,340 --> 00:55:18,460 So this is my nice matrix multiply function, because the 1102 00:55:18,460 --> 00:55:21,240 reason I did it this way, it's just very clear how many axis 1103 00:55:21,240 --> 00:55:23,810 I'm going to use. 1104 00:55:23,810 --> 00:55:28,920 So if you look at C, let's look at only the inner loop. 1105 00:55:28,920 --> 00:55:30,780 Most of the mapping matters in the inner loop. 1106 00:55:30,780 --> 00:55:35,820 I go for k in 0 to size. 1107 00:55:35,820 --> 00:55:40,830 Only first axis on C is going to miss because it's going up 1108 00:55:40,830 --> 00:55:41,060 [UNINTELLIGIBLE] 1109 00:55:41,060 --> 00:55:44,050 same K, da-da-da-da-da, so C is very nice in here. 1110 00:55:44,050 --> 00:55:45,330 Only get one miss in here. 1111 00:55:45,330 --> 00:55:46,273 You have a question? 1112 00:55:46,273 --> 00:55:47,120 OK. 1113 00:55:47,120 --> 00:55:51,680 A has a streaming pattern because every time I update K, 1114 00:55:51,680 --> 00:55:53,360 I go to the next element, next element, next 1115 00:55:53,360 --> 00:55:55,360 element, next element. 1116 00:55:55,360 --> 00:55:57,090 [INAUDIBLE]. 1117 00:55:57,090 --> 00:55:59,828 What type of a pattern do I have in B? 1118 00:55:59,828 --> 00:56:01,510 AUDIENCE: Looks like [INAUDIBLE]. 1119 00:56:01,510 --> 00:56:01,970 PROFESSOR: Yeah, exactly. 1120 00:56:01,970 --> 00:56:03,270 I have jumping on memory. 1121 00:56:03,270 --> 00:56:04,510 I'm jumping one to another. 1122 00:56:04,510 --> 00:56:07,590 So if you look at that, I am going through A like that, and 1123 00:56:07,590 --> 00:56:11,730 B, and going through C element at a time, basically. 1124 00:56:11,730 --> 00:56:15,360 So if you look at doing this in order to calculate one 1125 00:56:15,360 --> 00:56:18,710 element of A, [INAUDIBLE] calculating these, what you 1126 00:56:18,710 --> 00:56:21,850 have to do is I am getting B and C this way. 1127 00:56:21,850 --> 00:56:24,820 I'm going through row here and column, and this row 1128 00:56:24,820 --> 00:56:25,500 [UNINTELLIGIBLE] 1129 00:56:25,500 --> 00:56:27,510 fit into the cache line through that because this is 1130 00:56:27,510 --> 00:56:28,230 the memory. 1131 00:56:28,230 --> 00:56:31,970 But C is just jumping all over the memory, and I have no 1132 00:56:31,970 --> 00:56:34,200 locality in here. 1133 00:56:34,200 --> 00:56:36,490 So what can we do here? 1134 00:56:36,490 --> 00:56:38,290 What's the simplest thing in here? 1135 00:56:38,290 --> 00:56:40,069 How do we get some locality into the C? 1136 00:56:46,166 --> 00:56:48,360 AUDIENCE: [INAUDIBLE] 1137 00:56:48,360 --> 00:56:51,050 PROFESSOR: So yeah, you're coming, you're getting it. 1138 00:56:51,050 --> 00:56:53,170 That's the next thing that's most important. 1139 00:56:53,170 --> 00:56:54,460 That's the more advanced thing you can do. 1140 00:56:54,460 --> 00:56:55,280 We'll get there. 1141 00:56:55,280 --> 00:56:58,298 What was a simple thing you can do for C? 1142 00:56:58,298 --> 00:56:59,130 AUDIENCE: [INAUDIBLE] 1143 00:56:59,130 --> 00:57:00,380 PROFESSOR: [UNINTELLIGIBLE] 1144 00:57:02,540 --> 00:57:03,630 So here's the [UNINTELLIGIBLE] 1145 00:57:03,630 --> 00:57:07,440 execution time, original, and voila, when you transpose 1146 00:57:07,440 --> 00:57:10,830 that, you can actually get a huge benefit in here because, 1147 00:57:10,830 --> 00:57:14,950 now, I get spatial locality in both sides in here. 1148 00:57:14,950 --> 00:57:16,570 And so now when you're transposed, I am going 1149 00:57:16,570 --> 00:57:18,630 through like this. 1150 00:57:18,630 --> 00:57:22,500 So now what I did was I am giving this CPI, and what 1151 00:57:22,500 --> 00:57:24,680 happens is when you transpose, I have a 5x 1152 00:57:24,680 --> 00:57:26,020 improvement in sync CPI. 1153 00:57:26,020 --> 00:57:28,570 That means instructions actually not waiting, not 1154 00:57:28,570 --> 00:57:29,460 doing nothing. 1155 00:57:29,460 --> 00:57:34,900 I have a pretty good reduce of cache misread. 1156 00:57:34,900 --> 00:57:36,340 because, now, I'm actually going 1157 00:57:36,340 --> 00:57:38,640 through cache very nicely. 1158 00:57:38,640 --> 00:57:41,090 So the next thing we can do is, what she just 1159 00:57:41,090 --> 00:57:42,420 pointed out, blocking. 1160 00:57:42,420 --> 00:57:45,560 So instead of matrix multiply, just multiplying this matrix 1161 00:57:45,560 --> 00:57:48,920 at a time, you can multiply submatrices and calculate 1162 00:57:48,920 --> 00:57:50,260 small submatrices. 1163 00:57:50,260 --> 00:57:52,950 And if you do that, the submatrix can nicely fit in 1164 00:57:52,950 --> 00:57:54,440 the memory, so let me show you that. 1165 00:57:54,440 --> 00:57:56,510 This is the blocking code. 1166 00:57:56,510 --> 00:57:58,780 Let me show you what that means. 1167 00:57:58,780 --> 00:58:02,580 So what we are doing is you are multiplying this matrix by 1168 00:58:02,580 --> 00:58:06,030 this matrix in there and getting this matrix. 1169 00:58:06,030 --> 00:58:09,240 So what that means is, normally, what you do is, to 1170 00:58:09,240 --> 00:58:13,490 get one element, you have to multiply a row and a column. 1171 00:58:13,490 --> 00:58:16,200 Normally, we are calculating a row of A at a time. 1172 00:58:16,200 --> 00:58:20,420 In order to get a row of A, I had to get a row of B 1173 00:58:20,420 --> 00:58:20,860 [UNINTELLIGIBLE] 1174 00:58:20,860 --> 00:58:24,520 entire C has to participate into calculate A. So by the 1175 00:58:24,520 --> 00:58:28,440 time I have calculated 1024 elements, I have gone through 1176 00:58:28,440 --> 00:58:32,230 this many axises to new elements basically because I 1177 00:58:32,230 --> 00:58:34,970 am getting B again and again, but I had to go through the 1178 00:58:34,970 --> 00:58:36,310 entire C in here. 1179 00:58:36,310 --> 00:58:40,550 So if this much data is bigger than the cache, I have no 1180 00:58:40,550 --> 00:58:42,830 locality in here. 1181 00:58:42,830 --> 00:58:47,590 But if, instead of calculating a row, if I calculate a 32 by 1182 00:58:47,590 --> 00:58:52,570 32 block, which is again 1024 elements I am calculating. 1183 00:58:52,570 --> 00:58:56,050 If I do that, I need a block here and a block here, and I 1184 00:58:56,050 --> 00:58:57,760 only type this much data. 1185 00:59:01,200 --> 00:59:05,610 OK, I'm getting the same amount of output, but because 1186 00:59:05,610 --> 00:59:11,150 of reuse, I touch a lot less inputting here. 1187 00:59:11,150 --> 00:59:13,490 And just by doing this transformation, I can really 1188 00:59:13,490 --> 00:59:16,540 reduce the amount of data. 1189 00:59:16,540 --> 00:59:19,330 Basically, I can create a smaller working set, go 1190 00:59:19,330 --> 00:59:22,240 through that much more, and reduce the amount of cache 1191 00:59:22,240 --> 00:59:23,150 misses you get. 1192 00:59:23,150 --> 00:59:26,750 So by doing that, I basically go out another 3x improvement 1193 00:59:26,750 --> 00:59:30,140 in my CPI, really brought down my L1 cache misses, and 1194 00:59:30,140 --> 00:59:34,070 basically eliminated L2 cache misses. 1195 00:59:34,070 --> 00:59:38,280 OK, so I got a really good speed up in here. 1196 00:59:38,280 --> 00:59:41,370 So you can block multiple levels, L1, L2, L3, because 1197 00:59:41,370 --> 00:59:43,750 every time you figure out the workings and size, you block 1198 00:59:43,750 --> 00:59:48,550 so the data in that will be safe in the cache in here. 1199 00:59:48,550 --> 00:59:51,160 And this is kind of nasty to do by hand. 1200 00:59:51,160 --> 00:59:52,410 We'll see. 1201 00:59:55,400 --> 00:59:57,840 So the other interesting thing is call stack locality. 1202 00:59:57,840 --> 01:00:02,800 So if you do something like fib in here, traditional fib, 1203 01:00:02,800 --> 01:00:06,100 if you do fib(4), you calculate a fib(4), fib(3), 1204 01:00:06,100 --> 01:00:08,330 fib(2), you kind of build a stack, go up 1205 01:00:08,330 --> 01:00:10,830 back again in here. 1206 01:00:10,830 --> 01:00:14,020 So if you keep doing that, the nice thing is you're 1207 01:00:14,020 --> 01:00:17,620 going up, but its-- 1208 01:00:17,620 --> 01:00:20,050 Hopefully, one of these days, I am going to hit the wrong 1209 01:00:20,050 --> 01:00:21,300 button, and-- 1210 01:00:28,520 --> 01:00:31,110 What kind of locality do we have if we are accessing data 1211 01:00:31,110 --> 01:00:32,930 like this in my call stack? 1212 01:00:38,090 --> 01:00:39,650 What kind of locality do we have? 1213 01:00:43,650 --> 01:00:47,170 I mean, I access here, fib(4), and I get back to fib(4) here. 1214 01:00:51,320 --> 01:00:54,890 So within each frame, I am accessing data next to each 1215 01:00:54,890 --> 01:01:01,810 other, so I can have a little bit of a spatial locality 1216 01:01:01,810 --> 01:01:04,640 because, each frame, all the data is next to each other. 1217 01:01:04,640 --> 01:01:07,780 But across frames, normally, I will have some temporal 1218 01:01:07,780 --> 01:01:11,110 because I will get back to that data some time. 1219 01:01:11,110 --> 01:01:12,360 OK? 1220 01:01:18,790 --> 01:01:21,460 OK, before I summarize, this one. 1221 01:01:21,460 --> 01:01:26,240 If you are reading a lot of data, processing through stage 1222 01:01:26,240 --> 01:01:27,590 one, [UNINTELLIGIBLE] 1223 01:01:27,590 --> 01:01:29,680 and producing [UNINTELLIGIBLE], that means 1224 01:01:29,680 --> 01:01:32,000 you're reading a huge amount of data through stage one, 1225 01:01:32,000 --> 01:01:37,600 save, save data again, do stage two, you have really bad 1226 01:01:37,600 --> 01:01:38,780 cache locality. 1227 01:01:38,780 --> 01:01:41,950 One thing you can do is read a small chunk, do all the stages 1228 01:01:41,950 --> 01:01:44,970 on that small chunk, save it, and go to another chunk. 1229 01:01:44,970 --> 01:01:46,890 So instead of doing the entire data, doing something 1230 01:01:46,890 --> 01:01:47,370 [UNINTELLIGIBLE] 1231 01:01:47,370 --> 01:01:50,550 producing, you can do this chunk-ifying, and by doing 1232 01:01:50,550 --> 01:01:52,220 that, you can get a lot of locality. 1233 01:01:52,220 --> 01:01:56,100 So once you bring in data, you like to do as many things as 1234 01:01:56,100 --> 01:01:59,620 possible to that data before it gets out of the cache. 1235 01:01:59,620 --> 01:02:01,260 That's the entire thing about locality, and you can 1236 01:02:01,260 --> 01:02:02,810 restructure code through that. 1237 01:02:02,810 --> 01:02:07,680 So if you look at that, what we have done today, we have 1238 01:02:07,680 --> 01:02:10,510 looked at things like associativity, cache lines, 1239 01:02:10,510 --> 01:02:16,035 cache sets in here, and how to address and map to a cache, we 1240 01:02:16,035 --> 01:02:18,780 kind of did this thing, and we look at [UNINTELLIGIBLE] 1241 01:02:18,780 --> 01:02:21,550 streamings versus InCache versus no 1242 01:02:21,550 --> 01:02:24,250 locality type patterns. 1243 01:02:24,250 --> 01:02:27,540 And then when you look at data structure transformation, we 1244 01:02:27,540 --> 01:02:31,390 look at how to basically replace structured arrays and 1245 01:02:31,390 --> 01:02:35,130 split data structures to make sure that you get a lot of 1246 01:02:35,130 --> 01:02:40,230 nice locality if you access a certain amount of data, and 1247 01:02:40,230 --> 01:02:43,200 pack data to get smaller cache footprints kind of things. 1248 01:02:43,200 --> 01:02:45,760 And we also look at computer transformations, things like 1249 01:02:45,760 --> 01:02:48,840 transpose copying, compute copy, outtype things you can 1250 01:02:48,840 --> 01:02:53,040 do, so you can actually get nice locality, so you can do 1251 01:02:53,040 --> 01:02:55,510 both data and compute transformations. 1252 01:02:55,510 --> 01:02:57,300 And [UNINTELLIGIBLE] 1253 01:02:57,300 --> 01:03:01,360 axises and [? re-cast ?] 1254 01:03:01,360 --> 01:03:04,570 calculation stages, so this is kind of summary. 1255 01:03:04,570 --> 01:03:07,560 Next thing I want go is look at what machines 1256 01:03:07,560 --> 01:03:08,790 you guys are using. 1257 01:03:08,790 --> 01:03:11,810 So this is the machines we used last year. 1258 01:03:11,810 --> 01:03:15,390 These are quad cores, and it had L1 cache, so these 1259 01:03:15,390 --> 01:03:16,130 [UNINTELLIGIBLE]. 1260 01:03:16,130 --> 01:03:20,800 So this has 32 kilobyte L1 cache, data cache 1261 01:03:20,800 --> 01:03:21,520 [UNINTELLIGIBLE] 1262 01:03:21,520 --> 01:03:24,740 two caches in here for data and instructions, each core, 1263 01:03:24,740 --> 01:03:27,890 and then the cache line size is normally 64 bytes 1264 01:03:27,890 --> 01:03:28,930 [UNINTELLIGIBLE]. 1265 01:03:28,930 --> 01:03:31,690 And here, latency [UNINTELLIGIBLE], you get 1266 01:03:31,690 --> 01:03:33,780 three cycles. 1267 01:03:33,780 --> 01:03:41,420 If not, you get 14 cycles to go to to L2 in here if data 1268 01:03:41,420 --> 01:03:42,360 doesn't meet there. 1269 01:03:42,360 --> 01:03:43,700 And associated with this [UNINTELLIGIBLE PHRASE]. 1270 01:03:46,330 --> 01:03:48,690 So that was what we did last year. 1271 01:03:48,690 --> 01:03:52,390 This year, we have a bigger machine that has multiple 1272 01:03:52,390 --> 01:03:53,140 levels of caches. 1273 01:03:53,140 --> 01:03:57,490 So each core has L1 instruction and data cache. 1274 01:03:57,490 --> 01:04:00,160 It's kind of the same, 32 kilobyte cache, four 1275 01:04:00,160 --> 01:04:02,090 nanosecond latency in here. 1276 01:04:02,090 --> 01:04:03,930 Eight, three and four [UNINTELLIGIBLE], they kind of 1277 01:04:03,930 --> 01:04:10,370 reduced that, but they added a 256-kilobyte L2 cache in here, 1278 01:04:10,370 --> 01:04:13,120 and then a pretty large humongo 1279 01:04:13,120 --> 01:04:16,280 12-megabyte L3 cache in here. 1280 01:04:16,280 --> 01:04:19,270 The interesting thing is here, this pretty fast access to 1281 01:04:19,270 --> 01:04:19,910 these ones. 1282 01:04:19,910 --> 01:04:27,410 If you miss in here, this is about 2 and 1/2 times cost 1283 01:04:27,410 --> 01:04:29,190 from going from here to here. 1284 01:04:29,190 --> 01:04:33,780 From here to [UNINTELLIGIBLE], about a five times cost, so 1285 01:04:33,780 --> 01:04:35,710 here to here is a five times cost. 1286 01:04:35,710 --> 01:04:37,835 The interesting thing that actually surprises a lot us 1287 01:04:37,835 --> 01:04:40,410 was here to here, it's cost is not that different. 1288 01:04:40,410 --> 01:04:42,420 [UNINTELLIGIBLE PHRASE] 1289 01:04:42,420 --> 01:04:44,435 So actually memory, even though it's sitting out in 1290 01:04:44,435 --> 01:04:48,520 there, they have managed to get it pretty cheap access. 1291 01:04:48,520 --> 01:04:50,970 Normally, my intuition say this should be much higher 1292 01:04:50,970 --> 01:04:54,200 because you're going off-chip, so this is the structure here. 1293 01:04:54,200 --> 01:04:56,470 It's a lot more complicated. 1294 01:04:56,470 --> 01:04:58,930 So here's something I produced last year. 1295 01:04:58,930 --> 01:05:02,830 So what I did was I ran this program. 1296 01:05:02,830 --> 01:05:06,210 And in these machines, there's a way to turn off the prefetch 1297 01:05:06,210 --> 01:05:08,060 of the previous [UNINTELLIGIBLE]. 1298 01:05:08,060 --> 01:05:09,940 So when you've turned on the prefetch and then you run, 1299 01:05:09,940 --> 01:05:13,230 what you are doing is I am accessing data with a working 1300 01:05:13,230 --> 01:05:15,520 set, small working set, little bit bigger working set. 1301 01:05:15,520 --> 01:05:19,360 So this is basically when you consider working set size. 1302 01:05:19,360 --> 01:05:21,890 So you get a graph like this. 1303 01:05:21,890 --> 01:05:28,120 I am hitting my L1 cache, and if it is more than that, I 1304 01:05:28,120 --> 01:05:29,170 [UNINTELLIGIBLE] 1305 01:05:29,170 --> 01:05:30,520 L1 cache, I go to my L2 cache. 1306 01:05:34,440 --> 01:05:37,900 And if L2 misses, I have to go to main memory. 1307 01:05:37,900 --> 01:05:40,320 Very traditional kind of a graph in here. 1308 01:05:43,450 --> 01:05:46,160 Another a little bit of an interesting thing, which is 1309 01:05:46,160 --> 01:05:47,850 instead of accessing next item, I kind 1310 01:05:47,850 --> 01:05:48,730 of skipped in here. 1311 01:05:48,730 --> 01:05:57,590 So this trying to force how to get conflict misses. 1312 01:05:57,590 --> 01:06:02,110 And so if you line up things like this exactly, what 1313 01:06:02,110 --> 01:06:04,470 happens is you get these conflict misses, and you get 1314 01:06:04,470 --> 01:06:06,400 this number. 1315 01:06:06,400 --> 01:06:07,760 So I have interesting story to tell. 1316 01:06:07,760 --> 01:06:09,860 When I was a graduate student, we were running this bunch of 1317 01:06:09,860 --> 01:06:14,000 benchmarks, and normally when you run a benchmark, you try 1318 01:06:14,000 --> 01:06:15,240 it for an equal 1319 01:06:15,240 --> 01:06:18,820 [UNINTELLIGIBLE], stuff like that. 1320 01:06:18,820 --> 01:06:21,310 And we got this graph like this. 1321 01:06:21,310 --> 01:06:26,410 I mean, we expect a graph going like this. 1322 01:06:26,410 --> 01:06:28,450 OK, that how a benchmark [UNINTELLIGIBLE]. 1323 01:06:28,450 --> 01:06:32,080 Then at one point, by accident, instead of typing an 1324 01:06:32,080 --> 01:06:37,120 equal 64-something, I typed an equal 63, and suddenly, I got 1325 01:06:37,120 --> 01:06:39,130 a performance like this. 1326 01:06:39,130 --> 01:06:41,250 Wait a minute, what's going on? 1327 01:06:41,250 --> 01:06:43,420 So in here, what I got was not a line. 1328 01:06:43,420 --> 01:06:49,020 I got 16, 32, 64, 128, nice graph like that. 1329 01:06:49,020 --> 01:06:51,540 And then I just did 63, like down to here. 1330 01:06:51,540 --> 01:06:53,390 I'm like, geeze, that can't be right. 1331 01:06:53,390 --> 01:06:55,500 And then I start looking at 62, which is here. 1332 01:06:55,500 --> 01:06:56,770 61 is here. 1333 01:06:56,770 --> 01:07:00,720 So I realize if I enumerate all those things, I got a 1334 01:07:00,720 --> 01:07:01,970 graph like this, basically. 1335 01:07:05,380 --> 01:07:07,600 Kind of something like this. 1336 01:07:07,600 --> 01:07:11,820 And the thing is, if I had never tried the 63, I would 1337 01:07:11,820 --> 01:07:13,780 have basically thought my performance was like 1338 01:07:13,780 --> 01:07:16,310 this, so far worse. 1339 01:07:16,310 --> 01:07:20,210 Never realized because it was all having conflicts in memory 1340 01:07:20,210 --> 01:07:21,220 because I'm [UNINTELLIGIBLE] 1341 01:07:21,220 --> 01:07:23,120 2 to the power numbers. 1342 01:07:23,120 --> 01:07:25,300 So this is what happens if you just look at 2 to the power 1343 01:07:25,300 --> 01:07:27,280 numbers, you'll think, yeah, this is the performance you're 1344 01:07:27,280 --> 01:07:29,160 getting, without realizing that, in fact, you 1345 01:07:29,160 --> 01:07:31,290 can get this far. 1346 01:07:31,290 --> 01:07:32,140 So this is good. 1347 01:07:32,140 --> 01:07:34,200 So I ran it on the new machine. 1348 01:07:34,200 --> 01:07:36,650 The problem with our new machine is there's no way to 1349 01:07:36,650 --> 01:07:39,510 turn off prefetching. 1350 01:07:39,510 --> 01:07:41,520 OK, so this is what you're getting. 1351 01:07:41,520 --> 01:07:44,320 Everything, because I am accessing this nice linear, 1352 01:07:44,320 --> 01:07:45,640 looks same. 1353 01:07:45,640 --> 01:07:47,550 This is exactly what you want to happen. 1354 01:07:47,550 --> 01:07:50,760 What it's doing is, minute you go to access, it realizes you 1355 01:07:50,760 --> 01:07:51,800 accessing linearly. 1356 01:07:51,800 --> 01:07:54,190 It start fetching, fetching, fetching data in front you, 1357 01:07:54,190 --> 01:07:55,310 it's going to catch up. 1358 01:07:55,310 --> 01:07:57,840 It's not feeding you data, so you're not going to have any 1359 01:07:57,840 --> 01:08:00,760 cache missed because you're going through this nice, 1360 01:08:00,760 --> 01:08:04,060 linear pattern, and you just basic end up with this 1361 01:08:04,060 --> 01:08:08,250 beautiful graph, which gives the feeling your working set 1362 01:08:08,250 --> 01:08:09,870 could be, now, gigabytes long. 1363 01:08:14,130 --> 01:08:19,620 At this point, I am fitting into the entire L3 cache, and 1364 01:08:19,620 --> 01:08:22,620 still look like this works beautiful because this is what 1365 01:08:22,620 --> 01:08:24,210 you want for it to figure out. 1366 01:08:24,210 --> 01:08:25,740 So now I'm like, geeze, this is great. 1367 01:08:25,740 --> 01:08:28,569 This means my cache works, it's very nice, but I want to 1368 01:08:28,569 --> 01:08:30,850 show you guys what's going on, so that doesn't help. 1369 01:08:30,850 --> 01:08:33,479 So I came up with this program. 1370 01:08:33,479 --> 01:08:37,240 So this program, what I did was I created a working set, 1371 01:08:37,240 --> 01:08:39,689 but I'm accessing randomly within that working set. 1372 01:08:39,689 --> 01:08:40,880 I calculated this address. 1373 01:08:40,880 --> 01:08:44,700 This is my poor man's random number generator, so basically 1374 01:08:44,700 --> 01:08:55,630 I'm adding number in here and multiplying this, and then the 1375 01:08:55,630 --> 01:08:59,290 masking these to get the right [UNINTELLIGIBLE] 1376 01:08:59,290 --> 01:09:01,850 so I get within that set in here. 1377 01:09:01,850 --> 01:09:04,040 And then I just access data, so I'm getting 1378 01:09:04,040 --> 01:09:04,490 [UNINTELLIGIBLE]. 1379 01:09:04,490 --> 01:09:06,770 So I get a number like this, and this jump up here, and 1380 01:09:06,770 --> 01:09:09,620 then I had this kind of behavior. 1381 01:09:09,620 --> 01:09:10,750 This is a lot more complicated. 1382 01:09:10,750 --> 01:09:12,779 I need to figure out what the hell is going on in here, so I 1383 01:09:12,779 --> 01:09:15,729 said, OK, let's now look at performance counters. 1384 01:09:15,729 --> 01:09:17,340 I look at L1 cache miss. 1385 01:09:17,340 --> 01:09:20,620 OK, up to 32K, I have no cache misses. 1386 01:09:20,620 --> 01:09:22,694 Boom, cache misses jump up. 1387 01:09:22,694 --> 01:09:23,189 Perfect. 1388 01:09:23,189 --> 01:09:26,380 And I have this little bit of a kink in here, this beautiful 1389 01:09:26,380 --> 01:09:30,670 describes because I have 32K L1 cache, and until then, I 1390 01:09:30,670 --> 01:09:32,990 have no things, and then when the cache start missing, I'm 1391 01:09:32,990 --> 01:09:34,149 jumping up. 1392 01:09:34,149 --> 01:09:34,750 Perfect. 1393 01:09:34,750 --> 01:09:36,550 I got that description here. 1394 01:09:36,550 --> 01:09:38,779 Then I look at my L2 cache misses. 1395 01:09:38,779 --> 01:09:41,915 This is, I think, the L2 cache in this machine might have a 1396 01:09:41,915 --> 01:09:43,340 little bit more complex [UNINTELLIGIBLE] 1397 01:09:43,340 --> 01:09:45,630 telling us because this is not going like 1398 01:09:45,630 --> 01:09:47,450 here had jumping up. 1399 01:09:47,450 --> 01:09:49,760 That's a 256K [UNINTELLIGIBLE] 1400 01:09:49,760 --> 01:09:50,240 get. 1401 01:09:50,240 --> 01:09:50,880 Is this 256? 1402 01:09:50,880 --> 01:09:51,720 Yeah. 1403 01:09:51,720 --> 01:09:52,580 [INAUDIBLE] 1404 01:09:52,580 --> 01:09:53,760 250 is [UNINTELLIGIBLE] 1405 01:09:53,760 --> 01:09:54,130 these things. 1406 01:09:54,130 --> 01:09:58,920 OK, so we have 32 256 2L megabytes in here. 1407 01:09:58,920 --> 01:10:03,034 Even if you have 256 in here-- 1408 01:10:03,034 --> 01:10:04,284 AUDIENCE: [INAUDIBLE] 1409 01:10:06,506 --> 01:10:07,002 PROFESSOR: Hm? 1410 01:10:07,002 --> 01:10:09,386 AUDIENCE: Should it be 256 or 32-- 1411 01:10:09,386 --> 01:10:10,670 PROFESSOR: [UNINTELLIGIBLE] 1412 01:10:10,670 --> 01:10:12,380 because everything's backed up, it's hierarchy 1413 01:10:12,380 --> 01:10:13,940 [UNINTELLIGIBLE], so it's not added to, it's not next to 1414 01:10:13,940 --> 01:10:16,820 each other, because at this point, L1 is now useless. 1415 01:10:16,820 --> 01:10:19,780 It'll run everything is just going to L2, so L2 is the one 1416 01:10:19,780 --> 01:10:21,205 who was serving it. 1417 01:10:24,800 --> 01:10:26,050 Come on. 1418 01:10:29,920 --> 01:10:32,640 So what happens in here, this normally should start jumping 1419 01:10:32,640 --> 01:10:36,390 here, 256, but it start jumping earlier, so there 1420 01:10:36,390 --> 01:10:37,570 might be some other interesting 1421 01:10:37,570 --> 01:10:40,090 thing going on here. 1422 01:10:40,090 --> 01:10:42,130 It might be interesting to investigate. 1423 01:10:42,130 --> 01:10:44,290 So this is OK, so still I am pretty close. 1424 01:10:44,290 --> 01:10:48,740 I realized this jump is basically because of L2 1425 01:10:48,740 --> 01:10:49,860 happen, this jump in here. 1426 01:10:49,860 --> 01:10:51,174 L2 start missing. 1427 01:10:51,174 --> 01:10:53,890 Then I look at L3. 1428 01:10:53,890 --> 01:10:56,540 It's here. 1429 01:10:56,540 --> 01:10:58,940 What's happening here? 1430 01:10:58,940 --> 01:11:03,520 So this was a big mystery the two of us were trying to debug 1431 01:11:03,520 --> 01:11:06,130 what one hour ago. 1432 01:11:06,130 --> 01:11:07,440 Not one hour. 1433 01:11:07,440 --> 01:11:08,030 Class is going. 1434 01:11:08,030 --> 01:11:08,710 Like two hours ago. 1435 01:11:08,710 --> 01:11:10,140 We was trying to figure out, OK, wait a minute, this 1436 01:11:10,140 --> 01:11:10,750 doesn't make sense. 1437 01:11:10,750 --> 01:11:12,000 My performance is here. 1438 01:11:12,000 --> 01:11:14,950 L3 should be just basically a nice flattening out here, and 1439 01:11:14,950 --> 01:11:16,020 just jumping. 1440 01:11:16,020 --> 01:11:18,480 It's not doing that, and we are like what's going on here. 1441 01:11:18,480 --> 01:11:21,940 So then we were scratching our heads, and we start looking. 1442 01:11:21,940 --> 01:11:23,630 I say, ha, there were other things going on. 1443 01:11:23,630 --> 01:11:26,670 We looked at this thing called TLB. 1444 01:11:26,670 --> 01:11:28,110 Let me tell you what TLB is. 1445 01:11:28,110 --> 01:11:29,990 What we realize is at this point, we are [? seeing ?] 1446 01:11:29,990 --> 01:11:32,200 TLB misses. 1447 01:11:32,200 --> 01:11:35,900 How many people know what a TLB is? 1448 01:11:35,900 --> 01:11:39,280 OK, good, so let me spend the last couple of minutes 1449 01:11:39,280 --> 01:11:41,360 explaining what a TLB this, and that kind of explain 1450 01:11:41,360 --> 01:11:43,390 what's going on in this graph. 1451 01:11:43,390 --> 01:11:48,080 So TLB means-- 1452 01:11:48,080 --> 01:11:54,910 Normally, data is stored in memory using real addresses. 1453 01:11:54,910 --> 01:11:56,970 That means there's a memory size in here, there's an 1454 01:11:56,970 --> 01:11:59,140 address format, there's certain [UNINTELLIGIBLE]. 1455 01:11:59,140 --> 01:12:02,400 And then the program for it use virtual memory. 1456 01:12:02,400 --> 01:12:05,210 That means programs want to feel like they have a lot more 1457 01:12:05,210 --> 01:12:07,540 memory than actual, physical memory available. 1458 01:12:07,540 --> 01:12:10,330 So what happens is you have virtual memory address, but 1459 01:12:10,330 --> 01:12:13,260 somebody has to map virtual memory to physical memory. 1460 01:12:13,260 --> 01:12:16,260 So the way it work is there's a thing called translation 1461 01:12:16,260 --> 01:12:20,690 lookaside buffer, TLB, that says if you give this virtual 1462 01:12:20,690 --> 01:12:22,600 address, here's the right physical address. 1463 01:12:22,600 --> 01:12:25,590 So of course you can't do it for each address by address. 1464 01:12:25,590 --> 01:12:30,880 What you have is you have 4K pages, so the address space is 1465 01:12:30,880 --> 01:12:32,625 broken down into 4K pages. 1466 01:12:35,410 --> 01:12:42,140 Each page will have an entry in this TLB, saying if you get 1467 01:12:42,140 --> 01:12:46,270 this virtual address, here's the right physical address. 1468 01:12:46,270 --> 01:12:48,530 So normally what happens is when the operating system they 1469 01:12:48,530 --> 01:12:53,030 created this large table, for every place in memory, you use 1470 01:12:53,030 --> 01:12:56,040 this table, and it loads into memory. 1471 01:12:56,040 --> 01:12:59,220 Then the hardware will fetch and cache some of those 1472 01:12:59,220 --> 01:13:03,820 entries in this TLB cache in there. 1473 01:13:03,820 --> 01:13:09,680 So normally what happens is in here, it caches 52L entries. 1474 01:13:09,680 --> 01:13:12,620 It can keep 52L entries to 52L pages, 1475 01:13:12,620 --> 01:13:14,560 saying here's the mapping. 1476 01:13:14,560 --> 01:13:20,240 The problem with 52L pages is in a 4K byte page size, 52L 1477 01:13:20,240 --> 01:13:24,730 pages can only hold two megabytes of data. 1478 01:13:24,730 --> 01:13:27,640 If you're accessing more than two megabytes of data, you 1479 01:13:27,640 --> 01:13:30,100 have to go more than 52L pages, and at that point, that 1480 01:13:30,100 --> 01:13:34,010 mapping is not in the TLB, and that means that you have to go 1481 01:13:34,010 --> 01:13:35,640 do a mapping translation. 1482 01:13:35,640 --> 01:13:38,760 The problem for this is these 4K pages is a 1483 01:13:38,760 --> 01:13:42,440 very ancient artifact. 1484 01:13:42,440 --> 01:13:45,480 There's the new machines, you can I ask for what we call 1485 01:13:45,480 --> 01:13:48,195 large-pages or super, but you can ask for 2-megabyte or 1486 01:13:48,195 --> 01:13:49,620 4-megabyte pages. 1487 01:13:49,620 --> 01:13:53,220 If you ask for super-pages, then of course, 52L entry's 1488 01:13:53,220 --> 01:13:54,710 good enough. 1489 01:13:54,710 --> 01:13:58,550 It can hold enough things not to have TLB miss, at least 1490 01:13:58,550 --> 01:13:59,480 while you're in the cache. 1491 01:13:59,480 --> 01:14:02,520 But since Linux, without doing anything, only asks for 4K 1492 01:14:02,520 --> 01:14:06,230 pages, before you run out of your L3 cache, you run out of 1493 01:14:06,230 --> 01:14:08,140 the TLB table. 1494 01:14:08,140 --> 01:14:11,650 And it has to start fetching and [UNINTELLIGIBLE] 1495 01:14:11,650 --> 01:14:14,330 data table, that means you had to actually fetch the table 1496 01:14:14,330 --> 01:14:16,410 entries, and so you have to do more fetchings. 1497 01:14:16,410 --> 01:14:18,880 So that is why this is even running higher, because it's 1498 01:14:18,880 --> 01:14:22,370 trying to fetch this TLB entry and stuff like that. 1499 01:14:22,370 --> 01:14:22,830 Question? 1500 01:14:22,830 --> 01:14:25,508 AUDIENCE: Is there a way to then actually fiddle with the 1501 01:14:25,508 --> 01:14:29,170 TLB so that you have to fit your pages rather than-- 1502 01:14:29,170 --> 01:14:32,450 PROFESSOR: So in allocation, you can ask for super-pages or 1503 01:14:32,450 --> 01:14:34,400 large-pages. 1504 01:14:34,400 --> 01:14:38,520 But in Linux, normally just ask for 4K pages, so that the 1505 01:14:38,520 --> 01:14:40,390 operating systems haven't kind of caught up 1506 01:14:40,390 --> 01:14:42,050 to the current hardware. 1507 01:14:42,050 --> 01:14:43,560 And this is was built [UNINTELLIGIBLE] 1508 01:14:43,560 --> 01:14:45,860 doesn't make sense TLB to run out before 1509 01:14:45,860 --> 01:14:47,110 you run out the cache. 1510 01:14:47,110 --> 01:14:49,590 I mean, that's kind of mismatch in here. 1511 01:14:49,590 --> 01:14:52,480 But it's built assuming you'll get large-pages, but if you 1512 01:14:52,480 --> 01:14:54,780 don't, you're going to run out of the TLB in there. 1513 01:14:54,780 --> 01:14:58,440 So this is why sometimes performance engineering is 1514 01:14:58,440 --> 01:15:01,470 interesting, exciting, and frustrating, because you 1515 01:15:01,470 --> 01:15:03,210 assume you know where the hell this is going. 1516 01:15:03,210 --> 01:15:05,460 You assume this is going to go here and that, and suddenly, 1517 01:15:05,460 --> 01:15:08,240 this miss shows up, and so we have to scratch our heads. 1518 01:15:08,240 --> 01:15:11,820 And luckily, we found what's going on, but we could have 1519 01:15:11,820 --> 01:15:14,560 spent hours or days trying to figure out what's going on 1520 01:15:14,560 --> 01:15:18,480 here because this is one [UNINTELLIGIBLE] 1521 01:15:18,480 --> 01:15:19,900 say OK, let's try TLB, and it worked. 1522 01:15:19,900 --> 01:15:23,870 That was really good, but I have, many cases, spent days 1523 01:15:23,870 --> 01:15:27,450 trying to figure out why something is going on because 1524 01:15:27,450 --> 01:15:29,630 performance is kind of a [UNINTELLIGIBLE] 1525 01:15:29,630 --> 01:15:30,210 of everything. 1526 01:15:30,210 --> 01:15:32,300 There's no nice abstractions and stuff like that. 1527 01:15:32,300 --> 01:15:35,590 If you think abstraction wise, somewhere outside the 1528 01:15:35,590 --> 01:15:37,580 abstraction, that's going to kill you. 1529 01:15:37,580 --> 01:15:39,750 So that is why it can be interesting, because you have 1530 01:15:39,750 --> 01:15:41,870 to know everything. 1531 01:15:41,870 --> 01:15:45,030 And the fun thing is, once you understand performance, you 1532 01:15:45,030 --> 01:15:47,730 basically understand end-to-end, soup to nuts, 1533 01:15:47,730 --> 01:15:49,160 what's going on. 1534 01:15:49,160 --> 01:15:52,520 OK, with that encouraging thought, let's see how you 1535 01:15:52,520 --> 01:15:55,335 guys do in the next part of project two.