1 00:00:00,120 --> 00:00:02,500 The following content is provided under a Creative 2 00:00:02,500 --> 00:00:03,910 Commons license. 3 00:00:03,910 --> 00:00:06,950 Your support will help MIT OpenCourseWare continue to 4 00:00:06,950 --> 00:00:10,600 offer high quality educational resources for free. 5 00:00:10,600 --> 00:00:13,500 To make a donation or view additional materials from 6 00:00:13,500 --> 00:00:17,780 hundreds of MIT courses, visit MIT OpenCourseWare at 7 00:00:17,780 --> 00:00:19,030 ocw.mit.edu. 8 00:00:29,180 --> 00:00:31,450 PROFESSOR: As we come close to testing, we 9 00:00:31,450 --> 00:00:32,530 have shrinkage here. 10 00:00:32,530 --> 00:00:34,130 People probably left home. 11 00:00:34,130 --> 00:00:37,700 Hopefully, everybody who left home finished their report. 12 00:00:37,700 --> 00:00:40,280 So you guys have all looked into how to 13 00:00:40,280 --> 00:00:41,400 do the final project. 14 00:00:41,400 --> 00:00:43,900 And have all the ideas how to go and optimize. 15 00:00:47,110 --> 00:00:50,235 How many people have downloaded, compiled, and ran, 16 00:00:50,235 --> 00:00:51,490 and you know what's going on? 17 00:00:51,490 --> 00:00:52,010 OK. 18 00:00:52,010 --> 00:00:52,450 Good. 19 00:00:52,450 --> 00:00:52,660 Good. 20 00:00:52,660 --> 00:00:53,910 Good. 21 00:00:55,930 --> 00:00:57,180 Exactly. 22 00:01:00,710 --> 00:01:01,950 It's happening right now. 23 00:01:01,950 --> 00:01:02,800 OK. 24 00:01:02,800 --> 00:01:03,810 Good. 25 00:01:03,810 --> 00:01:08,161 So I will repeat this what I said last time in here. 26 00:01:08,161 --> 00:01:10,790 We're going to have a design review with your masters. 27 00:01:10,790 --> 00:01:14,840 So just look for us to send you the information. 28 00:01:14,840 --> 00:01:16,670 That means when you come back from 29 00:01:16,670 --> 00:01:18,520 Thanksgiving, schedule it early. 30 00:01:18,520 --> 00:01:23,790 So they can help if you have any changes in design process. 
31 00:01:23,790 --> 00:01:27,820 And then we have a competition on December 9 in class here, 32 00:01:27,820 --> 00:01:28,840 trying to figure out who has the 33 00:01:28,840 --> 00:01:33,650 fastest ray tracer created. 34 00:01:33,650 --> 00:01:37,660 And in fact, this year there is an Akamai prize for the 35 00:01:37,660 --> 00:01:41,400 winning team, including they have a kind of celebration and 36 00:01:41,400 --> 00:01:42,790 demonstration in their headquarters. 37 00:01:42,790 --> 00:01:46,590 You get to go get a tour of their NOC and stuff like that. 38 00:01:46,590 --> 00:01:52,440 Plus, every winning member is going to get an iPod Nano. 39 00:01:52,440 --> 00:01:55,440 So there's a lot more motivation now to get the 40 00:01:55,440 --> 00:02:00,280 fastest running ray tracer. OK. 41 00:02:00,280 --> 00:02:04,770 So with that, let's switch gears a little bit. 42 00:02:04,770 --> 00:02:10,039 So today, I'm going to talk about distributed systems. 43 00:02:10,039 --> 00:02:13,580 Until now what we looked at was, OK, given a box how to 44 00:02:13,580 --> 00:02:18,555 get something running as fast as possible inside that box. 45 00:02:18,555 --> 00:02:23,170 And today we're going to look at going outside the box. 46 00:02:23,170 --> 00:02:27,750 Basically, we want to scale up to clusters of machines. 47 00:02:27,750 --> 00:02:30,900 That means the room can have 10, 15 machines. 48 00:02:30,900 --> 00:02:35,300 In fact, for your class, you guys are using-- 49 00:02:35,300 --> 00:02:37,440 how many machines do we have? 50 00:02:37,440 --> 00:02:38,170 16 machines. 51 00:02:38,170 --> 00:02:39,510 So you are using them independently. 52 00:02:39,510 --> 00:02:42,280 But you can use them as one gigantic machine if you can 53 00:02:42,280 --> 00:02:43,280 and run something. 54 00:02:43,280 --> 00:02:44,540 And data center scale. 55 00:02:44,540 --> 00:02:47,370 This is kind of people like Google and Amazon, who have these 56 00:02:47,370 --> 00:02:48,250 kinds of things.
57 00:02:48,250 --> 00:02:51,100 And finally, Planet Scale. 58 00:02:51,100 --> 00:02:53,790 If you want to run something even bigger, larger. 59 00:02:53,790 --> 00:02:55,420 What you have to deal with, and what kind of issues you 60 00:02:55,420 --> 00:02:55,860 have to deal with. 61 00:02:55,860 --> 00:02:57,600 It's time to reboot my machine. 62 00:02:57,600 --> 00:03:01,890 And I have to be pressing this button probably four or five 63 00:03:01,890 --> 00:03:05,190 times during the day. 64 00:03:05,190 --> 00:03:07,560 So Cluster Scale. 65 00:03:07,560 --> 00:03:11,340 So you want to run a program on multiple machines. 66 00:03:11,340 --> 00:03:12,910 And OK, Let me put it there. 67 00:03:12,910 --> 00:03:14,780 Why the heck do you want to do that? 68 00:03:14,780 --> 00:03:16,930 What's the advantages of-- 69 00:03:16,930 --> 00:03:20,720 instead of running on one nice machine, running on a cluster 70 00:03:20,720 --> 00:03:21,180 of machines? 71 00:03:21,180 --> 00:03:22,430 What do you get? 72 00:03:27,982 --> 00:03:28,914 AUDIENCE: It's cheaper. 73 00:03:28,914 --> 00:03:29,850 PROFESSOR: It's cheaper. 74 00:03:29,850 --> 00:03:30,530 That's a good one. 75 00:03:30,530 --> 00:03:33,910 It's cheaper to get a bunch of small machines than to buy a 76 00:03:33,910 --> 00:03:36,290 humongo mainframe type machine. 77 00:03:36,290 --> 00:03:37,900 Yes, that's a very good answer. 78 00:03:37,900 --> 00:03:39,150 What else? 79 00:03:41,798 --> 00:03:43,734 AUDIENCE: It's very slow. 80 00:03:43,734 --> 00:03:44,870 It's slower. 81 00:03:44,870 --> 00:03:48,290 PROFESSOR: So you would run something because it's slower? 82 00:03:48,290 --> 00:03:50,170 AUDIENCE: But it is a trade-off. 83 00:03:50,170 --> 00:03:54,635 PROFESSOR: Yes, so there's some trade off between speed. 84 00:03:54,635 --> 00:03:57,570 But it might not be that much. 85 00:03:57,570 --> 00:03:59,820 Even when you get a gigantic machine, there are 86 00:03:59,820 --> 00:04:00,630 bottlenecks in it. 
87 00:04:00,630 --> 00:04:03,270 In a cluster kind of thing, you can avoid the bottlenecks. 88 00:04:03,270 --> 00:04:05,080 But hopefully, you're trying to do it to get some 89 00:04:05,080 --> 00:04:09,540 performance in scaling to a large number of 90 00:04:09,540 --> 00:04:11,680 users and what not. 91 00:04:11,680 --> 00:04:16,470 So basically, what you want to get is-- 92 00:04:16,470 --> 00:04:19,600 so get more parallelism. 93 00:04:19,600 --> 00:04:21,529 Because now we have more machines, more cores. 94 00:04:21,529 --> 00:04:23,120 And hopefully get higher throughput. 95 00:04:23,120 --> 00:04:24,810 Definitely, because you are doing it. 96 00:04:24,810 --> 00:04:27,590 Hopefully, it's a little bit of lower latency too, because 97 00:04:27,590 --> 00:04:32,580 if you have one gigantic system, if everything has to 98 00:04:32,580 --> 00:04:37,150 go through bottlenecks, it might be slower than basically 99 00:04:37,150 --> 00:04:38,390 having different systems. 100 00:04:38,390 --> 00:04:46,510 So assume, just an example, if you are something like Verizon 101 00:04:46,510 --> 00:04:49,480 or Netflix trying to serve your videos. 102 00:04:49,480 --> 00:04:52,230 It makes much more sense to have a bunch of clusters of 103 00:04:52,230 --> 00:04:55,360 machines each doing a lot of independent work than trying 104 00:04:55,360 --> 00:04:58,360 to send all the videos to one machine. 105 00:04:58,360 --> 00:05:01,930 Another interesting fact is robustness. 106 00:05:01,930 --> 00:05:04,720 So until now you guys didn't care about robustness, because 107 00:05:04,720 --> 00:05:07,280 if something went wrong, the entire thing collapsed. 108 00:05:07,280 --> 00:05:09,420 There's no half baked machine. 109 00:05:09,420 --> 00:05:11,440 The machine crashed, your program crashed. 110 00:05:11,440 --> 00:05:11,870 Everything died. 111 00:05:11,870 --> 00:05:14,530 So you just have this fatalistic attitude. 112 00:05:14,530 --> 00:05:15,405 OK.
113 00:05:15,405 --> 00:05:16,880 It crashed. 114 00:05:16,880 --> 00:05:17,520 Everything is dead. 115 00:05:17,520 --> 00:05:19,000 So why bother? 116 00:05:19,000 --> 00:05:23,847 But in these clusters, if you have a lot of machines, if one 117 00:05:23,847 --> 00:05:25,550 machine dies, there are many others to pick up. 118 00:05:25,550 --> 00:05:29,650 So you can have a system that probably has availability much 119 00:05:29,650 --> 00:05:32,150 higher than what you can get on a single machine. 120 00:05:32,150 --> 00:05:34,360 And finally, cost savings. 121 00:05:34,360 --> 00:05:36,400 Because it's cheaper to do this, have a 122 00:05:36,400 --> 00:05:37,700 bunch of small machines. 123 00:05:37,700 --> 00:05:41,340 And businesses like Google have really taken 124 00:05:41,340 --> 00:05:42,590 advantage of that. 125 00:05:45,790 --> 00:05:48,520 So there are issues we have to deal with in order to program 126 00:05:48,520 --> 00:05:50,162 this damn thing. 127 00:05:50,162 --> 00:05:52,240 And if you want to get performance, you have to 128 00:05:52,240 --> 00:05:53,780 program in a way to get good performance. 129 00:05:53,780 --> 00:05:56,820 You don't want to run much slower and get less 130 00:05:56,820 --> 00:05:57,920 performance than one box. 131 00:05:57,920 --> 00:06:00,650 You want to get performance and also performance scalability. 132 00:06:00,650 --> 00:06:03,240 That means if you get 10 machines, you want to get some 133 00:06:03,240 --> 00:06:05,370 performance, and if you have 20, 134 00:06:05,370 --> 00:06:07,370 hopefully, you want to get a lot more performance than with 10. 135 00:06:07,370 --> 00:06:09,630 So how do we keep things scaling in there? 136 00:06:09,630 --> 00:06:12,510 And the other thing is robustness. 137 00:06:12,510 --> 00:06:16,970 So the idea there is if you have one machine, you're 138 00:06:16,970 --> 00:06:17,210 fatalistic. 139 00:06:17,210 --> 00:06:18,730 If the machine goes, everything goes.
140 00:06:18,730 --> 00:06:20,080 You don't care. 141 00:06:20,080 --> 00:06:22,870 But if you have a lot more machines, you want to make 142 00:06:22,870 --> 00:06:26,100 sure that the application runs even if a machine fails. 143 00:06:26,100 --> 00:06:28,170 Worse, if you have a lot of machines, there's a lot more 144 00:06:28,170 --> 00:06:29,460 chance of failure. 145 00:06:29,460 --> 00:06:32,130 So if one goes down and everything still crashes, 146 00:06:32,130 --> 00:06:34,470 then your application will be a lot less robust even than a 147 00:06:34,470 --> 00:06:35,990 single machine, because there are too many 148 00:06:35,990 --> 00:06:37,845 moving parts to go wrong. 149 00:06:37,845 --> 00:06:39,890 So you want to actually deal with this robustness. 150 00:06:39,890 --> 00:06:44,780 So that adds an entire new dimension in there. 151 00:06:44,780 --> 00:06:47,050 We are not going to go too much deeper into robustness. 152 00:06:47,050 --> 00:06:50,340 But that is one big thing that you have to really worry about 153 00:06:50,340 --> 00:06:53,570 when you go to distributed systems. 154 00:06:53,570 --> 00:06:53,830 OK. 155 00:06:53,830 --> 00:06:55,700 What's a distributed system? 156 00:06:55,700 --> 00:07:00,400 So this is what we have been working with so far. 157 00:07:00,400 --> 00:07:02,080 Can we see if we can reduce the lights? 158 00:07:06,980 --> 00:07:08,480 I guess up there you can't-- 159 00:07:08,480 --> 00:07:10,780 OK. 160 00:07:10,780 --> 00:07:12,030 We don't go fully dark, we'll see. 161 00:07:16,450 --> 00:07:19,680 Oh, that's you guys. 162 00:07:19,680 --> 00:07:22,420 Don't go to sleep even though light is-- 163 00:07:22,420 --> 00:07:23,270 there. 164 00:07:23,270 --> 00:07:25,510 So this should be over there and I don't have any way to 165 00:07:25,510 --> 00:07:26,700 darken this side. 166 00:07:26,700 --> 00:07:29,580 So these are the machines we have been thinking about. 167 00:07:29,580 --> 00:07:31,020 We have a memory system.
168 00:07:31,020 --> 00:07:34,170 And more than just having a shared memory, 169 00:07:34,170 --> 00:07:35,760 we have cache coherence. 170 00:07:35,760 --> 00:07:39,230 So that means if two people want to communicate, one can write 171 00:07:39,230 --> 00:07:42,250 to a single memory location, and that 172 00:07:42,250 --> 00:07:46,000 data appears on all the different cores. 173 00:07:46,000 --> 00:07:48,770 So we can use that information to basically communicate between 174 00:07:48,770 --> 00:07:49,320 the processors. 175 00:07:49,320 --> 00:07:51,350 That's really nice. 176 00:07:51,350 --> 00:07:57,040 So a distributed memory machine has no shared memory. 177 00:07:57,040 --> 00:07:58,050 So each memory is-- 178 00:07:58,050 --> 00:07:59,995 Now, how are you going to communicate? 179 00:08:03,130 --> 00:08:04,490 Message? 180 00:08:04,490 --> 00:08:06,000 Yeah, this is not software. 181 00:08:06,000 --> 00:08:08,100 You actually need something additional. 182 00:08:08,100 --> 00:08:11,060 Something like a network, or Internet, something sitting 183 00:08:11,060 --> 00:08:13,360 behind that actually lets you communicate 184 00:08:13,360 --> 00:08:15,830 between each other. 185 00:08:15,830 --> 00:08:19,690 So if you just really look at the kind of cost, this is a 186 00:08:19,690 --> 00:08:21,740 back of the envelope type calculation. 187 00:08:21,740 --> 00:08:23,710 Register is probably one cycle. 188 00:08:23,710 --> 00:08:26,090 Cache is about 10 cycles. 189 00:08:26,090 --> 00:08:28,940 If you go to DRAM, you can get about 1,000 cycles. 190 00:08:28,940 --> 00:08:32,520 Remote memory, going somewhere across, is, again, another 191 00:08:32,520 --> 00:08:34,480 order of magnitude from that. 192 00:08:34,480 --> 00:08:36,090 So of course, you keep adding. 193 00:08:36,090 --> 00:08:37,789 And that's probably the reason that sometimes 194 00:08:37,789 --> 00:08:38,370 things can be slow.
195 00:08:38,370 --> 00:08:41,370 Because now, we have another layer that's even slower. 196 00:08:41,370 --> 00:08:46,610 So we have to think about it, worry about it when you're 197 00:08:46,610 --> 00:08:48,880 writing code for these types of machines. 198 00:08:48,880 --> 00:08:53,190 So in shared memory machines, we learn in 199 00:08:53,190 --> 00:08:54,750 languages like Cilk. 200 00:08:54,750 --> 00:08:57,300 It's very nice to communicate because we 201 00:08:57,300 --> 00:08:58,930 synchronize via locks. 202 00:08:58,930 --> 00:09:01,780 And all communication via memory. 203 00:09:01,780 --> 00:09:03,560 Because when you write something, if you look at that 204 00:09:03,560 --> 00:09:05,990 memory location, everybody else will see it. 205 00:09:05,990 --> 00:09:08,275 And if you put the right synchronization, hopefully you 206 00:09:08,275 --> 00:09:10,600 will get the value you want. 207 00:09:10,600 --> 00:09:13,620 In distributed memory machines, there's 208 00:09:13,620 --> 00:09:14,820 nothing like that. 209 00:09:14,820 --> 00:09:16,700 So what we see is we explicitly 210 00:09:16,700 --> 00:09:18,510 sends some data across. 211 00:09:18,510 --> 00:09:20,240 So you have what we call messages. 212 00:09:20,240 --> 00:09:23,350 And that means if you want to send something to-- if another 213 00:09:23,350 --> 00:09:25,286 person needs to look at something, we have to send it 214 00:09:25,286 --> 00:09:26,260 to that person. 215 00:09:26,260 --> 00:09:27,755 So you have to originate yourself. 216 00:09:27,755 --> 00:09:28,470 Saying, I'm sending. 217 00:09:28,470 --> 00:09:30,100 That other person has to receive it. 218 00:09:30,100 --> 00:09:32,950 And they have to put it wherever you want. 219 00:09:32,950 --> 00:09:36,620 So everybody's address space is separate. 220 00:09:36,620 --> 00:09:38,820 And if you want to synchronize, you would also do 221 00:09:38,820 --> 00:09:39,540 it through the message. 
222 00:09:39,540 --> 00:09:41,940 So you send a message, then the other person wait for the 223 00:09:41,940 --> 00:09:43,730 message to come. 224 00:09:43,730 --> 00:09:50,550 And so this shows you what normally happens in messages. 225 00:09:50,550 --> 00:09:53,250 In the shared memory, there's nothing called message size. 226 00:09:53,250 --> 00:09:54,560 You write a cache line. 227 00:09:54,560 --> 00:09:56,070 The cache line moves. 228 00:09:56,070 --> 00:09:58,500 And you can't keep changing the cache line size. 229 00:09:58,500 --> 00:10:01,220 Hopefully, prefetcher will be good and do something nice. 230 00:10:01,220 --> 00:10:02,700 But you don't have that much choice. 231 00:10:02,700 --> 00:10:05,430 In messages, you can compose any size of message you want. 232 00:10:05,430 --> 00:10:10,930 So what this graph shows is the minimum cost and average 233 00:10:10,930 --> 00:10:13,720 cost of different size messages. 234 00:10:13,720 --> 00:10:16,030 So there's a couple of things to get out of this graph. 235 00:10:16,030 --> 00:10:20,590 One is that if the message is even 0 length, or very small, 236 00:10:20,590 --> 00:10:21,670 you still have overhead. 237 00:10:21,670 --> 00:10:23,240 You're going to send the darn message. 238 00:10:23,240 --> 00:10:27,000 So if even you send nothing, it cost you some amount. 239 00:10:27,000 --> 00:10:29,670 And the second thing is as the message gets bigger and 240 00:10:29,670 --> 00:10:31,820 bigger, the cost keeps increasing, because now you're 241 00:10:31,820 --> 00:10:33,640 sending more and more data. 242 00:10:33,640 --> 00:10:36,670 So if you really amortize the overhead cost, you are to send 243 00:10:36,670 --> 00:10:38,320 large messages in there. 244 00:10:38,320 --> 00:10:41,600 Another thing this chart shows is that as messages become 245 00:10:41,600 --> 00:10:45,870 bigger, the kind of the distribution of overhead is 246 00:10:45,870 --> 00:10:47,440 all over the map. 
247 00:10:47,440 --> 00:10:50,840 Because now we are sending large things, a lot of other 248 00:10:50,840 --> 00:10:52,150 craziness happens to these things. 249 00:10:52,150 --> 00:10:53,910 So sometimes it can go fast, sometimes it 250 00:10:53,910 --> 00:10:54,730 can be pretty slow. 251 00:10:54,730 --> 00:10:57,560 I will get why it might be sometimes this kind of 252 00:10:57,560 --> 00:10:59,610 distribution shortly. 253 00:10:59,610 --> 00:11:02,790 So the main point is that, that you don't send smaller 254 00:11:02,790 --> 00:11:06,880 messages if you can, because the overhead is too high. 255 00:11:06,880 --> 00:11:08,060 So why is this? 256 00:11:08,060 --> 00:11:11,310 Why is sending messages complicated? 257 00:11:11,310 --> 00:11:13,940 Till now, there's nobody sitting 258 00:11:13,940 --> 00:11:16,290 between you and hardware. 259 00:11:16,290 --> 00:11:19,550 Once you send the program run, you own the entire hardware, 260 00:11:19,550 --> 00:11:24,060 and after figuring out all the weirdness that's on x86 261 00:11:24,060 --> 00:11:25,670 there's nothing in between you. 262 00:11:25,670 --> 00:11:27,920 You probably won't look at the compile code if you look at 263 00:11:27,920 --> 00:11:29,940 what assembly is generated you have full view 264 00:11:29,940 --> 00:11:31,770 what's going on in here. 265 00:11:31,770 --> 00:11:34,870 Unfortunately, message passing, a lot of other things 266 00:11:34,870 --> 00:11:35,480 come into play. 267 00:11:35,480 --> 00:11:37,320 So if you want to send a message and the applications 268 00:11:37,320 --> 00:11:39,700 says, aha, I'm sending a message. 269 00:11:39,700 --> 00:11:41,960 And normally, it will do a system call 270 00:11:41,960 --> 00:11:43,180 to operating system. 271 00:11:43,180 --> 00:11:45,460 And normally, this message will get copied into the 272 00:11:45,460 --> 00:11:46,860 operating system. 273 00:11:46,860 --> 00:11:48,210 It's copying here. 
274 00:11:48,210 --> 00:11:50,680 This operating system called the operating system wakes up. 275 00:11:50,680 --> 00:11:53,500 This might be when this scheduled, there's a lot of 276 00:11:53,500 --> 00:11:54,350 things going on. 277 00:11:54,350 --> 00:11:56,390 And then the operating system has to send to the network 278 00:11:56,390 --> 00:11:57,620 interface card. 279 00:11:57,620 --> 00:12:00,230 And the network will say, OK, I can't send long messages. 280 00:12:00,230 --> 00:12:03,930 I'm going to break into a bunch of small messages. 281 00:12:03,930 --> 00:12:05,600 And put some hardware here. 282 00:12:05,600 --> 00:12:07,720 And it will end up in the other side. 283 00:12:07,720 --> 00:12:11,600 In a bunch of fragmented small pieces that the network 284 00:12:11,600 --> 00:12:14,440 interface unit has to reassemble into one message 285 00:12:14,440 --> 00:12:15,570 and deliver up. 286 00:12:15,570 --> 00:12:16,840 And this will probably-- 287 00:12:16,840 --> 00:12:19,660 it will copy back into the application. 288 00:12:19,660 --> 00:12:22,120 So what that means is that a lot of other things getting 289 00:12:22,120 --> 00:12:24,920 involved, each optimize separately, doing a lot of 290 00:12:24,920 --> 00:12:26,090 different things. 291 00:12:26,090 --> 00:12:29,610 And so that is why you have this big unpredictable mess 292 00:12:29,610 --> 00:12:32,300 happening in message passing. 293 00:12:32,300 --> 00:12:37,050 And so there you not only have to worry about your code. 294 00:12:37,050 --> 00:12:38,730 You have to worry about what the operating system is doing. 295 00:12:38,730 --> 00:12:40,590 You have to worry about what the network is doing. 296 00:12:40,590 --> 00:12:43,060 You have to worry about your network card's doing. 297 00:12:43,060 --> 00:12:44,930 So there's a lot of moving parts in this. 
298 00:12:44,930 --> 00:12:47,180 If you want to get really, really good performance, 299 00:12:47,180 --> 00:12:51,090 people have to worry about all these things in here. 300 00:12:51,090 --> 00:12:54,650 So let's look at how a message works. 301 00:12:54,650 --> 00:12:56,150 So I hope you can see these diagrams. 302 00:12:56,150 --> 00:12:58,060 Can you see these? 303 00:12:58,060 --> 00:12:58,530 Barely? 304 00:12:58,530 --> 00:12:59,400 So let me say-- 305 00:12:59,400 --> 00:13:01,675 So I have a sending process and a receiving process. 306 00:13:01,675 --> 00:13:03,110 Oh, don't reboot please. 307 00:13:08,160 --> 00:13:11,320 And so what happens if-- this is a message-- if we are 308 00:13:11,320 --> 00:13:14,280 sending without any buffering of a message, that means I am 309 00:13:14,280 --> 00:13:16,250 not copying it anywhere, so assume I 310 00:13:16,250 --> 00:13:17,170 want to send a message. 311 00:13:17,170 --> 00:13:19,650 I said I have a message to send. 312 00:13:19,650 --> 00:13:23,260 And then what happens in this model is, OK, until the 313 00:13:23,260 --> 00:13:26,440 receiver is ready, you have to wait. 314 00:13:26,440 --> 00:13:29,860 Because there is no place to send the message. 315 00:13:29,860 --> 00:13:33,620 So finally, when the other side says I want to receive 316 00:13:33,620 --> 00:13:36,440 something, it will tell this thing it's OK to send. 317 00:13:36,440 --> 00:13:37,770 And it will copy the data. 318 00:13:37,770 --> 00:13:42,410 And then after copying the data, both parts can continue. 319 00:13:42,410 --> 00:13:48,510 So this is what happens if sender wants to send early. 320 00:13:48,510 --> 00:13:50,830 If you're very lucky, the minute you try to send, the 321 00:13:50,830 --> 00:13:52,250 receiver says I want it. 322 00:13:52,250 --> 00:13:54,170 And we have very little delay. 323 00:13:54,170 --> 00:13:55,660 And everything gets copied. 324 00:13:55,660 --> 00:13:57,410 And that's in your lucky case. 
325 00:13:57,410 --> 00:13:59,880 In other cases, the receiver wants some data. 326 00:14:02,760 --> 00:14:05,580 But the sender is not ready, so your receiver has to wait 327 00:14:05,580 --> 00:14:07,010 until the sender wants to send it. 328 00:14:07,010 --> 00:14:07,770 And when this message is 329 00:14:07,770 --> 00:14:10,220 sent, you copy the data in here. 330 00:14:10,220 --> 00:14:13,430 So this is a very naive simple way. 331 00:14:13,430 --> 00:14:16,570 What can we eliminate out of this? 332 00:14:16,570 --> 00:14:20,132 How can we make it a little bit faster? 333 00:14:20,132 --> 00:14:22,480 AUDIENCE: Buffer. 334 00:14:22,480 --> 00:14:22,840 PROFESSOR: Yeah. 335 00:14:22,840 --> 00:14:24,410 If you buffer, what will it eliminate? 336 00:14:24,410 --> 00:14:26,090 What will go away? 337 00:14:26,090 --> 00:14:28,580 Out of-- we have this overhead, this overhead, and 338 00:14:28,580 --> 00:14:30,390 this overhead. 339 00:14:30,390 --> 00:14:31,640 Which overheads can get eliminated? 340 00:14:35,390 --> 00:14:38,110 Wait for send can go away. 341 00:14:38,110 --> 00:14:39,260 So what happens is-- 342 00:14:39,260 --> 00:14:45,750 So here actually what they're showing is buffering also with 343 00:14:45,750 --> 00:14:46,710 some hardware support. 344 00:14:46,710 --> 00:14:49,730 That means I am trying to send something. 345 00:14:49,730 --> 00:14:52,570 And the minute I copied it out there, I can 346 00:14:52,570 --> 00:14:53,540 keep working in there. 347 00:14:53,540 --> 00:14:54,870 And somewhere in the background where you send the 348 00:14:54,870 --> 00:14:56,295 data, it will arrive here. 349 00:14:56,295 --> 00:14:59,740 And if the receiver wants it, the data is there. 350 00:14:59,740 --> 00:15:02,210 Of course, if the receiver comes early and asks for data, 351 00:15:02,210 --> 00:15:02,980 you can't do that. 352 00:15:02,980 --> 00:15:07,100 Still you have to wait, because the data is not there.
353 00:15:07,100 --> 00:15:09,600 However, if there's no hardware support, both have to 354 00:15:09,600 --> 00:15:11,790 probably wait a little bit, because you have to get the 355 00:15:11,790 --> 00:15:12,540 data copied. 356 00:15:12,540 --> 00:15:14,380 So if you have a lot of hardware support, you don't 357 00:15:14,380 --> 00:15:15,690 see this copy time. 358 00:15:15,690 --> 00:15:18,960 But if there's no hardware support, you see some copy 359 00:15:18,960 --> 00:15:21,580 time going in here. 360 00:15:21,580 --> 00:15:25,150 So what's the advantage of this versus-- 361 00:15:25,150 --> 00:15:27,070 OK, tell me one advantage of this 362 00:15:27,070 --> 00:15:30,730 method versus this method. 363 00:15:30,730 --> 00:15:33,600 So of course, in this one there is a lot of wait time and 364 00:15:33,600 --> 00:15:34,180 stuff like that. 365 00:15:34,180 --> 00:15:35,120 We know that. 366 00:15:35,120 --> 00:15:38,370 But is there any advantage of doing this one, this waiting 367 00:15:38,370 --> 00:15:40,990 until sending and sending it there versus this kind of a 368 00:15:40,990 --> 00:15:42,944 nice sending it in the background? 369 00:15:42,944 --> 00:15:43,912 AUDIENCE: They're synchronized. 370 00:15:43,912 --> 00:15:44,396 PROFESSOR: Hmm? 371 00:15:44,396 --> 00:15:46,820 AUDIENCE: Synchronized. 372 00:15:46,820 --> 00:15:48,520 PROFESSOR: Synchronized is one advantage. 373 00:15:48,520 --> 00:15:51,780 What else might happen? 374 00:15:51,780 --> 00:15:53,470 So what else are you going to do to get this 375 00:15:53,470 --> 00:15:54,850 kind of thing working? 376 00:16:00,960 --> 00:16:03,200 So in order for this to make progress, what do you have to 377 00:16:03,200 --> 00:16:05,970 do to the data? 378 00:16:05,970 --> 00:16:07,020 It has to be copied. 379 00:16:07,020 --> 00:16:08,910 So it has to get multiple copies. 380 00:16:08,910 --> 00:16:10,330 So from application space. 381 00:16:10,330 --> 00:16:12,520 It has to get copied to operating system space.
382 00:16:12,520 --> 00:16:14,980 It has to get copied into the networking stack. 383 00:16:14,980 --> 00:16:16,920 So data keep getting copying, and copying, 384 00:16:16,920 --> 00:16:18,080 and copying in there. 385 00:16:18,080 --> 00:16:19,830 And in here you basically don't copy. 386 00:16:19,830 --> 00:16:21,070 You just say, OK, wait. 387 00:16:21,070 --> 00:16:22,980 I'll keep the data and when you're ready, I will send it 388 00:16:22,980 --> 00:16:24,020 directly in here. 389 00:16:24,020 --> 00:16:26,400 And you can directly probably even send it to the network. 390 00:16:26,400 --> 00:16:27,550 And send it. 391 00:16:27,550 --> 00:16:32,670 So if you're sending a lot of data, copy my old value. 392 00:16:32,670 --> 00:16:35,580 So this might even be better if you're sending a huge 393 00:16:35,580 --> 00:16:36,760 amount of data. 394 00:16:36,760 --> 00:16:40,130 So that's one advantage of having system like that. 395 00:16:40,130 --> 00:16:42,080 And of course, hardware-- 396 00:16:42,080 --> 00:16:44,805 if there's no hardware support, basically still you 397 00:16:44,805 --> 00:16:46,220 have to do some copying in here. 398 00:16:52,950 --> 00:16:55,264 So this is-- 399 00:16:55,264 --> 00:16:57,580 what am I showing here? 400 00:16:57,580 --> 00:17:01,160 So what we are showing in here is non-blocking. 401 00:17:01,160 --> 00:17:06,119 So one way to look at that is when you're sending, when you 402 00:17:06,119 --> 00:17:11,079 request for send, what you can say is, OK, I continue but I 403 00:17:11,079 --> 00:17:12,099 haven't copied the data. 404 00:17:12,099 --> 00:17:14,089 I have my data in here, but I'm doing that. 405 00:17:14,089 --> 00:17:16,524 But what I must tell you, OK, look, this data still hasn't 406 00:17:16,524 --> 00:17:18,150 moved out of my space yet. 407 00:17:18,150 --> 00:17:18,849 So I have to worry. 408 00:17:18,849 --> 00:17:19,880 I can't rewrite the data. 
409 00:17:19,880 --> 00:17:23,030 And at some point, when you say I want the data, it will 410 00:17:23,030 --> 00:17:25,730 go there and bring the data for you. 411 00:17:25,730 --> 00:17:27,440 And catch you like that. 412 00:17:27,440 --> 00:17:30,440 So between this time since I don't want to make too many 413 00:17:30,440 --> 00:17:32,840 copies, I have to make sure that I don't touch that data. 414 00:17:32,840 --> 00:17:34,060 Or I have to copy it. 415 00:17:34,060 --> 00:17:36,900 So that's my request in here. 416 00:17:36,900 --> 00:17:43,590 And of course, if you have no hardware support, you have to 417 00:17:43,590 --> 00:17:48,210 put some time into actually doing the copying. 418 00:17:48,210 --> 00:17:52,600 So this is nice. 419 00:17:52,600 --> 00:17:55,370 But we want to have a little bit of high level 420 00:17:55,370 --> 00:17:56,730 support to do this. 421 00:17:56,730 --> 00:18:01,840 So this is not as nice as things like Cilk, where you 422 00:18:01,840 --> 00:18:03,180 don't have to worry about a lot of other interesting 423 00:18:03,180 --> 00:18:04,360 things going on. 424 00:18:04,360 --> 00:18:08,130 So what people have developed is called MPI, 425 00:18:08,130 --> 00:18:10,040 the Message Passing Interface. 426 00:18:10,040 --> 00:18:12,580 It is kind of a bit foggy. 427 00:18:12,580 --> 00:18:15,470 But that's the best people have these days. 428 00:18:15,470 --> 00:18:19,880 A machine-independent way for distributed 429 00:18:19,880 --> 00:18:22,760 systems to communicate with each other. 430 00:18:22,760 --> 00:18:23,360 So-- 431 00:18:23,360 --> 00:18:24,812 [PHONE RINGING] 432 00:18:24,812 --> 00:18:25,780 Whoops. 433 00:18:25,780 --> 00:18:26,466 That's not good. 434 00:18:26,466 --> 00:18:27,716 My phone. 435 00:18:31,140 --> 00:18:32,900 Sorry about that. 436 00:18:32,900 --> 00:18:35,540 So what happens is each machine has its own processor, 437 00:18:35,540 --> 00:18:37,220 its own memory.
438 00:18:37,220 --> 00:18:39,280 So there's no shared memory or anything like that. 439 00:18:39,280 --> 00:18:41,180 Each runs its own thread of control. 440 00:18:41,180 --> 00:18:45,280 And each process communicates via messages. 441 00:18:45,280 --> 00:18:49,260 And data is sent as needed. 442 00:18:49,260 --> 00:18:53,430 And that means you can't send things like pointers, because 443 00:18:53,430 --> 00:18:54,800 there's no notion of pointers across machines. 444 00:18:54,800 --> 00:18:56,360 You actually have to have a data structure that's 445 00:18:56,360 --> 00:18:59,140 self-contained and send it to the other side. 446 00:18:59,140 --> 00:19:00,630 So here's a small program. 447 00:19:00,630 --> 00:19:02,650 I'm going to walk through that. 448 00:19:02,650 --> 00:19:03,750 So I have main. 449 00:19:03,750 --> 00:19:05,620 And I'm setting a bunch of these variables. 450 00:19:05,620 --> 00:19:08,570 For now, those are not that important. 451 00:19:08,570 --> 00:19:10,820 But for completeness, I have that. 452 00:19:10,820 --> 00:19:12,920 And then, of course, if you use something like MPI, there's a 453 00:19:12,920 --> 00:19:14,490 bunch of setup things that you have. 454 00:19:14,490 --> 00:19:18,000 And so basically you cut and paste what people 455 00:19:18,000 --> 00:19:19,690 normally do as setup. 456 00:19:19,690 --> 00:19:23,770 And then I have this piece of code. 457 00:19:27,490 --> 00:19:32,210 This piece of code, what it does is, this same program 458 00:19:32,210 --> 00:19:37,000 runs on multiple different machines. 459 00:19:37,000 --> 00:19:38,860 So everyone has the same program. 460 00:19:38,860 --> 00:19:41,160 But then at some point, I want to know in my 461 00:19:41,160 --> 00:19:42,320 machine what to do. 462 00:19:42,320 --> 00:19:45,760 So what I do is I check who am I? 463 00:19:45,760 --> 00:19:46,610 Am I machine zero? 464 00:19:46,610 --> 00:19:48,830 If I'm machine zero, do this. 465 00:19:48,830 --> 00:19:50,940 If I'm machine one, do this.
466 00:19:50,940 --> 00:19:52,830 So by doing that, I can write a piece of code 467 00:19:52,830 --> 00:19:54,090 that everybody runs. 468 00:19:54,090 --> 00:19:56,140 And everybody figures out who they are. 469 00:19:56,140 --> 00:19:59,310 And, given who they are, what to do. 470 00:19:59,310 --> 00:20:03,920 So here what it says is, OK, if I'm machine zero, my source 471 00:20:03,920 --> 00:20:05,420 and destination is machine one. 472 00:20:05,420 --> 00:20:06,930 If I'm machine one, my source and 473 00:20:06,930 --> 00:20:08,070 destination is machine zero. 474 00:20:08,070 --> 00:20:10,880 So the two are trying to communicate with each other. 475 00:20:10,880 --> 00:20:16,110 So if you look at what happens, first, I am sending 476 00:20:16,110 --> 00:20:18,870 basically to this machine. 477 00:20:18,870 --> 00:20:22,670 So I'm sending something to this machine, so the syntax-- 478 00:20:22,670 --> 00:20:24,150 I'm not going to go through that. 479 00:20:24,150 --> 00:20:25,240 You don't have to know that. 480 00:20:25,240 --> 00:20:26,760 But what you need to know is that I'm 481 00:20:26,760 --> 00:20:27,780 trying to send something. 482 00:20:27,780 --> 00:20:29,686 I say explicitly who to send to. 483 00:20:29,686 --> 00:20:32,550 And there has to be a matching receive for that data. 484 00:20:32,550 --> 00:20:33,670 Otherwise, the send goes somewhere. 485 00:20:33,670 --> 00:20:36,400 And it just goes bad. 486 00:20:36,400 --> 00:20:37,990 Send here, you can send it. 487 00:20:37,990 --> 00:20:39,520 It can probably go bad. 488 00:20:39,520 --> 00:20:41,780 But for a receive, you have to have somebody who sends for it. 489 00:20:41,780 --> 00:20:43,680 So the receive basically has to have a matching send. 490 00:20:43,680 --> 00:20:45,910 And then you send it in that direction. 491 00:20:45,910 --> 00:20:50,740 And then what I do is I receive in here. 492 00:20:50,740 --> 00:20:53,640 And this gets sent to me in here.
493 00:20:53,640 --> 00:20:56,320 So question I did send receive here. 494 00:20:56,320 --> 00:20:59,450 What would happen if I did also send receive here? 495 00:20:59,450 --> 00:21:03,100 If I reorganized these two, what would happen? 496 00:21:03,100 --> 00:21:06,240 If I used the same piece of code, that two pieces of code. 497 00:21:06,240 --> 00:21:08,250 Then I don't even have do a bit to make this 498 00:21:08,250 --> 00:21:08,910 two separate code. 499 00:21:08,910 --> 00:21:11,910 I can basically factor this out down here. 500 00:21:11,910 --> 00:21:14,070 I do a send, receive; send, receive here. 501 00:21:14,070 --> 00:21:16,930 And then send the just IDs. 502 00:21:19,950 --> 00:21:21,200 What happen? 503 00:21:29,194 --> 00:21:30,444 AUDIENCE: It works without a buffer. 504 00:21:33,110 --> 00:21:34,210 PROFESSOR: These things are what you 505 00:21:34,210 --> 00:21:35,620 called blocking sends. 506 00:21:35,620 --> 00:21:38,040 If you have blocking sends, it means that until the receiver 507 00:21:38,040 --> 00:21:40,060 receives it might be blocked if you are 508 00:21:40,060 --> 00:21:41,710 doing a blocking send. 509 00:21:41,710 --> 00:21:42,200 OK. 510 00:21:42,200 --> 00:21:43,460 That means if two guys are trying to 511 00:21:43,460 --> 00:21:45,520 send, nobody is receiving. 512 00:21:45,520 --> 00:21:46,570 You have what? 513 00:21:46,570 --> 00:21:47,362 AUDIENCE: Deadlock. 514 00:21:47,362 --> 00:21:48,420 PROFESSOR: You have deadlock. 515 00:21:48,420 --> 00:21:49,960 So that's why I actually had to do this. 516 00:21:49,960 --> 00:21:51,630 This is called blocking send. 517 00:21:51,630 --> 00:21:55,220 So instead of blocking sends-- 518 00:21:55,220 --> 00:21:56,800 So of course, those are finalized things 519 00:21:56,800 --> 00:21:58,590 and do that up there. 520 00:21:58,590 --> 00:21:59,960 I can do this one. 
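Before moving to the non-blocking version, the blocking exchange discussed above can be sketched in a few lines. This is not MPI: it is a minimal Python simulation, assuming a hypothetical `Channel` class whose `send` blocks until the partner's `recv` takes the message, with a short timeout standing in for "stuck forever". The names `Channel`, `run_exchange`, and the message strings are illustrative and do not come from the slide.

```python
import queue
import threading


class Channel:
    """One-directional link; send() blocks until the matching recv() happens."""

    def __init__(self):
        self._slot = queue.Queue(maxsize=1)   # holds the message in flight
        self._taken = queue.Queue(maxsize=1)  # receiver's "I took it" signal

    def send(self, msg, timeout=0.5):
        self._slot.put(msg, timeout=timeout)
        # Block until a receiver picks the message up (queue.Empty on timeout).
        self._taken.get(timeout=timeout)

    def recv(self, timeout=0.5):
        msg = self._slot.get(timeout=timeout)
        self._taken.put(True)
        return msg


def run_exchange(both_send_first):
    """Run the same program on rank 0 and rank 1; return what each rank got."""
    links = {(0, 1): Channel(), (1, 0): Channel()}
    results = {}

    def program(rank):
        partner = 1 - rank  # "who am I?" decides who I talk to
        try:
            if rank == 0 or both_send_first:
                # Send first, then receive.
                links[(rank, partner)].send(f"hello from {rank}")
                results[rank] = links[(partner, rank)].recv()
            else:
                # Receive first, then send: this breaks the cycle.
                results[rank] = links[(partner, rank)].recv()
                links[(rank, partner)].send(f"hello from {rank}")
        except (queue.Empty, queue.Full):
            # Nobody took our message in time: report it as a deadlock.
            results[rank] = "deadlock"

    threads = [threading.Thread(target=program, args=(r,)) for r in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_exchange(both_send_first=False))
    print(run_exchange(both_send_first=True))
```

When rank 0 sends first and rank 1 receives first, both messages arrive. When both ranks send first, each one waits for a receive that never starts, which is the deadlock from the discussion.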
521 00:21:59,960 --> 00:22:03,040 What this says is-- 522 00:22:03,040 --> 00:22:04,530 This is actually a little more complicated. 523 00:22:04,530 --> 00:22:06,990 What I'm doing is I have a bunch of buffers here. 524 00:22:06,990 --> 00:22:10,690 I have how many processors? 525 00:22:10,690 --> 00:22:12,500 I have a bunch of buffers in here. 526 00:22:12,500 --> 00:22:17,450 I have, I guess, my ID number of processors-- 527 00:22:17,450 --> 00:22:19,560 no, numtasks number of processors. 528 00:22:19,560 --> 00:22:21,870 What I am sending, say, is a circular buffer. 529 00:22:21,870 --> 00:22:24,991 I'm sending around to everybody. 530 00:22:24,991 --> 00:22:26,350 Both directions. 531 00:22:26,350 --> 00:22:28,060 So I am sending to the previous and next. 532 00:22:28,060 --> 00:22:30,520 So assume something is sitting in numtasks. 533 00:22:30,520 --> 00:22:31,960 I am sending back and forth. 534 00:22:31,960 --> 00:22:35,300 So here what I am doing is basically non-blocking sends 535 00:22:35,300 --> 00:22:35,980 and receives. 536 00:22:35,980 --> 00:22:38,700 So first I'm issuing a receive. 537 00:22:38,700 --> 00:22:41,110 So when I issue a receive, it says I have intent 538 00:22:41,110 --> 00:22:41,660 to receive. 539 00:22:41,660 --> 00:22:42,950 But I am not receiving something. 540 00:22:42,950 --> 00:22:44,170 I am not waiting. 541 00:22:44,170 --> 00:22:45,500 So I can continue. 542 00:22:45,500 --> 00:22:46,940 And then I am doing the send. 543 00:22:46,940 --> 00:22:48,900 So otherwise, if I just do receive and send, and they 544 00:22:48,900 --> 00:22:50,660 are blocking, it's going to be deadlocked. 545 00:22:50,660 --> 00:22:51,970 But here I do that. 546 00:22:51,970 --> 00:22:55,030 And then there is this wait for all. 547 00:22:55,030 --> 00:22:57,590 What it says is, OK, now I issued a receive. 548 00:22:57,590 --> 00:23:00,010 Now wait until that receive is done.
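The non-blocking pattern just described (post the receives, post the sends, wait for the receives, and only then use the data) might look roughly like the sketch below. It is a Python simulation rather than MPI: posting an operation is modeled by submitting it to a thread pool, and waiting on it by calling `result()` on the returned future. The function name `ring_exchange`, the `mailbox` dictionary, and the choice to exchange rank ids and sum them are assumptions made for illustration.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def ring_exchange(numtasks):
    """Each rank trades its id with its previous and next neighbor in a ring."""
    # One mailbox per directed (source, destination) pair.
    mailbox = {(s, d): queue.Queue()
               for s in range(numtasks) for d in range(numtasks)}
    results = [None] * numtasks

    def program(rank):
        prev = (rank - 1) % numtasks
        nxt = (rank + 1) % numtasks
        with ThreadPoolExecutor(max_workers=4) as pool:
            # Post the receives first: intent to receive, without waiting.
            recvs = [pool.submit(mailbox[(src, rank)].get, timeout=5)
                     for src in (prev, nxt)]
            # Post the sends; these also return immediately.
            sends = [pool.submit(mailbox[(rank, dst)].put, rank)
                     for dst in (prev, nxt)]
            # Wait for the receives before touching the data.
            data = [r.result() for r in recvs]
            # The "work" that uses the received data.
            results[rank] = sum(data)
            # Wait for the sends only after the work is done.
            for s in sends:
                s.result()

    threads = [threading.Thread(target=program, args=(r,))
               for r in range(numtasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(ring_exchange(4))
```

Every rank runs the same send-and-receive code in the same order, yet nothing deadlocks, because no rank stops to wait until both of its operations have been posted.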
549 00:23:00,010 --> 00:23:03,200 So before I use the data, I have to wait for it in there. 550 00:23:03,200 --> 00:23:09,650 And also, when I do the send, I am wait for all in here. 551 00:23:09,650 --> 00:23:15,190 So why do you think it might be advantageous to do a 552 00:23:15,190 --> 00:23:20,020 non-blocking receives and non-blocking sends? 553 00:23:20,020 --> 00:23:24,450 So sends, it makes perfect sense, because once I have 554 00:23:24,450 --> 00:23:26,480 sent, I won't do anything because I don't have to wait 555 00:23:26,480 --> 00:23:26,940 for anything. 556 00:23:26,940 --> 00:23:28,120 I am done. 557 00:23:28,120 --> 00:23:31,400 So blocking sends is not that useful. 558 00:23:31,400 --> 00:23:34,950 But receives, why do you want to do non-blocking receives, 559 00:23:34,950 --> 00:23:36,060 then a blocking receive? 560 00:23:36,060 --> 00:23:37,940 Because then you won't receive. 561 00:23:37,940 --> 00:23:39,310 You have to wait till the data comes to do anything. 562 00:23:39,310 --> 00:23:41,940 Because that's what non-blocking receives means. 563 00:23:41,940 --> 00:23:45,930 I have to receive early and then wait for the receives to 564 00:23:45,930 --> 00:23:48,270 happen at this point. 565 00:23:48,270 --> 00:23:50,840 What might be an advantage of doing a non-blocking receive? 566 00:23:53,650 --> 00:23:54,900 Anybody can think of an advantage? 567 00:23:59,800 --> 00:24:00,535 It's harder. 568 00:24:00,535 --> 00:24:04,300 Because now you have to remove the receives from the 569 00:24:04,300 --> 00:24:06,410 synchronization point instead of writing one receive. 570 00:24:09,612 --> 00:24:09,972 AUDIENCE: Because when sends come from other machines, we 571 00:24:09,972 --> 00:24:11,222 can receive it. 572 00:24:17,330 --> 00:24:19,150 PROFESSOR: That might be one interesting thing because you 573 00:24:19,150 --> 00:24:20,780 are expecting multiple receives. 574 00:24:20,780 --> 00:24:22,360 You don't know what's coming first. 
575 00:24:22,360 --> 00:24:26,330 If you do non-blocking receive, then you can-- 576 00:24:26,330 --> 00:24:29,710 take whichever one arrives first and basically work on that. 577 00:24:29,710 --> 00:24:30,190 That's a very good point. 578 00:24:30,190 --> 00:24:30,390 OK. 579 00:24:30,390 --> 00:24:31,640 What else? 580 00:24:39,860 --> 00:24:41,445 What other advantages might you have? 581 00:24:41,445 --> 00:24:42,695 Having a non-blocking receive? 582 00:24:52,640 --> 00:24:54,909 AUDIENCE: If the receive fails, we can just have them 583 00:24:54,909 --> 00:24:57,570 resend it again. 584 00:24:57,570 --> 00:24:58,940 PROFESSOR: Receive fails? 585 00:24:58,940 --> 00:24:59,200 OK. 586 00:24:59,200 --> 00:24:59,350 Receive fails. 587 00:24:59,350 --> 00:25:01,090 See, that's complicated. 588 00:25:01,090 --> 00:25:03,900 But another interesting thing might be space. 589 00:25:03,900 --> 00:25:06,940 Because when I issue the non-blocking receive, I know 590 00:25:06,940 --> 00:25:09,150 where the data has to be. 591 00:25:09,150 --> 00:25:12,060 So I already allocated a buffer for that. 592 00:25:12,060 --> 00:25:14,900 So, if the data comes now, I can directly copy into my 593 00:25:14,900 --> 00:25:16,900 local buffer if there's already a receive issued, if 594 00:25:16,900 --> 00:25:18,790 there's already a space allocated. 595 00:25:18,790 --> 00:25:21,935 Normally, the other way around, until you issue the 596 00:25:21,935 --> 00:25:23,830 receive, you don't know where the data has to be, so it has 597 00:25:23,830 --> 00:25:25,590 to get copied at that point. 598 00:25:25,590 --> 00:25:28,900 So here, you can keep the buffer, and hopefully, if you 599 00:25:28,900 --> 00:25:32,680 are lucky, the send hasn't happened yet, so you have already 600 00:25:32,680 --> 00:25:34,870 set up the buffer, and then, when the data comes, say, aha, 601 00:25:34,870 --> 00:25:36,140 here is the matching receive.
602 00:25:36,140 --> 00:25:38,580 Directly put it there, bypassing the 603 00:25:38,580 --> 00:25:39,280 copying in the middle. 604 00:25:39,280 --> 00:25:42,110 So that's the advantage here. 605 00:25:42,110 --> 00:25:45,310 So, I am here. 606 00:25:45,310 --> 00:25:47,400 I did a wait for the receives here. 607 00:25:47,400 --> 00:25:48,950 Did the work that uses the data. 608 00:25:48,950 --> 00:25:52,760 And wait for sends here, afterwards. 609 00:25:52,760 --> 00:25:53,760 OK. 610 00:25:53,760 --> 00:25:57,080 Could I have moved wait for sends before the work? 611 00:26:02,370 --> 00:26:04,790 What happens if I have moved wait for 612 00:26:04,790 --> 00:26:06,096 sends before the work? 613 00:26:06,096 --> 00:26:08,130 Is it incorrect? 614 00:26:08,130 --> 00:26:09,460 How many people think this is incorrect? 615 00:26:12,450 --> 00:26:15,610 Is it incorrect to move this wait for sends? 616 00:26:15,610 --> 00:26:18,480 All the sends before the work, to move this item above? 617 00:26:18,480 --> 00:26:21,310 Because work is where all the work happens, I assume, that 618 00:26:21,310 --> 00:26:23,060 uses this data. 619 00:26:23,060 --> 00:26:25,751 So wait for sends above: what happens? 620 00:26:25,751 --> 00:26:27,001 AUDIENCE: Well, if that's incorrect, then you lose... 621 00:26:28,980 --> 00:26:30,790 PROFESSOR: Yeah, you lose-- you're waiting for something 622 00:26:30,790 --> 00:26:32,060 that you don't have to wait for. 623 00:26:32,060 --> 00:26:35,110 Of course, you can't move these down, because that means you 624 00:26:35,110 --> 00:26:37,940 might start using, and try to use data that's not there. 625 00:26:37,940 --> 00:26:39,910 So, this has to be here. 626 00:26:39,910 --> 00:26:42,360 And this, basically, for performance 627 00:26:42,360 --> 00:26:44,205 purposes, has to be after. 628 00:26:48,900 --> 00:26:50,740 So, of course you have to worry about a lot of 629 00:26:50,740 --> 00:26:51,580 correctness issues.
630 00:26:51,580 --> 00:26:53,880 One is deadlocks. 631 00:26:53,880 --> 00:26:55,370 So, there are two types of deadlocks. 632 00:26:55,370 --> 00:26:57,480 That's blocking sends and receives, 633 00:26:57,480 --> 00:26:58,650 what we talked about. 634 00:26:58,650 --> 00:27:01,180 But there's also other types of deadlocks that happen 635 00:27:01,180 --> 00:27:04,060 because of resources. 636 00:27:04,060 --> 00:27:07,590 So, let me get to that in the next slide. 637 00:27:07,590 --> 00:27:09,940 And the other interesting thing that can 638 00:27:09,940 --> 00:27:11,450 happen is stale data. 639 00:27:11,450 --> 00:27:14,860 In your shared memory machine, need to update the data. 640 00:27:14,860 --> 00:27:17,140 You know everybody's going to see that. 641 00:27:17,140 --> 00:27:19,700 Because the hardware takes care of that. 642 00:27:19,700 --> 00:27:23,270 But, in a message passing machine, it's up to you get 643 00:27:23,270 --> 00:27:25,410 the latest data when it's needed. 644 00:27:25,410 --> 00:27:27,540 So, if you don't have the data, you think, aha, I have 645 00:27:27,540 --> 00:27:29,760 the data, but it might not be the right value because you 646 00:27:29,760 --> 00:27:31,200 haven't gotten something new. 647 00:27:31,200 --> 00:27:34,150 So, it's up to you to basically send the 648 00:27:34,150 --> 00:27:36,710 data out and that. 649 00:27:36,710 --> 00:27:41,610 And, also, robustness is a big issue because the fact that 650 00:27:41,610 --> 00:27:44,080 you have multiple machines means you can make it robust, 651 00:27:44,080 --> 00:27:46,950 but the other flip side is up to you to make it robust. 652 00:27:46,950 --> 00:27:49,880 So that means you have to figure out if a machine fails, 653 00:27:49,880 --> 00:27:51,110 how to respond to that. 654 00:27:51,110 --> 00:27:54,480 So, if you're waiting for a machine there for it fails. 655 00:27:54,480 --> 00:27:55,530 OK? 
656 00:27:55,530 --> 00:27:57,700 There are a lot of issues, it time out, and then you have to 657 00:27:57,700 --> 00:27:59,390 go and deal with that. 658 00:27:59,390 --> 00:28:01,580 So, that can make the programming a lot more 659 00:28:01,580 --> 00:28:02,720 complicated. 660 00:28:02,720 --> 00:28:07,950 And if you just don't do that, your overall program would be 661 00:28:07,950 --> 00:28:10,670 a lot less robust than a single machine because there 662 00:28:10,670 --> 00:28:12,380 could be a lot more failures in the large system. 663 00:28:17,080 --> 00:28:21,550 So, here's a kind of deadlock that can happen. 664 00:28:21,550 --> 00:28:28,700 What I am doing is processor zero is sending and processor 665 00:28:28,700 --> 00:28:31,460 one is sending data to each other. 666 00:28:31,460 --> 00:28:35,400 It doesn't have a read write deadlock because I am for 667 00:28:35,400 --> 00:28:37,830 sending, sending and then I am receiving, receiving. 668 00:28:37,830 --> 00:28:40,680 The sends here, and the receives from here, the sends 669 00:28:40,680 --> 00:28:42,470 here, and then receive. 670 00:28:42,470 --> 00:28:44,690 So normally it looks like I'm sending two things, It should 671 00:28:44,690 --> 00:28:46,700 go, and I am receiving that. 672 00:28:46,700 --> 00:28:51,550 But, assuming that I am sending huge amount of data. 673 00:28:51,550 --> 00:28:51,660 OK. 674 00:28:51,660 --> 00:28:55,930 So I start sending, and there's not 675 00:28:55,930 --> 00:28:57,710 enough room for received. 676 00:28:57,710 --> 00:29:00,630 Just say, OK, I don't have any room to receive, I have to 677 00:29:00,630 --> 00:29:03,410 wait until the data, at least a data start getting consumed 678 00:29:03,410 --> 00:29:04,942 to start receiving. 679 00:29:04,942 --> 00:29:05,350 OK. 
680 00:29:05,350 --> 00:29:08,410 If you keep sending multiple of send outs, you might do a 681 00:29:08,410 --> 00:29:10,270 multiple of send outs in more than one, multiple of sends 682 00:29:10,270 --> 00:29:13,310 here, I might get deadlocked, I might get blocked because 683 00:29:13,310 --> 00:29:15,550 they can have receive, receive, and he's also trying 684 00:29:15,550 --> 00:29:17,470 to send multiple things, I might get blocked in here. 685 00:29:17,470 --> 00:29:19,920 So I might get into this criss-cross situation. 686 00:29:19,920 --> 00:29:22,200 If you were trying to send something, but other guy can't 687 00:29:22,200 --> 00:29:24,000 proceed, up to get received. 688 00:29:24,000 --> 00:29:27,570 So even though, if program look like there's a nice 689 00:29:27,570 --> 00:29:29,680 matching send and receive, there's no cycles. 690 00:29:29,680 --> 00:29:33,280 There's a cycle created by the resource usage in here. 691 00:29:33,280 --> 00:29:37,870 So, if doing lot of sends before a lot receives, and 692 00:29:37,870 --> 00:29:39,670 vice versa, you have to be careful. 693 00:29:39,670 --> 00:29:42,750 If you do too much of that, it's nice to block things, 694 00:29:42,750 --> 00:29:43,610 move things up. 695 00:29:43,610 --> 00:29:46,120 But if you have too many things, then you might end up 696 00:29:46,120 --> 00:29:47,230 in deadlock situation. 697 00:29:47,230 --> 00:29:49,402 Even though, traditionally, it might not happen. 698 00:29:52,000 --> 00:29:55,620 So, you have host of other performance issues 699 00:29:55,620 --> 00:29:57,030 that you'll deal with. 700 00:29:57,030 --> 00:30:01,140 So let me address couple of them and what 701 00:30:01,140 --> 00:30:02,780 it might shows up. 702 00:30:02,780 --> 00:30:06,620 So one big thing is occupancy cost. 703 00:30:06,620 --> 00:30:14,320 Because, when you do shared memory, minute you basically 704 00:30:14,320 --> 00:30:15,240 showing instructions. 
705 00:30:15,240 --> 00:30:18,300 The instruction goes, executes, you are done, and that 706 00:30:18,300 --> 00:30:19,550 operation is finished. 707 00:30:22,820 --> 00:30:26,250 When you are doing message passing, each 708 00:30:26,250 --> 00:30:27,070 message is very expensive. 709 00:30:27,070 --> 00:30:28,510 It has to do a lot of things. 710 00:30:28,510 --> 00:30:31,260 You have to do a context switch, do a buffer copy, and 711 00:30:31,260 --> 00:30:34,580 protocol stack processing, and then might have to do 712 00:30:34,580 --> 00:30:36,220 another context switch for that. 713 00:30:36,220 --> 00:30:39,180 And there's a lot of this copying and stuff happening, 714 00:30:39,180 --> 00:30:43,880 and the network controller might interrupt the operating 715 00:30:43,880 --> 00:30:46,640 system, because there's data coming, copy the data in 716 00:30:46,640 --> 00:30:49,410 there, and then notify the application. 717 00:30:49,410 --> 00:30:51,300 So there's this huge amount of things happening 718 00:30:51,300 --> 00:30:52,790 just for one message. 719 00:30:52,790 --> 00:30:57,120 If you are sending like one lousy byte, or even like one 720 00:30:57,120 --> 00:31:00,690 kilobyte, it's just doing millions of instructions, 721 00:31:00,690 --> 00:31:03,180 just on behalf of a small amount of data. 722 00:31:03,180 --> 00:31:04,740 So that's a large amount of cost 723 00:31:04,740 --> 00:31:05,990 associated with that overhead. 724 00:31:08,710 --> 00:31:10,920 So, the setup cost is already very high. 725 00:31:10,920 --> 00:31:14,370 And, so what you want to do is you want to amortize the cost 726 00:31:14,370 --> 00:31:16,320 by sending large messages. 727 00:31:16,320 --> 00:31:18,190 So what you want to say is, look, I'm not sending this 728 00:31:18,190 --> 00:31:20,390 small thing, I'm going to accumulate a lot of things, 729 00:31:20,390 --> 00:31:22,820 I'm going to send everything in bulk if I can.
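A rough cost model makes the amortization argument concrete. The sketch below assumes a fixed overhead per message plus a transfer term proportional to the number of bytes. The function name `transfer_time` and the numbers (50 microseconds of overhead per message, 1 GB/s of bandwidth, 1 MB of payload) are hypothetical values chosen for illustration, not measurements from the lecture.

```python
def transfer_time(total_bytes, message_bytes, overhead_s, bandwidth_bytes_per_s):
    """Estimated seconds to move total_bytes in messages of message_bytes each."""
    messages = -(-total_bytes // message_bytes)  # ceiling division
    # Every message pays the fixed overhead; the payload pays for bandwidth.
    return messages * overhead_s + total_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    OVERHEAD = 50e-6   # assumed fixed cost per message, in seconds
    BANDWIDTH = 1e9    # assumed link bandwidth, in bytes per second
    for size in (1, 1024, 65536):
        t = transfer_time(1_000_000, size, OVERHEAD, BANDWIDTH)
        print(f"{size:>6}-byte messages: {t:.4f} s")
```

Under these assumptions, one-byte messages spend almost all of their time in per-message overhead, while 64 KB messages are dominated by the actual transfer.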
730 00:31:22,820 --> 00:31:24,940 And then you can basically amortize these 731 00:31:24,940 --> 00:31:26,190 costs of doing things. 732 00:31:28,700 --> 00:31:32,450 Other thing is communications is excruciatingly slow. 733 00:31:32,450 --> 00:31:37,220 So even the memory system, it's about probably a couple 734 00:31:37,220 --> 00:31:43,140 of hundred plus compared to CPU communicating. 735 00:31:43,140 --> 00:31:47,190 In the cluster interconnect, you can do tens of thousands 736 00:31:47,190 --> 00:31:53,880 of instructions in the CPU, by the time it get communicated. 737 00:31:53,880 --> 00:31:57,820 In a grid, or if you are doing through the internet, and then 738 00:31:57,820 --> 00:31:58,910 it sits in the seconds now. 739 00:31:58,910 --> 00:31:59,790 You can actually feel it. 740 00:31:59,790 --> 00:32:01,895 And then your processor can run millions of instructions 741 00:32:01,895 --> 00:32:03,220 in that time. 742 00:32:03,220 --> 00:32:06,440 And so if you are waiting for something, you had to wait for 743 00:32:06,440 --> 00:32:07,460 a very long time. 744 00:32:07,460 --> 00:32:09,220 They might not have enough things to stuff in the middle, 745 00:32:09,220 --> 00:32:12,420 to kind of amortize the cost. 746 00:32:12,420 --> 00:32:13,560 And then you have to worry about that. 747 00:32:13,560 --> 00:32:17,180 So what that means is if you start waiting for something to 748 00:32:17,180 --> 00:32:19,420 happen, you are waiting for a long time. 749 00:32:19,420 --> 00:32:22,160 So you have to figure out putting things in there. 750 00:32:22,160 --> 00:32:25,775 Not waiting this, non-blocking things is very important 751 00:32:25,775 --> 00:32:28,310 because of that. 
752 00:32:28,310 --> 00:32:37,300 And so normally what you want to do is you always want to have 753 00:32:37,300 --> 00:32:40,260 split operations, that means you want to kind of 754 00:32:40,260 --> 00:32:43,680 initiate something at some point, very early on, and then 755 00:32:43,680 --> 00:32:45,870 use it later, especially if you're looking for something 756 00:32:45,870 --> 00:32:46,770 like a receive. 757 00:32:46,770 --> 00:32:50,200 So if I want to, especially if I want to get some-- 758 00:32:50,200 --> 00:32:53,180 normally in shared memory, you're just doing a simple 759 00:32:53,180 --> 00:32:55,690 thing, asking something and getting a reply, very simple. 760 00:32:55,690 --> 00:32:58,370 Here, because if you just do that, there's a huge waiting 761 00:32:58,370 --> 00:33:01,230 time, so you want to kind of do split operations. 762 00:33:01,230 --> 00:33:03,390 So, this code can be very complicated, 763 00:33:03,390 --> 00:33:04,830 because you try to-- 764 00:33:04,830 --> 00:33:07,240 if you want to ask something, you ask very early, you do a 765 00:33:07,240 --> 00:33:10,585 lot of other things before the reply comes. 766 00:33:10,585 --> 00:33:12,490 AUDIENCE: [INAUDIBLE]. 767 00:33:12,490 --> 00:33:13,740 PROFESSOR: Oops. 768 00:33:16,750 --> 00:33:17,220 OK. 769 00:33:17,220 --> 00:33:20,330 Let's see how many times I have to press that. 770 00:33:20,330 --> 00:33:21,580 Before I realize I have to reboot. 771 00:33:24,810 --> 00:33:28,260 So, if you want to rendezvous with-- 772 00:33:28,260 --> 00:33:30,480 normally, what that means is that two parties have to kind of 773 00:33:30,480 --> 00:33:31,890 synchronize at the same point. 774 00:33:31,890 --> 00:33:34,520 So what you can do is a three-way: the sender sends 775 00:33:34,520 --> 00:33:38,055 a request, the receiver acks with, OK, it's OK to send, and the 776 00:33:38,055 --> 00:33:39,130 sender delivers the data.
777 00:33:39,130 --> 00:33:41,140 So that means I have to do a three-way communication. 778 00:33:41,140 --> 00:33:43,620 Or, this alternative with two-way, you sender doesn't 779 00:33:43,620 --> 00:33:46,560 send anything, receiver basically send a request, and 780 00:33:46,560 --> 00:33:47,270 then you send the data. 781 00:33:47,270 --> 00:33:50,930 So this could be faster because there's less things to 782 00:33:50,930 --> 00:33:53,780 do in here. 783 00:33:53,780 --> 00:33:57,480 There's another method called RMA, or it's another name you 784 00:33:57,480 --> 00:33:59,180 might see, it's called active messages. 785 00:33:59,180 --> 00:34:02,170 Where you don't ask the receiver. 786 00:34:02,170 --> 00:34:04,995 When you send, you have some pre-assigned place you can go 787 00:34:04,995 --> 00:34:05,750 and dump the data. 788 00:34:05,750 --> 00:34:07,120 So you don't wait for somebody to ask for data. 789 00:34:07,120 --> 00:34:09,585 You said, OK, if I want send, I'll immediately send, and put 790 00:34:09,585 --> 00:34:10,260 it somewhere. 791 00:34:10,260 --> 00:34:14,840 And the data that I send the place in here. 792 00:34:14,840 --> 00:34:19,900 So, the first slide I showed you, all at that time, you saw 793 00:34:19,900 --> 00:34:23,330 all these big difference in the time, either this can 794 00:34:23,330 --> 00:34:26,210 happen, the message can go very fast, or sometime it will 795 00:34:26,210 --> 00:34:27,080 be really slow. 796 00:34:27,080 --> 00:34:30,889 There's a big variation in here. 797 00:34:30,889 --> 00:34:35,920 This happen basically because of the network communications. 798 00:34:35,920 --> 00:34:38,454 How many of you know a little bit about TCP? 799 00:34:41,179 --> 00:34:42,429 OK. 800 00:34:46,860 --> 00:34:49,840 Let me just explain about five minutes of TCP 801 00:34:49,840 --> 00:34:50,889 before I move on. 
802 00:34:50,889 --> 00:34:55,600 So TCP is one of the main protocols that we use to 803 00:34:55,600 --> 00:34:56,639 communicate over Internet. 804 00:34:56,639 --> 00:34:59,730 And two things, you want to actually send data. 805 00:34:59,730 --> 00:35:03,100 But also, you want to be a good citizen. 806 00:35:03,100 --> 00:35:07,700 You have actually work in a way that it doesn't really 807 00:35:07,700 --> 00:35:10,320 take over the entire shared bandwidth you have. 808 00:35:10,320 --> 00:35:13,120 So what TCP does, it has a window. 809 00:35:13,120 --> 00:35:16,320 So what it says is, OK, I can send certain amount of data 810 00:35:16,320 --> 00:35:20,780 that's the size of the window, but I can't move beyond that 811 00:35:20,780 --> 00:35:22,640 until I get some acknowledgement. 812 00:35:22,640 --> 00:35:24,460 So I send the window amount of data. 813 00:35:24,460 --> 00:35:26,650 And, on the other side, when it's received that data, I 814 00:35:26,650 --> 00:35:28,600 said, I have seen this much of the window. 815 00:35:28,600 --> 00:35:32,060 And once it's seen this much of the window this send that 816 00:35:32,060 --> 00:35:33,260 acknowledgement back. 817 00:35:33,260 --> 00:35:34,450 And when that acknowledgement comes, you 818 00:35:34,450 --> 00:35:35,670 say, aha, that's good. 819 00:35:35,670 --> 00:35:36,700 That means it has seen that. 820 00:35:36,700 --> 00:35:37,480 Then I can send more. 821 00:35:37,480 --> 00:35:38,630 I keep sending more. 822 00:35:38,630 --> 00:35:41,070 And then, the TCP has this very interesting property. 823 00:35:41,070 --> 00:35:43,410 If you are doing a really good communication, things are 824 00:35:43,410 --> 00:35:46,490 going very nicely, it starts increasing the window size. 825 00:35:46,490 --> 00:35:49,300 It says, oh, OK, that means that windows size keeps going. 826 00:35:49,300 --> 00:35:51,000 I can do bigger and bigger and bigger window size. 
827 00:35:51,000 --> 00:35:52,810 You can keep increasing the window size. 828 00:35:52,810 --> 00:35:57,130 And then, what happens at some point, the system get 829 00:35:57,130 --> 00:35:59,200 overloaded, because if you everyone is start-- 830 00:35:59,200 --> 00:36:02,080 increase their window size, too many packets start coming 831 00:36:02,080 --> 00:36:02,720 into the network. 832 00:36:02,720 --> 00:36:04,710 At some point, the network in the middle, doesn't have 833 00:36:04,710 --> 00:36:05,530 enough room. 834 00:36:05,530 --> 00:36:06,780 It drops something. 835 00:36:06,780 --> 00:36:09,720 So, TCP, nice thing about the network is that, even if it 836 00:36:09,720 --> 00:36:11,520 doesn't have a guarantee that it will guarantee it can just 837 00:36:11,520 --> 00:36:12,380 drop something. 838 00:36:12,380 --> 00:36:15,260 And when it drops, what happens is the other guy is 839 00:36:15,260 --> 00:36:15,960 waiting for acknowledgement. 840 00:36:15,960 --> 00:36:18,670 So waiting for data to come, it never shows up. 841 00:36:18,670 --> 00:36:21,840 So then it has this thing, an ack, saying I never got it. 842 00:36:21,840 --> 00:36:25,960 And the problem with that is, so you have a nice bandwidth, 843 00:36:25,960 --> 00:36:28,010 and you get increasing the bandwidth, you get faster and 844 00:36:28,010 --> 00:36:30,730 faster and faster, and suddenly data get missed. 845 00:36:30,730 --> 00:36:33,610 And suddenly you have this big timeout delay. 846 00:36:33,610 --> 00:36:35,060 And then everybody freezes. 847 00:36:35,060 --> 00:36:37,980 And then get the ack, and you restart with a smaller window 848 00:36:37,980 --> 00:36:39,660 and slowly pick up, something like that. 849 00:36:39,660 --> 00:36:42,500 So because of that, there a lot of times what happens, is 850 00:36:42,500 --> 00:36:46,300 this packet get dropped, retransmit happens, so you 851 00:36:46,300 --> 00:36:47,820 have this kind of sawtooth pattern. 
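The sawtooth can be reproduced with a toy model of the window. This is only a sketch: real TCP congestion control has more pieces (slow start, retransmission timers, and so on). The rules used here, growing the window by one each round and halving it when it exceeds what the network can carry, are assumptions made for illustration, as are the names `simulate_window` and `capacity`.

```python
def simulate_window(rounds, capacity, start=1):
    """Return the window size used in each round of a toy sender."""
    window = start
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity:
            # The network had no room and dropped something: back off.
            window = max(1, window // 2)
        else:
            # Everything was acknowledged: try a slightly bigger window.
            window += 1
    return history


if __name__ == "__main__":
    print(simulate_window(12, capacity=4))
```

The window climbs until it overshoots the capacity, falls back, and climbs again, which gives the repeating ramp-and-drop shape described above.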
852 00:36:47,820 --> 00:36:51,400 Things get faster and faster, things go down to nothing, 853 00:36:51,400 --> 00:36:56,430 for a little while, then start again, in here. 854 00:36:56,430 --> 00:37:01,420 So the other way of communicating is called UDP. 855 00:37:01,420 --> 00:37:04,030 UDP says, OK, I don't have any kind of 856 00:37:04,030 --> 00:37:06,450 acknowledgement, or something like that, I'll just send. 857 00:37:06,450 --> 00:37:08,870 And you, on the other end, figure out whether you 858 00:37:08,870 --> 00:37:10,050 got something or not. 859 00:37:10,050 --> 00:37:11,820 And send information back. 860 00:37:11,820 --> 00:37:15,190 So the network doesn't participate in any kind of a 861 00:37:15,190 --> 00:37:16,750 balancing act of communication. 862 00:37:16,750 --> 00:37:17,830 It's end-to-end. 863 00:37:17,830 --> 00:37:20,220 So of course, you can be a really bad citizen, and say, 864 00:37:20,220 --> 00:37:20,880 OK, I don't care. 865 00:37:20,880 --> 00:37:23,210 I just keep sending huge amount of data and then 866 00:37:23,210 --> 00:37:24,370 somebody would go well. 867 00:37:24,370 --> 00:37:31,300 But, what people have found is for things like video, you can 868 00:37:31,300 --> 00:37:33,650 send UDP and kind of manage it yourself 869 00:37:33,650 --> 00:37:35,560 end-to-end, and get much better results than 870 00:37:35,560 --> 00:37:38,110 trying to do this with TCP. 871 00:37:38,110 --> 00:37:40,900 So the kind of thing is, even though there's no 872 00:37:40,900 --> 00:37:45,930 acknowledgment, or there's no real attempt to make sure all 873 00:37:45,930 --> 00:37:49,420 the data goes, UDP sometimes can get better bandwidth, 874 00:37:49,420 --> 00:37:52,540 because it doesn't back off on dropped packets. 875 00:37:52,540 --> 00:37:55,140 So, there's a lot of great stuff, I mean you guys can 876 00:37:55,140 --> 00:38:01,230 take the networking class and learn all about the protocols 877 00:38:01,230 --> 00:38:01,740 and stuff like that.
878 00:38:01,740 --> 00:38:07,240 There's really, really cool stuff in here, so some of you 879 00:38:07,240 --> 00:38:10,890 might actually, next couple of semesters, learn all about how 880 00:38:10,890 --> 00:38:11,790 these things work. 881 00:38:11,790 --> 00:38:16,010 So I'm just giving you, lot of performance wise, these are 882 00:38:16,010 --> 00:38:18,390 the issues that you are to worry about when you are doing 883 00:38:18,390 --> 00:38:21,040 network level things. 884 00:38:21,040 --> 00:38:21,810 OK. 885 00:38:21,810 --> 00:38:27,170 So that's kind of talks about a little bit about a small 886 00:38:27,170 --> 00:38:31,456 scale, and then if you want to go to next bigger scale-- 887 00:38:31,456 --> 00:38:33,160 why you want to go? 888 00:38:33,160 --> 00:38:35,030 There can be lot more uses. 889 00:38:35,030 --> 00:38:38,790 If you are on something like Facebook, or Amazon, you have 890 00:38:38,790 --> 00:38:41,530 a lot more users to deal with. 891 00:38:41,530 --> 00:38:49,970 If you are, what's a good one with a lot of data? 892 00:38:52,560 --> 00:38:53,510 Google Earth, or something like that. 893 00:38:53,510 --> 00:38:55,040 You have a lot of data. 894 00:38:55,040 --> 00:38:57,050 And you have to deal with all the data, and that's a good 895 00:38:57,050 --> 00:39:00,040 way to do the scale up. 896 00:39:00,040 --> 00:39:02,370 Or you might have huge amount of processing you want to do, 897 00:39:02,370 --> 00:39:05,970 for example, the one place a lot of data and processing is 898 00:39:05,970 --> 00:39:15,230 things like these new basically telescopes that's 899 00:39:15,230 --> 00:39:17,630 coming about, that has arrays of hundreds of different 900 00:39:17,630 --> 00:39:19,880 things, so you have huge amount of data coming from the 901 00:39:19,880 --> 00:39:20,530 telescopes. 902 00:39:20,530 --> 00:39:23,130 And then you could do a huge amount of processing and that. 
903 00:39:23,130 --> 00:39:27,140 So that, basically, has both data and processing, and 904 00:39:27,140 --> 00:39:30,450 things like the web, social networks, and stuff like that, 905 00:39:30,450 --> 00:39:32,070 give you a lot of data. 906 00:39:32,070 --> 00:39:34,860 So here are some examples of some things like the airline 907 00:39:34,860 --> 00:39:36,050 reservation system. 908 00:39:36,050 --> 00:39:38,200 It's something, all the airlines have to assign 909 00:39:38,200 --> 00:39:41,770 millions of planes, flights, millions of seats 910 00:39:41,770 --> 00:39:42,790 that you deal with. 911 00:39:42,790 --> 00:39:45,850 Things like a stock trading system where all the trades have 912 00:39:45,850 --> 00:39:50,150 to come there, and the prices have to get calculated, 913 00:39:50,150 --> 00:39:52,030 and then trades have to get validated. 914 00:39:52,030 --> 00:39:55,500 And, very big analysis, so you form some kind of global 915 00:39:55,500 --> 00:39:57,250 understanding of what's going on. 916 00:39:57,250 --> 00:40:00,290 And I'm going to talk about these two, three things too. 917 00:40:00,290 --> 00:40:02,970 Scene completion and web search, which probably 918 00:40:02,970 --> 00:40:04,260 everybody knows. 919 00:40:04,260 --> 00:40:09,220 So, yes, this kind of data, now, kind of a 920 00:40:09,220 --> 00:40:10,880 web analysis example. 921 00:40:10,880 --> 00:40:16,000 So what these guys were trying to do was, every week, crawl 922 00:40:16,000 --> 00:40:21,250 151 million web pages, and get about a terabyte of 923 00:40:21,250 --> 00:40:24,960 information, and analyze page statistics. 924 00:40:24,960 --> 00:40:25,840 So that's what they are trying to do. 925 00:40:25,840 --> 00:40:28,430 So come up with some idea about, OK, what 926 00:40:28,430 --> 00:40:29,310 is happening in the world? 927 00:40:29,310 --> 00:40:31,540 How did the pages change in the last week? 928 00:40:31,540 --> 00:40:34,620 And then try to get a global view of that.
929 00:40:34,620 --> 00:40:37,990 At this point, you have both huge amount of data and pretty 930 00:40:37,990 --> 00:40:40,050 large amount of computation power, that you had to build a 931 00:40:40,050 --> 00:40:41,100 system to do that. 932 00:40:41,100 --> 00:40:43,120 This is where you need a larger system. 933 00:40:43,120 --> 00:40:47,670 Here's another interesting system that people built. 934 00:40:47,670 --> 00:40:51,660 So if you have image here, and the image has this nice 935 00:40:51,660 --> 00:40:53,960 background, there's unfortunate house 936 00:40:53,960 --> 00:40:55,190 sitting in the front. 937 00:40:55,190 --> 00:40:58,710 So what this says is, OK, eliminate the house, search a 938 00:40:58,710 --> 00:41:04,180 very large database to find similar images and plop 939 00:41:04,180 --> 00:41:05,430 something in there. 940 00:41:07,480 --> 00:41:08,650 OK. 941 00:41:08,650 --> 00:41:09,780 So OK. 942 00:41:09,780 --> 00:41:15,170 You can get your face with some nice actual eyes, or 943 00:41:15,170 --> 00:41:15,730 something like that. 944 00:41:15,730 --> 00:41:18,420 Just eliminate all the bad parts, and then and then get 945 00:41:18,420 --> 00:41:20,810 good parts and put them in there. 946 00:41:20,810 --> 00:41:25,780 And so this one, basically, what they'll do was, that's 947 00:41:25,780 --> 00:41:30,780 about 396 gigabytes of images out there. 948 00:41:30,780 --> 00:41:33,640 And so we had to classify images to get the scene 949 00:41:33,640 --> 00:41:37,500 detector, do color similarity, and do context matching. 950 00:41:37,500 --> 00:41:41,470 So computation, what they're doing is about 50 minutes 951 00:41:41,470 --> 00:41:45,300 doing scene matching, 20 minutes of local matching 952 00:41:45,300 --> 00:41:49,340 trying to find right matching, and four minutes composing 953 00:41:49,340 --> 00:41:52,020 there, and then you can parallelize that and reduce 954 00:41:52,020 --> 00:41:54,160 this time to about five minutes. 
955 00:41:54,160 --> 00:41:55,890 So here's something that's a huge amount of data. 956 00:41:55,890 --> 00:41:58,470 You look at a lot of things, you do a lot of processing to 957 00:41:58,470 --> 00:42:03,640 figure out that you get the right thing, and these actually keep 958 00:42:03,640 --> 00:42:06,880 increasing, these images, as we keep asking for more, more 959 00:42:06,880 --> 00:42:09,270 flexibility, and more accuracy. 960 00:42:09,270 --> 00:42:11,040 Things can get higher and higher. 961 00:42:11,040 --> 00:42:14,740 So a really cool application that really requires large data 962 00:42:14,740 --> 00:42:15,970 and large processing. 963 00:42:15,970 --> 00:42:18,820 So, of course, the kind of canonical application is 964 00:42:18,820 --> 00:42:19,940 probably Google. 965 00:42:19,940 --> 00:42:22,840 So you do a search, you get some nice results. 966 00:42:22,840 --> 00:42:25,460 So what people say is there are two thousand processors 967 00:42:25,460 --> 00:42:28,710 involved in getting this query for you. 968 00:42:28,710 --> 00:42:33,070 It takes 200 plus terabytes of data, but this is already old 969 00:42:33,070 --> 00:42:35,730 now, this could be even higher now. 970 00:42:35,730 --> 00:42:38,680 And this takes ten to the ten total clock cycles for 971 00:42:38,680 --> 00:42:40,730 everything that needs to happen, for you 972 00:42:40,730 --> 00:42:43,460 to get to your query. 973 00:42:43,460 --> 00:42:45,430 And you only get one cent for the query. 974 00:42:45,430 --> 00:42:47,870 So not only are you doing it fast, you are doing a lot of 975 00:42:47,870 --> 00:42:49,820 processing, you are doing it very cheap. 976 00:42:49,820 --> 00:42:53,910 And I think one of the biggest things that Google did is 977 00:42:53,910 --> 00:42:57,190 figure out how to get that done fast and cheap. 978 00:42:57,190 --> 00:43:02,010 And that's why they are so successful. 979 00:43:02,010 --> 00:43:03,670 Oops, sorry, I didn't say this.
so it's one second 980 00:43:03,670 --> 00:43:06,740 response time, and it's cheap, $0.05 average 981 00:43:06,740 --> 00:43:08,750 cost, basically. 982 00:43:08,750 --> 00:43:11,740 If the compute time is going to cost more than $0.05, 983 00:43:11,740 --> 00:43:12,850 it's not worth it. 984 00:43:12,850 --> 00:43:15,090 And you have to do it that way. 985 00:43:15,090 --> 00:43:19,000 So Google spent a lot of time figuring out 986 00:43:19,000 --> 00:43:20,250 how to do this cheaply. 987 00:43:23,110 --> 00:43:27,010 So, this is already validated, but Google is very secretive 988 00:43:27,010 --> 00:43:30,720 of what they do, so this is the closest I can figure out. 989 00:43:30,720 --> 00:43:35,590 They have three million plus processors in clusters of 2000 990 00:43:35,590 --> 00:43:37,660 plus processors in each cluster. 991 00:43:37,660 --> 00:43:39,610 And what they did was they went for 992 00:43:39,610 --> 00:43:41,070 the cheapest thing. 993 00:43:41,070 --> 00:43:43,940 They built the entire system out of the cheapest 994 00:43:43,940 --> 00:43:45,095 parts they can get. 995 00:43:45,095 --> 00:43:48,040 x86 processors, the cheapest disks, fairly cheap 996 00:43:48,040 --> 00:43:51,670 communication, and they gain reliability and 997 00:43:51,670 --> 00:43:53,340 redundancy through software. 998 00:43:53,340 --> 00:43:56,940 So each part, I mean, supposedly in Google, in a data center, 999 00:43:56,940 --> 00:43:59,220 there's somebody who's constantly going around and 1000 00:43:59,220 --> 00:44:00,980 changing machines and changing disks. 1001 00:44:00,980 --> 00:44:03,700 Because there's so much failure. 1002 00:44:03,700 --> 00:44:06,800 But that means they have to have the software system to keep 1003 00:44:06,800 --> 00:44:08,880 the things running in there.
1004 00:44:08,880 --> 00:44:11,520 And what they have is a partitioned workload, all 1005 00:44:11,520 --> 00:44:13,580 those things are nicely partitioned and distributed. 1006 00:44:13,580 --> 00:44:18,110 Google has this nice file system and stuff to do that. 1007 00:44:18,110 --> 00:44:20,160 And then you have to do crawling, index generation, 1008 00:44:20,160 --> 00:44:22,850 index search, document retrieval, ad placement, all 1009 00:44:22,850 --> 00:44:24,350 those things happen in there. 1010 00:44:24,350 --> 00:44:27,210 Of course, others like Microsoft and Yahoo, and all 1011 00:44:27,210 --> 00:44:29,690 those other people have systems like that. 1012 00:44:29,690 --> 00:44:34,030 So this is kind of what, when you go in here to scale up, 1013 00:44:34,030 --> 00:44:35,940 there's no other way, you have to actually build this huge 1014 00:44:35,940 --> 00:44:37,490 system to do that. 1015 00:44:37,490 --> 00:44:40,575 So one thing Google does, going a little bit technical, 1016 00:44:40,575 --> 00:44:43,040 is they have a system called MapReduce. 1017 00:44:43,040 --> 00:44:45,150 How many of you have seen, heard of MapReduce? 1018 00:44:45,150 --> 00:44:45,610 OK. 1019 00:44:45,610 --> 00:44:47,630 So there are all these people who know MapReduce. 1020 00:44:47,630 --> 00:44:49,870 Probably more than I do. 1021 00:44:49,870 --> 00:44:53,980 So the idea there is you have a bunch of data, a huge amount 1022 00:44:53,980 --> 00:44:56,660 of data in here. 1023 00:44:56,660 --> 00:44:59,220 And, normally, what you have to do is find some 1024 00:44:59,220 --> 00:45:01,640 similarities in a lot of data, and do some 1025 00:45:01,640 --> 00:45:02,780 processing for that. 1026 00:45:02,780 --> 00:45:08,750 And this is a programming model set up nicely to help do that. 1027 00:45:08,750 --> 00:45:11,800 So this borrows a lot from functional programming.
1028 00:45:11,800 --> 00:45:15,710 What that means is I'm not changing data, I'm always 1029 00:45:15,710 --> 00:45:18,730 taking some data values and creating something new. 1030 00:45:18,730 --> 00:45:21,460 I'm never changing something existing, that's basically 1031 00:45:21,460 --> 00:45:23,220 the meaning of a functional program. 1032 00:45:23,220 --> 00:45:24,900 So MapReduce has two components. 1033 00:45:24,900 --> 00:45:26,240 First the map. 1034 00:45:26,240 --> 00:45:33,000 That means given some input value and a key in there, what 1035 00:45:33,000 --> 00:45:36,690 you generate is some intermediate results and 1036 00:45:36,690 --> 00:45:39,670 an output key. 1037 00:45:39,670 --> 00:45:43,180 You get a bunch of values coming through, and you process 1038 00:45:43,180 --> 00:45:43,930 each one separately. 1039 00:45:43,930 --> 00:45:44,370 And say, OK. 1040 00:45:44,370 --> 00:45:46,370 So here is the output key, and here's some 1041 00:45:46,370 --> 00:45:47,550 intermediate value. 1042 00:45:47,550 --> 00:45:51,700 And then what you do is things with the same output key get 1043 00:45:51,700 --> 00:45:54,010 sorted into one list. 1044 00:45:54,010 --> 00:45:56,140 And then you're going to reduce it. 1045 00:45:56,140 --> 00:45:59,480 And the reducer takes the output key and this list, and 1046 00:45:59,480 --> 00:46:02,420 says, OK, look, I'm going to process the entire list down 1047 00:46:02,420 --> 00:46:06,110 to one element or small data item. 1048 00:46:06,110 --> 00:46:06,700 OK? 1049 00:46:06,700 --> 00:46:09,140 So let's go through a little bit more, 1050 00:46:09,140 --> 00:46:10,550 digging deep into that. 1051 00:46:10,550 --> 00:46:18,200 And so your map basically gets a huge amount of records from 1052 00:46:18,200 --> 00:46:25,360 the data source, and it feeds into this map function, and it 1053 00:46:25,360 --> 00:46:28,250 produces intermediate results.
1054 00:46:28,250 --> 00:46:31,250 And the reduce function, basically, combines the data, 1055 00:46:31,250 --> 00:46:35,060 and all the folding-- 1056 00:46:35,060 --> 00:46:36,030 let me give you an example. 1057 00:46:36,030 --> 00:46:37,190 I think that will show you better. 1058 00:46:37,190 --> 00:46:38,730 So here is the kind of architecture. 1059 00:46:38,730 --> 00:46:40,590 So you have a huge amount of data sources. 1060 00:46:40,590 --> 00:46:43,380 You have many, many sources in here. 1061 00:46:43,380 --> 00:46:46,010 And each of the data comes in to that, and the map will 1062 00:46:46,010 --> 00:46:48,430 basically distribute by keys and values, so there could be 1063 00:46:48,430 --> 00:46:49,960 millions of values. 1064 00:46:49,960 --> 00:46:53,480 And then, what you have to do is wait until all the data 1065 00:46:53,480 --> 00:46:54,770 has done that. 1066 00:46:54,770 --> 00:46:57,560 And then create, for the number of keys 1067 00:46:57,560 --> 00:46:59,620 here, that number of reducers. 1068 00:46:59,620 --> 00:47:01,360 So hopefully you will have a lot of keys. 1069 00:47:01,360 --> 00:47:03,170 If you have only two keys, you don't get that 1070 00:47:03,170 --> 00:47:05,940 parallelism because then you would have two huge lists. 1071 00:47:05,940 --> 00:47:09,780 And then, again, what happens is these keys get paired to 1072 00:47:09,780 --> 00:47:13,660 reducers to come up with the final value in here. 1073 00:47:13,660 --> 00:47:15,100 So what's the parallelism here? 1074 00:47:15,100 --> 00:47:17,780 What makes the parallelism go high? 1075 00:47:17,780 --> 00:47:20,090 Or, not have enough parallelism? 1076 00:47:23,290 --> 00:47:25,810 Yeah, I mean, first of all, you need to have enough, 1077 00:47:25,810 --> 00:47:28,670 hopefully, multiple data stores so you get a lot of 1078 00:47:28,670 --> 00:47:29,710 parallelism coming in here. 1079 00:47:29,710 --> 00:47:33,660 Map is easily parallelizable, because each one is processed on its own in here.
1080 00:47:33,660 --> 00:47:35,820 The reducer is the problem one, I think. 1081 00:47:35,820 --> 00:47:38,280 Because if you have too many keys, or too few keys, you are 1082 00:47:38,280 --> 00:47:39,820 in trouble. 1083 00:47:39,820 --> 00:47:42,000 The other interesting thing in here is there's a big 1084 00:47:42,000 --> 00:47:43,990 shuffling between here. 1085 00:47:43,990 --> 00:47:46,470 So that means data has to go all over the place. 1086 00:47:46,470 --> 00:47:49,590 So it's not something where you got data and you process 1087 00:47:49,590 --> 00:47:52,040 to the end, you got data, you process to the end. Every data has 1088 00:47:52,040 --> 00:47:54,210 to kind of cross back, and so that's a huge amount of 1089 00:47:54,210 --> 00:47:56,360 communication in here that could be a bottleneck too. 1090 00:47:56,360 --> 00:48:00,200 That can be a bottleneck, keys can be a bottleneck. 1091 00:48:00,200 --> 00:48:03,160 So map functions run in parallel, creating different things. 1092 00:48:03,160 --> 00:48:06,040 Reduce functions also run in parallel, one for each key. 1093 00:48:06,040 --> 00:48:08,470 And all values are basically processed independently 1094 00:48:08,470 --> 00:48:10,860 because of that. 1095 00:48:10,860 --> 00:48:13,640 Also, the bottleneck is the reduce phase can't start until 1096 00:48:13,640 --> 00:48:15,360 all the map is done, and also all the data gets shuffled 1097 00:48:15,360 --> 00:48:16,830 around with that. 1098 00:48:16,830 --> 00:48:18,050 So here's an interesting example. 1099 00:48:18,050 --> 00:48:20,630 What I am trying to do is I am trying to count the number of 1100 00:48:20,630 --> 00:48:25,450 words in, assume, a huge number of web pages. 1101 00:48:25,450 --> 00:48:30,800 So what I can do is in the map, I get each page in here-- 1102 00:48:30,800 --> 00:48:35,660 I go through the page emitting each word as my key 1103 00:48:35,660 --> 00:48:37,230 and the count as one. 1104 00:48:37,230 --> 00:48:40,020 Because I only get one thing.
1105 00:48:40,020 --> 00:48:42,930 And then my reducer is basically-- 1106 00:48:42,930 --> 00:48:44,770 my key is each word. 1107 00:48:44,770 --> 00:48:48,050 So if I have a million words, I can have a million reducers. 1108 00:48:48,050 --> 00:48:50,580 And the reducer basically takes all those things-- it's 1109 00:48:50,580 --> 00:48:52,250 not that fun because everything is 1110 00:48:52,250 --> 00:48:53,290 all the number one. 1111 00:48:53,290 --> 00:48:54,550 Because we emit a count of one. 1112 00:48:54,550 --> 00:48:58,410 And then basically keep adding up how many things for each 1113 00:48:58,410 --> 00:49:01,200 word came about and put out the results. 1114 00:49:01,200 --> 00:49:04,310 So you can say, OK, look, for the entire corpus of data, I 1115 00:49:04,310 --> 00:49:08,270 had this many word occurrences, 1116 00:49:08,270 --> 00:49:09,990 this many for each word. 1117 00:49:09,990 --> 00:49:12,160 You get a word count. 1118 00:49:12,160 --> 00:49:16,960 So basically you are trying to create a histogram here and 1119 00:49:16,960 --> 00:49:20,500 MapReduce provides a very nice interface to do that. 1120 00:49:20,500 --> 00:49:24,560 And it's very nice, high level, and it provides this 1121 00:49:24,560 --> 00:49:32,200 nice infrastructure to run this in parallel in here and 1122 00:49:32,200 --> 00:49:37,230 do all the communication necessary, figure out how many 1123 00:49:37,230 --> 00:49:40,040 reducers to run, look for machines to run them, produce 1124 00:49:40,040 --> 00:49:41,780 the result, and give you the result. 1125 00:49:41,780 --> 00:49:43,372 So this is a nice infrastructure Google has 1126 00:49:43,372 --> 00:49:44,622 built in there. 1127 00:49:48,470 --> 00:49:51,390 So in this level, when you go to this-- 1128 00:49:51,390 --> 00:49:53,130 this is the data center level. 1129 00:49:53,130 --> 00:49:55,790 What do you have to do to scale? 1130 00:49:55,790 --> 00:49:58,030 You need to distribute data.
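The word-count example described above can be sketched in a few lines of single-process Python. This is a minimal sketch only: it is not Google's MapReduce API, the function names (`map_page`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical, and everything runs sequentially, whereas a real framework would run the map and reduce calls in parallel across machines and do the shuffle over the network.

```python
from collections import defaultdict


def map_page(page_text):
    """Map: go through one page and emit (word, 1) for every word."""
    for word in page_text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: collect all intermediate values with the same output key
    into one list. This is the step that moves data all over the place."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce: fold the whole list for one key down to a single value."""
    return word, sum(counts)


def word_count(pages):
    # Every page could be mapped independently (the easy parallelism).
    intermediate = [pair for page in pages for pair in map_page(page)]
    # Reduce cannot start until all maps are done and the data is grouped.
    grouped = shuffle(intermediate)
    # One reduce call per key; each could run on a different machine.
    return dict(reduce_counts(word, counts) for word, counts in grouped.items())


pages = ["the cat sat", "the dog sat down"]
print(word_count(pages))
```

The structure mirrors the bottlenecks mentioned in the lecture: `shuffle` needs every mapper's output before any reducer can run, and the number of distinct keys bounds how many reducers can work at once.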
1131 00:49:58,030 --> 00:50:01,290 And you need to parallelize because if all the data is in 1132 00:50:01,290 --> 00:50:02,730 one machine, it doesn't help. 1133 00:50:02,730 --> 00:50:04,580 And you need to have parallelism to scale 1134 00:50:04,580 --> 00:50:06,150 everything. 1135 00:50:06,150 --> 00:50:10,040 Another interesting thing you can do is approximate. 1136 00:50:10,040 --> 00:50:15,090 So what that means is normally when you calculate, when 1137 00:50:15,090 --> 00:50:19,260 everybody has exactly the same data all the time-- because 1138 00:50:19,260 --> 00:50:21,520 when you write the memory, everybody sees that memory-- 1139 00:50:21,520 --> 00:50:24,140 you have the perfect knowledge of the world. 1140 00:50:24,140 --> 00:50:26,810 And in a distributed system, getting perfect knowledge is 1141 00:50:26,810 --> 00:50:27,340 very expensive. 1142 00:50:27,340 --> 00:50:29,425 That means every time something changes, you have to 1143 00:50:29,425 --> 00:50:31,130 send everybody that data. 1144 00:50:31,130 --> 00:50:35,020 And one way that people really make these systems run fast is 1145 00:50:35,020 --> 00:50:37,145 you say, wait a minute, if somebody doesn't have the 1146 00:50:37,145 --> 00:50:39,930 perfect knowledge, if there's a little bit of discrepancy 1147 00:50:39,930 --> 00:50:43,860 between something, I am OK. 1148 00:50:43,860 --> 00:50:45,772 Assume you have a new-- 1149 00:50:45,772 --> 00:50:48,970 you changed your-- 1150 00:50:48,970 --> 00:50:49,720 we'll say-- 1151 00:50:49,720 --> 00:50:53,320 web page and added a couple of new words in there. 1152 00:50:53,320 --> 00:50:55,970 Next second, the search doesn't see it. 1153 00:50:55,970 --> 00:50:57,630 Nobody's going to complain. 1154 00:50:57,630 --> 00:51:00,980 And then you can deliver that. 1155 00:51:00,980 --> 00:51:02,690 If you do a search, you'll find something.
1156 00:51:02,690 --> 00:51:05,030 But somebody else doesn't find it because that data hasn't 1157 00:51:05,030 --> 00:51:07,350 propagated to both of the things. 1158 00:51:07,350 --> 00:51:09,270 Nobody's going to complain and say, wait a minute, I found 1159 00:51:09,270 --> 00:51:11,210 it, but he didn't. 1160 00:51:11,210 --> 00:51:13,140 It can have a little bit of a lag. 1161 00:51:13,140 --> 00:51:15,900 And that can be really, really useful in 1162 00:51:15,900 --> 00:51:16,880 these kinds of systems. 1163 00:51:16,880 --> 00:51:19,492 Because every time something happens, you don't have to 1164 00:51:19,492 --> 00:51:20,830 keep updating. 1165 00:51:20,830 --> 00:51:23,960 But tell me a system where you can't actually do that. 1166 00:51:26,550 --> 00:51:31,972 Play it a little bit fast and loose. 1167 00:51:31,972 --> 00:51:33,210 AUDIENCE: Stock trading. 1168 00:51:33,210 --> 00:51:33,560 PROFESSOR: Stock trading. 1169 00:51:33,560 --> 00:51:36,940 Yeah, that's something basically, if you say, yeah, 1170 00:51:36,940 --> 00:51:38,910 you might get it too, you might get it-- 1171 00:51:38,910 --> 00:51:40,680 and that doesn't work. 1172 00:51:40,680 --> 00:51:44,130 Basically, stock trading has this very particular thing 1173 00:51:44,130 --> 00:51:50,195 because when you submit, we'll say, a sell order, 1174 00:51:50,195 --> 00:51:53,410 within a certain amount of time, it has to get matched up 1175 00:51:53,410 --> 00:51:56,860 and has to be announced to both people. 1176 00:51:56,860 --> 00:51:59,800 And also, there are a lot of other constraints like if the 1177 00:51:59,800 --> 00:52:01,600 machine goes down. 1178 00:52:01,600 --> 00:52:04,760 For a trade, either everybody saw the trade or 1179 00:52:04,760 --> 00:52:06,280 nobody saw it. 1180 00:52:06,280 --> 00:52:08,050 You can't say-- somebody says, I sold it. 1181 00:52:08,050 --> 00:52:11,520 And another guy says, no, I didn't buy it.
1182 00:52:11,520 --> 00:52:13,460 And when you have millions of billions of dollars back and 1183 00:52:13,460 --> 00:52:15,400 forth, that doesn't really work. 1184 00:52:15,400 --> 00:52:20,250 So for that, there's this thing called transactions. 1185 00:52:20,250 --> 00:52:23,050 So transactions is an interesting way-- a lot of 1186 00:52:23,050 --> 00:52:24,450 databases have this transaction. 1187 00:52:24,450 --> 00:52:26,050 Transactions say, look, I am doing this 1188 00:52:26,050 --> 00:52:27,300 very complicated thing. 1189 00:52:31,500 --> 00:52:34,940 And I cannot have this intermediate state going on. 1190 00:52:34,940 --> 00:52:38,720 So what transactions say is, first, tell me everything I 1191 00:52:38,720 --> 00:52:40,280 want to do in the transaction. 1192 00:52:40,280 --> 00:52:42,400 So it might be I want to sell a stock, I want to buy a 1193 00:52:42,400 --> 00:52:43,660 stock, whatever. 1194 00:52:43,660 --> 00:52:48,250 And then at some point when you commit the transaction, 1195 00:52:48,250 --> 00:52:51,250 you either say, OK, everything worked, good, the entire thing 1196 00:52:51,250 --> 00:52:52,030 gets committed. 1197 00:52:52,030 --> 00:52:52,960 And you are done. 1198 00:52:52,960 --> 00:52:57,170 Or it can explicitly reject it. 1199 00:52:57,170 --> 00:52:59,210 It can come back and say, look, I can't do this 1200 00:52:59,210 --> 00:52:59,580 transaction. 1201 00:52:59,580 --> 00:53:01,230 Now sorry, you can restart it. 1202 00:53:01,230 --> 00:53:02,960 So you can accept and reject. 1203 00:53:02,960 --> 00:53:06,690 But then the nice thing about that is then every one 1204 00:53:06,690 --> 00:53:08,020 single-- it's like atomicity. 1205 00:53:08,020 --> 00:53:12,210 Every one action doesn't have to happen immediately or 1206 00:53:12,210 --> 00:53:12,960 happen as a group. 1207 00:53:12,960 --> 00:53:14,590 You can say, OK, I'm doing a bunch of action in the 1208 00:53:14,590 --> 00:53:15,150 transaction. 
1209 00:53:15,150 --> 00:53:19,400 And then finally, I can come in and if it works, great. 1210 00:53:19,400 --> 00:53:21,950 So what might be a reason you might not be able to commit a 1211 00:53:21,950 --> 00:53:23,245 transaction if you do a transaction? 1212 00:53:30,695 --> 00:53:32,270 Anybody else want to answer? 1213 00:53:32,270 --> 00:53:34,140 When you say, I want a transaction, I want to commit 1214 00:53:34,140 --> 00:53:35,710 something, what might say-- 1215 00:53:38,250 --> 00:53:39,450 so here's an interesting thing. 1216 00:53:39,450 --> 00:53:42,220 In the stock trading type world-- 1217 00:53:42,220 --> 00:53:43,990 so assume I want to sell something. 1218 00:53:43,990 --> 00:53:45,590 Let's look at the airline reservation. 1219 00:53:45,590 --> 00:53:48,690 So assume I have an airline seat in here. 1220 00:53:48,690 --> 00:53:54,280 And if two people want to try to resell that seat, I can do 1221 00:53:54,280 --> 00:53:57,380 all this processing in parallel for everybody until I 1222 00:53:57,380 --> 00:53:59,750 come to the commit point. 1223 00:53:59,750 --> 00:54:02,230 That means I can look at everybody after you enter the 1224 00:54:02,230 --> 00:54:06,580 data, do the price, all those things separately 1225 00:54:06,580 --> 00:54:08,460 for the same seat. 1226 00:54:08,460 --> 00:54:10,860 But then when you come to the commit point, you say, can I 1227 00:54:10,860 --> 00:54:11,750 commit the transaction? 1228 00:54:11,750 --> 00:54:14,310 At that point, only at that point, they have to figure out 1229 00:54:14,310 --> 00:54:16,590 whether there's a conflict in here. 1230 00:54:16,590 --> 00:54:17,770 And at some point, if there's a conflict, 1231 00:54:17,770 --> 00:54:19,310 it says, oops, can't. 1232 00:54:19,310 --> 00:54:21,110 One transaction has to get aborted. 1233 00:54:21,110 --> 00:54:23,675 The nice thing about that is most of the time people are 1234 00:54:23,675 --> 00:54:25,000 not going to fight for the same seat. 
1235 00:54:25,000 --> 00:54:27,380 And then things can proceed in parallel. 1236 00:54:27,380 --> 00:54:28,310 You don't have to wait. 1237 00:54:28,310 --> 00:54:31,050 Otherwise, if you do that, there might only one seat 1238 00:54:31,050 --> 00:54:32,330 assignment at a time you can do. 1239 00:54:32,330 --> 00:54:34,120 And that's really not going to scale. 1240 00:54:34,120 --> 00:54:35,750 So everybody tried to get their seat. 1241 00:54:35,750 --> 00:54:37,590 They go to the end, and they say, can I proceed? 1242 00:54:37,590 --> 00:54:40,090 And at that point, you check whether there's a conflict. 1243 00:54:40,090 --> 00:54:42,130 And if there's a conflict, one guy backs out. 1244 00:54:42,130 --> 00:54:43,435 So that's the transaction. 1245 00:54:43,435 --> 00:54:45,610 Oops, I'm going to get rebooted I guess. 1246 00:54:49,400 --> 00:54:56,050 So when you go to planet scale, you can get even into 1247 00:54:56,050 --> 00:54:57,540 more issues, things like-- 1248 00:54:57,540 --> 00:54:59,550 what could be a planet scale thing out there? 1249 00:55:05,190 --> 00:55:06,870 What's an interesting planet scale thing that 1250 00:55:06,870 --> 00:55:08,120 you can think of? 1251 00:55:10,690 --> 00:55:14,240 Single computation that has to happen in the planet scale. 1252 00:55:21,300 --> 00:55:23,920 Something like Internet naming system. 1253 00:55:23,920 --> 00:55:25,820 It has to work everywhere in the entire planet. 1254 00:55:25,820 --> 00:55:27,640 Or something like Internet routing. 1255 00:55:27,640 --> 00:55:30,050 There has to be an algorithm that has to work. 1256 00:55:30,050 --> 00:55:33,260 The entire world has to cooperate and then make sure 1257 00:55:33,260 --> 00:55:35,730 that all the traffic actually goes to the right place. 1258 00:55:35,730 --> 00:55:37,880 So there's a lot more issues, interesting 1259 00:55:37,880 --> 00:55:40,610 things show up in here. 
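The seat-reservation commit check described a moment ago, where everybody does their work in parallel and a conflict is only detected at the commit point, can be sketched as follows. This is a single-process sketch under stated assumptions: the `SeatMap` class, its per-seat version counter, and the method names are hypothetical illustrations of one common way to detect conflicts (optimistic concurrency), not how any particular airline system or database implements transactions. A real system would also have to make the check-and-write inside `commit` atomic.

```python
class SeatMap:
    """Toy seat store with optimistic, commit-time conflict detection."""

    def __init__(self, seats):
        self._owner = {seat: None for seat in seats}
        self._version = {seat: 0 for seat in seats}

    def begin(self, seat):
        """Start a transaction: remember which version of the seat we saw.
        No lock is taken, so many transactions can proceed in parallel."""
        return {"seat": seat, "seen_version": self._version[seat]}

    def commit(self, txn, passenger):
        """Commit point: accept only if nobody committed this seat since
        `begin`. Returns True on commit, False if the transaction is rejected."""
        seat = txn["seat"]
        if self._version[seat] != txn["seen_version"]:
            return False  # conflict: this transaction has to back out
        self._owner[seat] = passenger
        self._version[seat] += 1
        return True

    def owner(self, seat):
        return self._owner[seat]


seats = SeatMap(["12A", "12B"])
t_alice = seats.begin("12A")
t_bob = seats.begin("12A")    # same seat: will conflict at commit time
t_carol = seats.begin("12B")  # different seat: no conflict

print(seats.commit(t_alice, "alice"))  # first committer wins
print(seats.commit(t_bob, "bob"))      # rejected, bob can restart
print(seats.commit(t_carol, "carol"))  # unaffected by the 12A conflict
```

The design choice matches the argument in the lecture: since most passengers are not fighting for the same seat, checking only at commit lets the common case run without waiting, and only the rare conflicting transaction pays the cost of a restart.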
1260 00:55:40,610 --> 00:55:44,850 So things like Seti@Home type stuff-- 1261 00:55:44,850 --> 00:55:48,290 these are a little bit dated these days, that happens-- 1262 00:55:48,290 --> 00:55:50,160 distributed all across the place. 1263 00:55:50,160 --> 00:55:55,520 So if you do planet scale, it has to be truly distributed. 1264 00:55:55,520 --> 00:55:57,780 There cannot be any global operations, no single 1265 00:55:57,780 --> 00:55:58,950 bottleneck. 1266 00:55:58,950 --> 00:56:00,795 And you have to have a distributed 1267 00:56:00,795 --> 00:56:02,510 view with stale data. 1268 00:56:02,510 --> 00:56:04,580 You cannot say, look, everybody has to 1269 00:56:04,580 --> 00:56:05,850 have the same data. 1270 00:56:05,850 --> 00:56:07,870 You have to have everything distributed. 1271 00:56:07,870 --> 00:56:10,880 And it has to adapt to load distributions because things 1272 00:56:10,880 --> 00:56:13,800 can keep changing in there. 1273 00:56:13,800 --> 00:56:17,760 So what I'm going to do next is try to give you a little 1274 00:56:17,760 --> 00:56:22,530 bit of a case study that shows you some interesting 1275 00:56:22,530 --> 00:56:24,580 properties that show up when you start 1276 00:56:24,580 --> 00:56:26,230 building at that scale. 1277 00:56:26,230 --> 00:56:30,590 And this has some planet scale type properties, some cluster 1278 00:56:30,590 --> 00:56:31,820 properties, whatever. 1279 00:56:31,820 --> 00:56:33,750 And I will probably first describe this interesting 1280 00:56:33,750 --> 00:56:37,810 problem and then show what kind of solutions came 1281 00:56:37,810 --> 00:56:41,740 through, so as to give you a perspective for 1282 00:56:41,740 --> 00:56:43,630 a problem in here. 1283 00:56:43,630 --> 00:56:45,860 Any questions so far on distributed systems? 1284 00:56:48,490 --> 00:56:51,020 It's hard to do distributed systems in one lecture. 1285 00:56:51,020 --> 00:56:52,860 There are whole classes for distributed systems.
1286 00:56:52,860 --> 00:56:58,890 But this will give you a feel for some of it. 1287 00:56:58,890 --> 00:57:03,180 So the case study here is from VMware. 1288 00:57:03,180 --> 00:57:07,500 It's called deduplication at global scale. 1289 00:57:07,500 --> 00:57:11,010 And the problem shows up when you're trying to move virtual 1290 00:57:11,010 --> 00:57:13,330 machines across the world. 1291 00:57:13,330 --> 00:57:14,740 You have this virtual machine. 1292 00:57:14,740 --> 00:57:18,310 So what virtualization did was it took a piece of hardware 1293 00:57:18,310 --> 00:57:20,690 and it converted it into a file. 1294 00:57:20,690 --> 00:57:23,460 So each machine is now a file. 1295 00:57:23,460 --> 00:57:25,290 When you have a file instead of hardware, there are a lot of 1296 00:57:25,290 --> 00:57:27,390 cool things you can do then. 1297 00:57:27,390 --> 00:57:28,880 You can replicate those files. 1298 00:57:28,880 --> 00:57:31,640 So suddenly, instead of one machine, you've got tens or 1299 00:57:31,640 --> 00:57:32,930 hundreds of machines. 1300 00:57:32,930 --> 00:57:34,600 You can move those things in here. 1301 00:57:34,600 --> 00:57:38,340 And of course, you can start other machines all over. 1302 00:57:38,340 --> 00:57:40,970 So once you are able to move these things, the issue 1303 00:57:40,970 --> 00:57:43,150 becomes how to move those things around and what's the 1304 00:57:43,150 --> 00:57:44,400 cost of moving something. 1305 00:57:46,830 --> 00:57:48,360 And also, you can store it, store 1306 00:57:48,360 --> 00:57:50,070 those things in a database. 1307 00:57:54,800 --> 00:57:56,260 The interesting thing that's happening these 1308 00:57:56,260 --> 00:57:58,180 days is cloud computing. 1309 00:57:58,180 --> 00:58:01,485 Cloud means there are all these providers all over the place, 1310 00:58:01,485 --> 00:58:02,970 saying, I have processing power, I can 1311 00:58:02,970 --> 00:58:03,700 give you some of it.
1312 00:58:03,700 --> 00:58:06,550 Amazon does something, EC2, but Verizon is trying to 1313 00:58:06,550 --> 00:58:09,680 do it, everybody is trying to do that. 1314 00:58:09,680 --> 00:58:13,940 So if you want to have the best market, what you want to 1315 00:58:13,940 --> 00:58:16,780 do is have the elasticity to move from cloud to cloud for 1316 00:58:16,780 --> 00:58:18,570 many reasons. 1317 00:58:18,570 --> 00:58:20,600 So sometimes the cloud might be too small. 1318 00:58:20,600 --> 00:58:22,510 You want to get to a bigger cloud in there. 1319 00:58:22,510 --> 00:58:25,120 Or you want to be near the users. 1320 00:58:25,120 --> 00:58:28,350 So in the daytime in the US, you want probably to move the 1321 00:58:28,350 --> 00:58:29,410 machines to the US. 1322 00:58:29,410 --> 00:58:31,930 At night, there might be users in China, so you want to move 1323 00:58:31,930 --> 00:58:32,940 your compute nearer to China. 1324 00:58:32,940 --> 00:58:35,100 Because it will be closer to the people who are using it. 1325 00:58:35,100 --> 00:58:36,960 So something like that, you can move around there. 1326 00:58:36,960 --> 00:58:38,170 Or you want to find the cheaper provider. 1327 00:58:38,170 --> 00:58:40,280 If somebody comes and says, look, I can give you compute 1328 00:58:40,280 --> 00:58:45,210 power $0.10 cheaper than what you are getting, OK, I want to 1329 00:58:45,210 --> 00:58:46,880 move to that guy. 1330 00:58:46,880 --> 00:58:49,890 And also, to amortize the risk of catastrophic failure. 1331 00:58:49,890 --> 00:58:52,710 If there's a hurricane approaching somewhere, I might 1332 00:58:52,710 --> 00:58:54,860 want to move to a data center that might be out of the way. 1333 00:58:54,860 --> 00:58:56,370 And I want to do that. 1334 00:58:56,370 --> 00:59:00,956 And the interesting thing there is a lot of things.
1335 00:59:00,956 --> 00:59:06,860 But when you say, application in the cloud, it's a machine, 1336 00:59:06,860 --> 00:59:09,440 a machine basically in a virtual machine. 1337 00:59:09,440 --> 00:59:11,160 At the same time, a virtual machine has to get moved 1338 00:59:11,160 --> 00:59:13,110 around, not your small application. 1339 00:59:13,110 --> 00:59:15,010 The entire thing has to move around. 1340 00:59:15,010 --> 00:59:17,290 And virtual machines are hefty. 1341 00:59:17,290 --> 00:59:18,630 Because it has an operating system, 1342 00:59:18,630 --> 00:59:21,490 it has all the software. 1343 00:59:21,490 --> 00:59:22,520 There's so many things now. 1344 00:59:22,520 --> 00:59:24,160 Then your data and your state. 1345 00:59:24,160 --> 00:59:26,540 There are a lot of things in the machine in here. 1346 00:59:26,540 --> 00:59:28,360 And all those things have to get moved around, so that can 1347 00:59:28,360 --> 00:59:29,900 be expensive. 1348 00:59:29,900 --> 00:59:33,100 So yes, interesting experiment in here. 1349 00:59:33,100 --> 00:59:36,530 So the idea here is to try to move something from Boston to 1350 00:59:36,530 --> 00:59:38,870 Palo Alto on a 2 megabytes network. 1351 00:59:38,870 --> 00:59:42,320 And there are a bunch of different virtual 1352 00:59:42,320 --> 00:59:44,130 machines in here, VMs. 1353 00:59:44,130 --> 00:59:46,860 And it takes-- 1354 00:59:46,860 --> 00:59:47,270 whatever-- 1355 00:59:47,270 --> 00:59:52,670 3,000 minutes to move these machines from-- 1356 00:59:52,670 --> 00:59:55,050 this is 500 minutes. 1357 00:59:55,050 --> 00:59:56,300 That means what? 1358 01:00:01,050 --> 01:00:02,000 Some hours to move them. 1359 01:00:02,000 --> 01:00:04,820 Some hours to move these machines around. 1360 01:00:04,820 --> 01:00:08,800 And then what you say, look, the machines are heavy, big. 1361 01:00:08,800 --> 01:00:11,010 Why can't you first compress the machine? 
1362 01:00:11,010 --> 01:00:13,450 So you can use something like normal compression. 1363 01:00:13,450 --> 01:00:17,250 So blue is basically compress, move the machine, and 1364 01:00:17,250 --> 01:00:18,810 decompress. 1365 01:00:18,810 --> 01:00:19,950 So here is something interesting. 1366 01:00:19,950 --> 01:00:24,380 This is a very fast compression. 1367 01:00:24,380 --> 01:00:26,380 You did really well, you are moving there. 1368 01:00:26,380 --> 01:00:28,290 If you want a better compression, you say, I'm 1369 01:00:28,290 --> 01:00:31,760 going to do a full, best compression I can do, it's 1370 01:00:31,760 --> 01:00:32,550 actually slower. 1371 01:00:32,550 --> 01:00:35,360 Because the trouble is the compression time is so high, 1372 01:00:35,360 --> 01:00:38,570 the reduction is not usable in here. 1373 01:00:38,570 --> 01:00:40,340 So you try to compress, you spend most of the time 1374 01:00:40,340 --> 01:00:41,220 compressing. 1375 01:00:41,220 --> 01:00:43,390 So actually, this is even slower than just sending 1376 01:00:43,390 --> 01:00:44,370 without compression. 1377 01:00:44,370 --> 01:00:46,310 So compression is important, compression is useful. 1378 01:00:46,310 --> 01:00:48,750 So can you do better than a normal compression in here? 1379 01:00:48,750 --> 01:00:51,070 How can you do better? 1380 01:00:51,070 --> 01:00:54,146 So some key observations in here. 1381 01:00:54,146 --> 01:00:57,410 So a large part of these files are executables. 1382 01:00:57,410 --> 01:00:59,660 You have your Linux kernel, whatever, all those 1383 01:00:59,660 --> 01:01:01,230 executables sitting in there. 1384 01:01:01,230 --> 01:01:07,460 And basically, there's a monoculture in the world. 1385 01:01:07,460 --> 01:01:11,000 There aren't a million different executables. 1386 01:01:11,000 --> 01:01:13,030 You have a Linux kernel, there's only a certain number 1387 01:01:13,030 --> 01:01:14,830 of versions.
1388 01:01:14,830 --> 01:01:17,360 Microsoft XP, there are certain types of versions. 1389 01:01:17,360 --> 01:01:20,335 So even though there are millions of machines, inside 1390 01:01:20,335 --> 01:01:22,270 the millions of machines, there aren't millions of 1391 01:01:22,270 --> 01:01:23,020 different applications. 1392 01:01:23,020 --> 01:01:25,350 There's only hundreds of different applications. 1393 01:01:25,350 --> 01:01:26,420 And so your motion moves. 1394 01:01:26,420 --> 01:01:28,290 If you think about it, you are moving the same thing again 1395 01:01:28,290 --> 01:01:29,170 and again and again. 1396 01:01:29,170 --> 01:01:31,310 Can you take advantage of that? 1397 01:01:31,310 --> 01:01:35,720 And there's even substantial redundancy in each of these. 1398 01:01:35,720 --> 01:01:37,190 So this is very interesting. 1399 01:01:37,190 --> 01:01:40,430 If you have a Windows machine, each DLL has three copies. 1400 01:01:40,430 --> 01:01:43,990 So you have the copy, and then the Installer has a copy. 1401 01:01:43,990 --> 01:01:46,650 And then there's another copy in the next version to 1402 01:01:46,650 --> 01:01:49,050 basically back out, so undo copy. 1403 01:01:49,050 --> 01:01:50,320 So each thing is kept three copies. 1404 01:01:50,320 --> 01:01:52,520 So every big thing, they're seeing 1405 01:01:52,520 --> 01:01:53,410 multiple copies in there. 1406 01:01:53,410 --> 01:01:54,610 So that part is there also. 1407 01:01:54,610 --> 01:01:58,420 Even within a single disk, there is redundancy in here. 1408 01:01:58,420 --> 01:02:03,250 And another interesting thing is many of the disks have a 1409 01:02:03,250 --> 01:02:04,630 large amount of zero pages. 1410 01:02:04,630 --> 01:02:07,230 So if you send something uncompressed, you send a huge 1411 01:02:07,230 --> 01:02:08,030 amount of zeros. 1412 01:02:08,030 --> 01:02:11,120 So you are waiting for zeros to get in there. 
1413 01:02:11,120 --> 01:02:13,875 Even easy compression can get to those zeros, but this is a 1414 01:02:13,875 --> 01:02:15,700 large chunk of data in here. 1415 01:02:15,700 --> 01:02:19,560 And so the interesting thing is if you take one virtual 1416 01:02:19,560 --> 01:02:22,680 machine, this is the number of non-zero blocks. 1417 01:02:22,680 --> 01:02:24,290 And this is the number of unique blocks. 1418 01:02:24,290 --> 01:02:27,150 So unique blocks are smaller than non-zero blocks. 1419 01:02:27,150 --> 01:02:30,510 But if you keep adding more and more virtual machines, 1420 01:02:30,510 --> 01:02:33,520 then of course, the number of total blocks keeps going up. 1421 01:02:33,520 --> 01:02:37,570 But the unique blocks doesn't keep increasing. 1422 01:02:37,570 --> 01:02:42,720 That means the second Linux box you add, there's not much 1423 01:02:42,720 --> 01:02:43,600 new in there. 1424 01:02:43,600 --> 01:02:46,310 So if you look at that, what happens is the first guy has 1425 01:02:46,310 --> 01:02:48,810 about 80% things are unique. 1426 01:02:48,810 --> 01:02:52,140 When you keep adding things, it's about only 30% is unique 1427 01:02:52,140 --> 01:02:52,570 after you add. 1428 01:02:52,570 --> 01:02:53,770 Because it's the same program. 1429 01:02:53,770 --> 01:02:55,700 Only the data is different as you keep adding. 1430 01:02:55,700 --> 01:02:57,630 So can you really take advantage of that? 1431 01:02:57,630 --> 01:03:01,350 So that is where deduplication comes in. 1432 01:03:01,350 --> 01:03:04,420 So deduplication says, I have this data, I have a lot of 1433 01:03:04,420 --> 01:03:05,120 redundant data. 1434 01:03:05,120 --> 01:03:07,380 So A B, A B, A B is redundant. 1435 01:03:07,380 --> 01:03:09,800 So what you want to do is break it up to some kind of 1436 01:03:09,800 --> 01:03:11,830 blocks in here. 1437 01:03:11,830 --> 01:03:15,350 And then one easy way to do is calculate a hash. 
1438 01:03:15,350 --> 01:03:16,740 Because you don't want to compare blocks. 1439 01:03:16,740 --> 01:03:17,640 That's too much. 1440 01:03:17,640 --> 01:03:20,280 N squared comparison of blocks is a lot of comparison. 1441 01:03:20,280 --> 01:03:22,180 You can have some kind of hash calculated for 1442 01:03:22,180 --> 01:03:23,260 each of these blocks. 1443 01:03:23,260 --> 01:03:25,430 And then you can compare the hashes. 1444 01:03:25,430 --> 01:03:26,980 And if the hashes are the same, they 1445 01:03:26,980 --> 01:03:28,190 are the same blocks. 1446 01:03:28,190 --> 01:03:30,430 And then what you can do is you can eliminate most of 1447 01:03:30,430 --> 01:03:35,790 these blocks in there and then keep hashes for each block-- 1448 01:03:35,790 --> 01:03:36,960 only the hash. 1449 01:03:36,960 --> 01:03:40,900 And then what you can do is you can only keep the unique 1450 01:03:40,900 --> 01:03:41,440 blocks in here. 1451 01:03:41,440 --> 01:03:44,600 So even though you have nine in here, only five different 1452 01:03:44,600 --> 01:03:45,910 unique blocks are there. 1453 01:03:45,910 --> 01:03:47,710 So that's a nice way of deduplicating. 1454 01:03:47,710 --> 01:03:51,275 So you actually have what you call a recipe and a common block 1455 01:03:51,275 --> 01:03:53,310 store in here. 1456 01:03:53,310 --> 01:03:56,130 So one way to do that is you can have a recipe and common 1457 01:03:56,130 --> 01:03:59,040 block store for each of the systems in here. 1458 01:03:59,040 --> 01:04:01,120 That's the traditional deduplication. 1459 01:04:01,120 --> 01:04:04,930 Or what you can do is have everybody keep a recipe and 1460 01:04:04,930 --> 01:04:07,150 only have one common block store. 1461 01:04:07,150 --> 01:04:09,690 Just keep one, single common block store, and everybody 1462 01:04:09,690 --> 01:04:11,910 has a recipe, or probably a cache of a recipe.
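The recipe and common block store idea described above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the VMware implementation: the names `deduplicate` and `rebuild`, the 4-byte block size, and the choice of SHA-256 as the hash are all hypothetical and chosen only to keep the example small.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose; real systems use disk-block-sized chunks


def deduplicate(data: bytes, block_store: dict) -> list:
    """Split data into blocks, keep each unique block once, return the recipe."""
    recipe = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        key = hashlib.sha256(block).hexdigest()  # compare hashes, not blocks
        block_store.setdefault(key, block)       # store only the first copy
        recipe.append(key)
    return recipe


def rebuild(recipe: list, block_store: dict) -> bytes:
    """Reassemble the original data from its recipe and the block store."""
    return b"".join(block_store[key] for key in recipe)


# Nine blocks, only five of them unique, as in the slide described above.
store = {}
disk_a = b"AAAABBBBAAAABBBBCCCCDDDDAAAAEEEEBBBB"
recipe_a = deduplicate(disk_a, store)
print(len(recipe_a), len(store))  # 9 blocks in the recipe, 5 in the store

# A second machine sharing the same store adds only its new block.
recipe_b = deduplicate(b"AAAABBBBFFFF", store)
print(len(store))  # 6
```

Passing the same `store` to every machine corresponds to the single shared block store; giving each machine its own dictionary corresponds to the per-system variant.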
1463 01:04:11,910 --> 01:04:16,140 So by doing that, you can even reduce a huge amount of the 1464 01:04:16,140 --> 01:04:19,480 things happening in here. 1465 01:04:19,480 --> 01:04:21,430 So the interesting thing is if you are keeping one common 1466 01:04:21,430 --> 01:04:23,570 block store, who can keep, who can manage? 1467 01:04:23,570 --> 01:04:26,180 That's the interesting question in here. 1468 01:04:26,180 --> 01:04:31,140 So can you keep instead of common block store for each 1469 01:04:31,140 --> 01:04:34,320 processor, each computer, can you keep the common block 1470 01:04:34,320 --> 01:04:36,720 store for the entire world? 1471 01:04:36,720 --> 01:04:39,080 So if you find most of the common blocks in the world, 1472 01:04:39,080 --> 01:04:40,130 keep one store. 1473 01:04:40,130 --> 01:04:44,020 And the nice thing about that is then I can go 1474 01:04:44,020 --> 01:04:44,480 anywhere in the world. 1475 01:04:44,480 --> 01:04:46,880 I can ask for the common blocks for the common things. 1476 01:04:46,880 --> 01:04:48,130 I can populate it myself. 1477 01:04:52,900 --> 01:04:56,070 So here's this interesting system called the Bonsai. 1478 01:04:56,070 --> 01:04:57,930 What they did was-- 1479 01:04:57,930 --> 01:05:00,250 so if you have a block in here, you calculate a hash 1480 01:05:00,250 --> 01:05:02,970 function in here, get a hash key. 1481 01:05:02,970 --> 01:05:04,670 So what that means, the hash key can 1482 01:05:04,670 --> 01:05:06,450 uniquely access this block. 1483 01:05:06,450 --> 01:05:09,170 And then what you want to do is ask that you want to 1484 01:05:09,170 --> 01:05:11,930 compress this block. 1485 01:05:11,930 --> 01:05:14,700 This additional step, I'll explain later why it's needed. 1486 01:05:14,700 --> 01:05:17,320 So the other thing you can do is you can get a second hash 1487 01:05:17,320 --> 01:05:24,160 key and use that as a private key to encrypt this block. 
1488 01:05:24,160 --> 01:05:25,230 Because you calculate two hash keys. 1489 01:05:25,230 --> 01:05:26,500 One is the hash key to identify. 1490 01:05:26,500 --> 01:05:28,810 The other one is a private key to encrypt this block. 1491 01:05:28,810 --> 01:05:31,440 And then what you can look at is this global store to see 1492 01:05:31,440 --> 01:05:33,660 whether this hash key exists. 1493 01:05:33,660 --> 01:05:35,540 If the hash key exists, then say, I got the 1494 01:05:35,540 --> 01:05:36,870 page, here is the page. 1495 01:05:36,870 --> 01:05:37,980 That's the encrypted page. 1496 01:05:37,980 --> 01:05:41,040 And each page will have a unique ID. 1497 01:05:41,040 --> 01:05:42,490 And so here's my unique ID. 1498 01:05:42,490 --> 01:05:46,230 If you find the page in here, what you can do is you can 1499 01:05:46,230 --> 01:05:49,230 only store UID and this private key and 1500 01:05:49,230 --> 01:05:50,710 get rid of my page. 1501 01:05:50,710 --> 01:05:55,040 So storing UID and private key is sufficient to get my page 1502 01:05:55,040 --> 01:05:56,070 and unencrypt it. 1503 01:05:56,070 --> 01:05:59,540 Why do you think I have to compress here? 1504 01:05:59,540 --> 01:06:03,170 Why do I have to basically do encrypt here? 1505 01:06:03,170 --> 01:06:04,550 What's the interesting thing about encryption? 1506 01:06:08,270 --> 01:06:10,150 Why encrypt? 1507 01:06:10,150 --> 01:06:11,400 Assume this is global. 1508 01:06:17,260 --> 01:06:23,520 Because if you don't encrypt, it might be a common page, but 1509 01:06:23,520 --> 01:06:26,490 it might not be something you want everybody to know. 1510 01:06:26,490 --> 01:06:34,640 So assume a large company like Google. 1511 01:06:34,640 --> 01:06:37,620 President of Google, Larry Page, sends everybody emails, 1512 01:06:37,620 --> 01:06:40,420 saying, this is very private, but here is something that's 1513 01:06:40,420 --> 01:06:41,760 happening in the company. 
1514 01:06:41,760 --> 01:06:43,355 And it will get into everybody's mailbox. 1515 01:06:43,355 --> 01:06:44,560 And suddenly, it becomes-- 1516 01:06:44,560 --> 01:06:46,050 aha-- a common page. 1517 01:06:46,050 --> 01:06:49,500 And it gets sucked in the world because 1518 01:06:49,500 --> 01:06:50,050 of the common page. 1519 01:06:50,050 --> 01:06:52,560 And now everybody can see that, and that's not good. 1520 01:06:52,560 --> 01:06:54,840 But now if you have this private key-- 1521 01:06:54,840 --> 01:06:57,620 if you don't have the private key, you can't decrypt that. 1522 01:06:57,620 --> 01:06:59,030 So what happens is-- 1523 01:06:59,030 --> 01:07:01,560 let me go through what you can do in here. 1524 01:07:01,560 --> 01:07:05,760 And then what can happen is if you had these two, UID and 1525 01:07:05,760 --> 01:07:11,460 private key, you can go to the global storage and say, here's 1526 01:07:11,460 --> 01:07:12,430 the UID, give me the page. 1527 01:07:12,430 --> 01:07:13,200 It has a page. 1528 01:07:13,200 --> 01:07:14,690 It better have the page for that UID. 1529 01:07:14,690 --> 01:07:16,030 Get the page out of that. 1530 01:07:16,030 --> 01:07:19,310 And then now I can use my private key to decrypt it. 1531 01:07:19,310 --> 01:07:21,335 And then of course, I can decompress it into 1532 01:07:21,335 --> 01:07:22,460 the original page. 1533 01:07:22,460 --> 01:07:27,550 So by doing that, I can have this global system that keeps 1534 01:07:27,550 --> 01:07:31,010 common pages in there and store them. 1535 01:07:31,010 --> 01:07:33,665 I can basically, really eliminate the storage. 1536 01:07:33,665 --> 01:07:37,830 Because the nice thing is this could be, we'll say, 2K pages. 1537 01:07:37,830 --> 01:07:41,800 And this is 64 bit, and this is probably 256 bit. 1538 01:07:41,800 --> 01:07:44,652 So there's a huge compression of data in there. 1539 01:07:50,320 --> 01:07:52,110 So here are the kinds of decisions you have to make.
1540 01:07:52,110 --> 01:07:54,030 For example, hash key. 1541 01:07:54,030 --> 01:07:55,760 So each page is represented by a hash key. 1542 01:07:58,840 --> 01:08:04,940 But you can have two hash keys, two pages mapping to the 1543 01:08:04,940 --> 01:08:05,960 same hash key. 1544 01:08:05,960 --> 01:08:08,120 So my god, you're not unique. 1545 01:08:08,120 --> 01:08:09,370 So why is it OK? 1546 01:08:12,692 --> 01:08:14,090 AUDIENCE: Low probability. 1547 01:08:14,090 --> 01:08:15,440 PROFESSOR: Very low probability. 1548 01:08:15,440 --> 01:08:17,670 And in fact, what they looked at was they calculated the 1549 01:08:17,670 --> 01:08:19,450 disk failure. 1550 01:08:19,450 --> 01:08:23,210 There's a higher probability of a disk failing and losing data 1551 01:08:23,210 --> 01:08:25,080 than of a hash collision. 1552 01:08:25,080 --> 01:08:26,800 So you can say, look, you're keeping that in the hard 1553 01:08:26,800 --> 01:08:29,540 drive, so there could be more chance of disk failure than 1554 01:08:29,540 --> 01:08:30,140 hash collision. 1555 01:08:30,140 --> 01:08:32,830 So therefore, hash key is-- 1556 01:08:32,830 --> 01:08:34,870 you can do that, low enough probability, you can 1557 01:08:34,870 --> 01:08:36,386 get away with that. 1558 01:08:36,386 --> 01:08:40,840 Actually, I want to skip the rest of this in there. 1559 01:08:40,840 --> 01:08:45,870 So here is the comparison of how much compression you can give. 1560 01:08:45,870 --> 01:08:48,620 Some of them, you almost can compress 100%. 1561 01:08:48,620 --> 01:08:52,220 Because if you have a newly installed Linux box with very 1562 01:08:52,220 --> 01:08:54,899 little data, everything is in the global store. 1563 01:08:54,899 --> 01:08:58,649 So by doing that, basically, here is the communication. 1564 01:08:58,649 --> 01:09:00,290 The compression is cheap. 1565 01:09:00,290 --> 01:09:02,779 You don't have to do too much because you just do this hash 1566 01:09:02,779 --> 01:09:04,220 comparison to do that.
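The "low probability" answer to the hash collision question above can be made a little more concrete with the standard birthday bound: among n randomly hashed blocks with a b-bit hash, the chance that any two collide is at most n(n-1)/2 divided by 2^b. The sketch below only evaluates that bound; the function name, the block count of 2^40, and the hash widths are assumptions picked for illustration, not figures from the lecture.

```python
def collision_probability_bound(num_blocks: int, hash_bits: int) -> float:
    """Birthday-bound upper estimate that any two of num_blocks hashes collide."""
    pairs = num_blocks * (num_blocks - 1) // 2  # number of distinct block pairs
    return min(1.0, pairs / (2 ** hash_bits))   # union bound over all pairs


# Assumed workload: about a trillion (2**40) distinct blocks in the store.
wide = collision_probability_bound(2 ** 40, 256)
narrow = collision_probability_bound(2 ** 40, 64)
print(wide)    # vanishingly small for a 256-bit hash
print(narrow)  # the bound saturates at 1.0 for a 64-bit hash
```

Under these assumed numbers, a wide hash keeps the bound far below any plausible hardware failure rate, while a narrow hash would not be safe as a content identifier at this scale.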
1567 01:09:04,220 --> 01:09:09,100 And then now the communication time is even reduced because 1568 01:09:09,100 --> 01:09:10,490 you communicate a lot less. 1569 01:09:10,490 --> 01:09:11,460 And then you expand. 1570 01:09:11,460 --> 01:09:16,899 So all these three are actually now much faster to 1571 01:09:16,899 --> 01:09:19,140 send a machine across. 1572 01:09:19,140 --> 01:09:22,850 So here is a total size of all the VMs. 1573 01:09:22,850 --> 01:09:27,500 The interesting thing is if you compress within, if you 1574 01:09:27,500 --> 01:09:30,630 just do compression, most of this is zero blocks. 1575 01:09:30,630 --> 01:09:32,500 You can eliminate the zeros and do a simple 1576 01:09:32,500 --> 01:09:33,720 compression in here. 1577 01:09:33,720 --> 01:09:35,939 And if you do local deduplication, if you 1578 01:09:35,939 --> 01:09:37,500 eliminate that, and if you do global, 1579 01:09:37,500 --> 01:09:38,380 you'll get to this point. 1580 01:09:38,380 --> 01:09:44,060 So you get to about basically 30% of all your data. 1581 01:09:44,060 --> 01:09:46,526 And yeah, there's more data in here, but just-- 1582 01:09:50,109 --> 01:09:57,000 So that's an interesting global level system that 1583 01:09:57,000 --> 01:09:58,670 people are building today. 1584 01:09:58,670 --> 01:10:02,590 And I think these kinds of systems appear in many 1585 01:10:02,590 --> 01:10:05,360 different places. 1586 01:10:05,360 --> 01:10:07,265 These days, a lot of people are building things like cell 1587 01:10:07,265 --> 01:10:09,760 phone games and things like that, which have these huge back-end 1588 01:10:09,760 --> 01:10:11,370 stores, back-end computation. 1589 01:10:11,370 --> 01:10:14,790 All those things have this large scalability in there.
1590 01:10:14,790 --> 01:10:19,630 And so the nice thing about performance engineering is even if 1591 01:10:19,630 --> 01:10:21,700 your application doesn't require a huge amount of 1592 01:10:21,700 --> 01:10:25,010 computation or very fast processing, if you're going to 1593 01:10:25,010 --> 01:10:27,880 have millions and millions of users, or if you're expecting 1594 01:10:27,880 --> 01:10:30,880 millions of users, then building these systems, and 1595 01:10:30,880 --> 01:10:32,290 understanding, and building 1596 01:10:32,290 --> 01:10:35,460 scalability is really important. 1597 01:10:35,460 --> 01:10:38,360 And a lot of the things we learned in this class are 1598 01:10:38,360 --> 01:10:40,760 directly applicable there. 1599 01:10:40,760 --> 01:10:42,760 So I have... 1600 01:10:42,760 --> 01:10:48,290 Any other questions before you guys can go finish your 1601 01:10:48,290 --> 01:10:50,850 project report and go have a nice Thanksgiving dinner? 1602 01:10:54,560 --> 01:10:58,830 So everybody is thinking that they will be able to get a 1603 01:10:58,830 --> 01:11:02,660 good handle on what they are doing for the final project? 1604 01:11:02,660 --> 01:11:04,870 Oh come on, there are so many cool things you can do with 1605 01:11:04,870 --> 01:11:06,120 this project. 1606 01:11:09,214 --> 01:11:10,464 AUDIENCE: We're working on it. 1607 01:11:13,040 --> 01:11:14,410 PROFESSOR: And there are-- 1608 01:11:14,410 --> 01:11:14,740 whatever-- 1609 01:11:14,740 --> 01:11:18,410 three or four iPod Nanos waiting. 1610 01:11:18,410 --> 01:11:20,910 So it makes a lot of sense to actually 1611 01:11:20,910 --> 01:11:21,970 really focus on this one. 1612 01:11:21,970 --> 01:11:22,890 This is a fun project. 1613 01:11:22,890 --> 01:11:24,220 This is, in fact, a fun project. 1614 01:11:24,220 --> 01:11:28,410 Because talk to these guys about what they did last year. 1615 01:11:28,410 --> 01:11:31,220 People did a lot of interesting things.
1616 01:11:31,220 --> 01:11:33,030 Because we gave you freedom to actually even change the 1617 01:11:33,030 --> 01:11:34,610 algorithms. 1618 01:11:34,610 --> 01:11:36,930 So you can actually look at the algorithm. 1619 01:11:36,930 --> 01:11:39,720 If you know a little bit of physics and graphics type 1620 01:11:39,720 --> 01:11:42,570 stuff, look at them and say, look, can I even reduce 1621 01:11:42,570 --> 01:11:44,980 computation of how the computation is being done? 1622 01:11:44,980 --> 01:11:47,590 So people got a lot of wins by doing things like that. 1623 01:11:47,590 --> 01:11:50,346 And of course, parallelization matters. 1624 01:11:50,346 --> 01:11:53,330 And there are a lot of optimization possibilities in 1625 01:11:53,330 --> 01:11:55,340 this piece of code. 1626 01:11:55,340 --> 01:11:57,200 So take a look at that. 1627 01:11:57,200 --> 01:11:58,260 Just have a plan. 1628 01:11:58,260 --> 01:12:01,250 Just don't go blindly into it, just have a plan. 1629 01:12:01,250 --> 01:12:03,440 Run it, profile it, get some feedback. 1630 01:12:03,440 --> 01:12:06,450 You need to get this for your presentations. 1631 01:12:06,450 --> 01:12:10,330 So run, profile, get some feedback, have a good plan. 1632 01:12:10,330 --> 01:12:12,660 Go attack it. 1633 01:12:12,660 --> 01:12:14,840 So see you in a week.