The next group of topics in 6.033 is called fault tolerance. And the goal here is to learn how to build reliable systems. An extreme case, or at least our ideal goal, is to try to build systems that will never fail. And what we'll find is that we really can't do that, but what we'll try to do is to build systems which fail less often than if you built them without the principles that we're going to talk about. So the idea is how to build reliable systems.

So in order to understand how to build reliable systems, we need to understand what makes systems unreliable. And that has to do with understanding what faults are. What problems occur in systems that cause systems to fail? And you've actually seen many examples of faults already. Informally, a fault is just some kind of a flaw or a mistake that causes a component or a module not to perform the way it's supposed to perform. And we'll formalize this notion a little bit today as we go along.

So there are many examples of faults, several of which you've already seen. A system could fail because it has a software fault, a bug in a piece of software, so when you run it, it doesn't work the way you expect, and that causes something bad to happen. You might have hardware faults: you store some data on a disk, you go back and read it, and it isn't there, or it's been corrupted. That's an example of a fault that might cause bad things to happen if you build a system that relies on a disk storing data persistently.

You might have design faults. A design fault might be something where you try to, let's say, figure out how much buffering to put into a network switch, and you put in too little buffering, so what ends up happening is too many packets get dropped. So you might just have some bad logic in there, and that causes you to design something that isn't quite going to work out. And you might, of course, have implementation faults, where you have a design, and then you implement it, and you made a mistake in how it's implemented.
And that could cause faults as well. Another example of this kind of fault is an operational fault, sometimes called a human error, where a user actually does something that you didn't anticipate or was told not to do, and that caused bad things to happen.

For all of these faults, there are really two categories, regardless of what kind of fault it is. The first category is latent faults. So an example of a latent fault is, let's say you have a bug in a program where instead of testing if A is less than B, you test if A is greater than B. That's a bug in the program. But until it actually runs, until that line of code runs, this fault isn't actually going to do anything bad. It isn't going to have any adverse effect. And therefore, this fault is an example of a latent fault. Nothing happens until it gets triggered. And when it gets triggered, that latent fault might become an active fault.

Now, the problem when a latent fault becomes an active fault is that when you run that line of code, you might have a mistake coming out at the output, which we're going to call an error. So when an active fault is exercised, it leads to an error. And the problem with errors is that if you're not careful about how you deal with them, and most of what we're going to talk about is how to deal with errors, if you're not careful, an error leads to a failure.

So somewhat more formally, a fault is just any flaw in an underlying component or subsystem that your system is using. Now, if the fault turns out not to be exercised, then there's no error that results, and there's no failure that results. It's only when you have an active fault that you might have an error. And when you have an error, you might have a failure.
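To make the latent-versus-active distinction concrete, here is a minimal Python sketch; the function name and the inputs are made up for illustration, not something from the lecture. The reversed comparison is a latent fault that sits harmlessly in the code until some input exercises that branch; at that point the fault is active, and the wrong output is the error.

```python
def below_limit(a, b):
    """Intended specification: return True exactly when a is less than b."""
    # Latent fault: the comparison is reversed (a > b instead of a < b).
    # Nothing bad happens until this line runs on an input that exposes it.
    return a > b

# Fault stays latent: for equal inputs the buggy test happens to agree
# with the spec, so no error is produced.
print(below_limit(3, 3))   # False, which matches the spec

# Fault becomes active: the spec says True, but the output is False.
# That wrong output is the error; if nothing catches it, a failure follows.
print(below_limit(2, 5))   # False, violating the spec
```

The same code can sit in a deployed system for years as a latent fault; whether it ever turns into a failure depends on whether anything above it detects and masks the error.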
And what we're going to try to do is to understand how to deal with these errors, so that when errors occur we try to hide them, or mask them, or do something such that these errors don't propagate and cause failures.

So the general goal, as I mentioned before, is to build systems that don't fail. And in order to build systems that don't fail, there are two approaches at a 50,000-foot level. Every system is going to be built out of components or modules, and those modules are going to be built out of modules themselves. So one approach might be to make sure that no module ever fails, that no component that you use to build your bigger system ever fails. And it'll turn out, for reasons that will become clear based on an understanding of the techniques we're going to employ, that this is extremely expensive. It's just not going to work out for us to make sure that our disks never fail, and memory never fails, and our networks never fail, and so on. It's just too expensive and nearly impossible.

So what we're going to do instead is to start with unreliable components. And we're going to build reliable systems out of unreliable components, or modules more generally. What this means is that the system that you build had better be tolerant of the faults that these underlying components have, which is why the design of systems that don't fail, or rarely fail, is essentially the same as designing systems that are tolerant of faults, hence fault tolerance. So that's the reason why we care about fault tolerance.

So let's take an example, just to crystallize these notions of faults and failures a little bit more. Let's say you have a big system that has a module. Let's call it M1. And this module uses a couple of other modules, M2 and M3. And let's say M2 uses another module, M4, where "uses" might mean an invocation.
Or imagine this is an RPC call, for example. And let's say that M4 has some component inside it, like a disk or some piece of software, and that component fails. So it has a fault, the fault gets triggered, it becomes active, it leads to an error, and that little component actually fails.

So when this fault becomes a failure, a couple of things could happen. M4, which is the module to which this little failure belongs, can do one of two things. One possibility is that the fault that caused the failure gets exposed to the caller. So M4 hasn't managed to figure out a way to hide this failure from M2, which means that the fault propagates up. The failure becomes visible, and the fault propagates up to M2. And now M2 actually sees the underlying component's failure. So the point here is that this little component fault caused a failure, which caused M4 itself to fail, because M4 couldn't hide the underlying failure and reported something that didn't conform to the specification of M4 out to M2.

Now, as far as M2 is concerned, all that has happened so far is that the failure of this module, M4, has shown up as a fault to M2, because an underlying module has failed. It doesn't mean that M2 has failed. It just means that M2 has now seen a fault. And M2 might manage to hide this fault, which would mean that M1 doesn't actually see anything. It doesn't see the underlying fault that caused the failure at all. But of course, if M2 couldn't hide or couldn't mask this failure, then it would propagate an erroneous output out to M1, an output that didn't conform to the specification of M2, leading M1 to observe this as a fault, and so on. So the general idea is that failures of sub-modules tend to show up as faults in the higher-level module.
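Here is a minimal Python sketch of that M1/M2/M4 picture, assuming the "uses" relationship is an ordinary function call and using a made-up cache as M2's way of masking the fault. It isn't the bank example introduced later in the lecture, just the propagation pattern: M4's failure surfaces in M2 as a fault (an exception), and M2 either masks it or lets it propagate up to M1.

```python
class ComponentFault(Exception):
    """Raised when an underlying component fails to meet its specification."""

CACHE = {"alice": 100}   # hypothetical redundant copy that M2 happens to hold

def m4_read(key):
    # Inside M4, some component (say, a disk) fails. M4 cannot hide it,
    # so M4's output does not conform to its spec: M4 has failed.
    raise ComponentFault("disk sector unreadable")

def m2_lookup(key):
    # M4's failure shows up here as a fault that M2 must deal with.
    try:
        return m4_read(key)
    except ComponentFault:
        if key in CACHE:
            return CACHE[key]   # fault masked: M1 never notices anything
        raise                   # cannot mask: the failure propagates upward

def m1_handle_request(key):
    # From M1's point of view, a failure of M2 is, again, just a fault.
    try:
        return m2_lookup(key)
    except ComponentFault:
        return None

print(m1_handle_request("alice"))   # 100: M2 masked the fault
print(m1_handle_request("bob"))     # None: the failure reached M1
```

The same pattern repeats at every level: each module either masks the faults of the modules below it, or becomes a failed module itself from the perspective of its caller.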
And our goal is to try to design these systems, which use lots of modules and components, so that at the top level we avoid failing overall. Inside, we won't be able to go about making everything failure-free; there might be failures inside sub-modules. But the idea is to ensure, or try to ensure, that M1 itself, the highest-level system, doesn't fail.

So let's start with a few examples. In fact, these are all examples of things that we've already seen. Even though we haven't discussed it as such, we've seen a lot of examples of fault tolerance in the class so far. For example, if you have bad synchronization code, like you didn't use the locking discipline properly, or didn't use any of the other synchronization primitives properly, you might have a software fault that leads to the failure of a module.

Another example that we saw when we talked about networking is routing, where we talked about routing protocols that could handle failures of links. Certain links could fail, leading to certain paths not being usable. But the routing system managed to find other paths around the network. And that was because other paths were available, because the network itself was built with some degree of redundancy underneath, and the routing protocol was able to exploit that.

Another example that we saw, again from networks, is packet loss. We had best-effort networks that would lose packets. And that didn't mean that your actual transfer of a file between the end points would be missing data. We came up with retransmissions as a mechanism, again another form of redundancy, where you try the same thing again to get your data through.

Another example of a failure that we saw was congestion collapse, where there was too much data being sent out into the network too fast, and the network would collapse.
And our solution to this problem was really to shed load, to run the system slower than it otherwise would, by having the people sending data send it slower in order to alleviate the problem. Another example, which we saw briefly last time, was the Domain Name System, where the domain name servers are replicated. So if you couldn't reach one to resolve your domain name, you could go to another one.

And all of these, or most of these, actually use the same techniques that we're going to talk about. All of these techniques are built around some form of redundancy or another, except probably the locking one; all of the others are built around some form of redundancy. And we'll understand this more systematically today and in the next couple of classes.

So our goal here is to develop a systematic approach to building systems that are fault tolerant. And the general approach for all fault-tolerant systems is to use three techniques. The first one we've already seen: don't build a monolithic system. Always build it around modules. And the reason is that it will be easier for us to isolate these modules from one another, and then, when modules fail, it will be easier for us to treat those failures as faults, try to hide those faults, and apply the same technique at every level. Which brings us to the second step: when failures occur, causing errors, we need a plan for the higher-level module to detect errors.

So a failure results in an error. We have to know that it's happened, which means we need techniques to detect it. And, of course, once we detect an error, we have a bunch of things we could do with it. But ideally, if you want to prevent the failure of a system that has observed errors, you need a way to hide these errors. The jargon for this is to mask errors. And if we build systems that do this, then it's possible for us to build systems that conform to spec.
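As a small illustration of the detect-then-mask pattern (a sketch only, not the design this lecture builds up to), suppose a value is stored on three redundant replicas: comparing the copies is the detection step, and majority voting is the masking step. The replica functions here are stand-ins invented for the example.

```python
from collections import Counter

def read_with_voting(replicas):
    """Read one logical value from redundant copies.

    Detection: the copies are compared against one another.
    Masking: if a majority agree, the bad copy is outvoted and the caller
    sees a value that still conforms to the specification.
    """
    votes = Counter(read() for read in replicas)
    value, count = votes.most_common(1)[0]
    if count * 2 <= len(replicas):
        # Error detected but not maskable: report it rather than guess.
        raise RuntimeError("replicas disagree beyond repair")
    return value

# One replica returns a corrupted value; the other two outvote it.
replicas = [lambda: 42, lambda: 42, lambda: 7]
print(read_with_voting(replicas))   # 42
```

All three techniques show up even in this toy: the replicas are separate modules, disagreement is the detection plan, and voting masks the error so the caller never sees it.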
So the goal here is to try to make sure that systems conform to some specification. And if things don't conform to the specification, that's when we call it a failure. Sometimes we play some tricks where, in order to build systems that "never fail", we'll scale back the specification to allow for things that would otherwise be considered failures, but that still conform to the relaxed spec. So we relax the specification to make sure that we can still meet the notion of a failure-free system, or a fault-tolerant system. And we'll see some examples of that in the next lecture.

And as I've already mentioned, the general trick in all of the systems and examples that we're going to study is to use some form of redundancy. That's the way in which we're going to mask errors. Almost every system, in fact every system that I know of that's fault tolerant, uses redundancy in some form or another. Often it's not obvious how it uses it, but it does use redundancy.

So I'm now going to give an example that will turn out to be the same example we'll use for the next three or four lectures. You should probably get familiar with it, because we're going to see it over and over again. It's a really simple example, but it's complicated enough that everything we want to learn about fault tolerance will be visible in it.

It starts with a person who wants to do a bank transaction at an ATM, or a PC, or on some computer. You want to do a bank transaction. And the way this works, as you probably know, is that it goes over some kind of a network. And then, if you want to do this bank transaction, it goes to a server, which is run by your bank. And the way this normally works is that the server has a module that uses a database system, which deals with managing
your account information. And because you don't want to forget, and the bank shouldn't forget, how much money you have, there is data that's stored on disk. And we're going to be doing actions of the following form: we're going to be asking to transfer some amount of money from one account to another account.

Now, of course, anything could fail in between. For example, there could be a problem in the network, and the network could fail. Or the software running on the server could fail. Or the software running the database system could crash, or report bad values, or something. The disk could fail. And we want systematic techniques by which this transfer, and all the calls that look a lot like transfer, do the right thing. "Doing the right thing" is an informal way of saying "meet a specification". So we first have to decide what we want for a specification, something that has to hold true no matter what happens, no matter what failures occur.

One example of a specification might be to say: no matter what happens, if I invoke this transfer and it returns, then this amount of money has to be transferred from this account to that one. That could be a specification you might expect. It turns out this specification is extremely hard to meet, and we're not even going to try. This is the weasel wording I mentioned before about modifying the specification. So we'll change the specification, going forward for this example, to mean: if this call returns, then no matter what failures occur, either the transfer has happened exactly once, or the state of the system is as if the transfer didn't even get started. Which is reasonable. And then, if you really care about moving the money, and you determine that it hasn't been moved, you or some program might try it again, which actually is another form of using redundancy, where you just try the same thing over again.
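Here is a toy in-memory sketch of that relaxed specification, with made-up account names and a single-threaded server assumed; it only states the contract, namely that a normal return means the money moved exactly once, and an exception means the balances are as if the call never started. Making that guarantee hold across crashes, lost messages, and retries is exactly what the next few lectures are about.

```python
accounts = {"A": 100, "B": 50}   # hypothetical balances for illustration

def transfer(src, dst, amount):
    """Relaxed spec: on a normal return the amount has moved exactly once;
    on any exception the state is as if the transfer never started."""
    if accounts[src] < amount:
        raise ValueError("insufficient funds")   # nothing has changed yet
    accounts[src] -= amount
    try:
        accounts[dst] += amount
    except Exception:
        accounts[src] += amount   # undo the partial effect before reporting
        raise

transfer("A", "B", 30)
print(accounts)   # {'A': 70, 'B': 80}
```

In a real system the hard part is that the server can crash between the two updates, or the reply can be lost so the client retries a transfer that already happened; the toy above silently assumes neither of those can occur.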
And we won't completely understand yet why a specification that says the transfer must happen exactly once if the call returns is hard to implement, hard to achieve; we'll see that in the next couple of classes. So, for now, just realize that the specification here is that the transfer should happen exactly once, or it should be as if no partial action corresponding to this transfer ever happened. The state of the system must be as if the system never saw the transfer request.

Now, any module could fail. So let's take some examples of failures, in order to get some terminology that will help us understand faults. One thing that could happen is that you could have a disk failure. The disk could just fail. And one example of a disk failure is that the disk fails and then just stops working, and it tells the database system that's trying to read and write data from it that it isn't working. If that kind of failure happens, where the component just completely stops and tells the higher-level module that it stopped, that's an example of a failure that's called a fail-stop failure. And more generally, any module that tells the higher-level module that it has just stopped working, without reporting anything else, no outputs, that's fail-stop.

Of course, you might have failures that aren't fail-stop. You might have something where there is some kind of error checking associated with every sector on your disk, and the disk might start reporting errors that say this is a bad sector. So it doesn't fail-stop, but it tells the higher level, the database system in this case, that for some data that's been read or written there's a bad sector, which means the checksum doesn't match the data. When you have an error like that, where the component doesn't stop working but tells you that something bad is going on, that's an example of a failure.
That's called a fail-fast failure. I actually don't think most of these terms are particularly important. Fail-stop is usually important and worth knowing, but the reason to go through these terms is more to understand that there are various kinds of failures possible. In one case the component stops working. In another case, it tells you that it's not working correctly but continues operating: it tells you that certain operations haven't been done correctly.

Now, another thing that could happen when, for example, the disk has failed fast is that the database system might decide that write operations are no longer allowed, you're not allowed to write things to disk, because the disk has either failed completely or is fail-fast. But it might still allow actions or requests that are read-only. So, for example, it might allow users to come up to an ATM and just read how much money they have in their account, because there might be a cache of the data in memory in the database. So it might allow read-only actions, in which case the system is functioning with only a subset of the actions that it's supposed to be providing. And if that happens, that kind of failure is called a fail-soft failure, where not all of the interfaces are available, but a subset of the interfaces are available and correctly working.

And the last kind of failure that could happen is, in this example, let's say that failures occur when there's a large number of people trying to make these requests at ATMs. Some problems have arisen, and somebody determines that the problem arises when too many people access the system at the same time. The system might now move to a mode where it allows only a small number of actions at a time, a small number of concurrent actions, or maybe one action at a time.
So one user comes to the system at a time. There has been a failure, but the way the system deals with it is to determine that the failure doesn't get triggered when the load is low. So it might function at low performance. It still provides all of the interfaces, but at lower performance. And that kind of behavior is called fail-safe. The system has moved to a mode where it has scaled back how much work it's willing to do, and does it at degraded performance.

OK, so the plan now, for the rest of today and then from the next lecture on, is to understand algorithms for how you go about building systems that actually do one or all of these things in order to meet the specification we want. But before we do that, you have to understand a little bit about models for faults. In order to build fault-tolerant systems, it's usually a good idea to understand, a little more quantitatively, the models of faults that occur in systems. And primarily this discussion is going to be focused on hardware faults, because most people don't understand how software faults should be modeled. But since all our systems are going to be built on hardware, for example disks are going to be really common and network links are going to be common, and all of those conform nicely to models, it's worth understanding how that works.

So, for example, a disk manufacturer might report the rate of undetected errors. Disks usually have a fair amount of error detection in them, but the manufacturer might report that the rate of undetected errors is, say, ten to the minus 12 or ten to the minus 13. And that number looks really small. It says that out of that many bits, maybe one bit is corrupted and you can't detect it.
But you have to realize that, given modern workloads, take Google as an example from the last recitation, the amount of data being stored in a system like that, or in the world in general, is so huge that a ten to the minus 13 error rate means you're probably seeing some bad data in a file that you can never fix or never detect every couple of days. Network people will tell you that fiber optic links have an error rate of one error in ten to the 12th bits. But these links are sending so many gigabits per second that one error in ten to the 12th means something like an error that you can't detect every couple of hours.

What that really means is that at the higher layers you need to do more work to make sure that your data is protected. You can't simply rely on your underlying components having these amazingly low error rates, because there's so much data being sent or stored on these systems that you need other techniques at a higher layer, if you really care about the integrity of your data.

In addition to these raw numbers, there are two or three other metrics that people use to understand faults and failures. The first one is the number of tolerated failures. For example, if you build a system to store data and you're worried about disks failing or disks returning erroneous values, you might replicate that data across many, many disks. And then when you design your system, one of the things you would want to analyze and report is the number of tolerated disk failures. For example, if you build a system out of seven disks, you might say that you can handle up to two failed disks, or something like that, depending on how you've designed your system. And that's usually a good thing to report, because then people who use your system know how to provision or engineer it.
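The arithmetic behind those error-rate claims is worth making concrete. Here is a small Python sketch; the storage volume and link throughput are made-up assumptions chosen only for illustration, and the point is simply that a tiny per-bit error rate multiplied by a huge number of bits gives an uncomfortably short time between undetected errors.

```python
def hours_between_undetected_errors(bit_error_rate, bits_per_hour):
    """Expected hours between undetected bit errors, assuming errors are
    independent and occur at the quoted per-bit rate."""
    return 1.0 / (bit_error_rate * bits_per_hour)

# Illustrative assumptions, not figures from the lecture:
# a storage system reading about 1 terabyte per day, and a link carrying
# a sustained 100 megabits per second.
storage_bits_per_hour = 1e12 * 8 / 24
link_bits_per_hour = 100e6 * 3600

print(hours_between_undetected_errors(1e-13, storage_bits_per_hour))  # ~30 hours
print(hours_between_undetected_errors(1e-12, link_bits_per_hour))     # ~2.8 hours
```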
The second metric, which we're going to spend a few more minutes on, is something called the mean time to failure. What this says is: take a model where you have a system that starts at time zero, and it's running fine. At some point in time, it fails. When it fails, that failure is made known to an operator, or to some higher level that has a plan to work around it or repair it. It takes some time for the failure to get repaired, and once it's repaired, the system starts running again. And then it fails at some other point in the future. As the system goes through this cycle of failures and repairs, you end up with a timeline that looks like this. You start at time zero, and the system is working fine. Then a failure happens here. Then the system is down for a certain period of time. Then somebody repairs the system, and it continues to work. Then it fails again, and so on.

And each of the durations of time that the system is working, starting at zero, defines a period of time that I'm going to call TTF, or time to fail. So this is the first time-to-fail interval, this is the second time-to-fail interval, this is the third, and so on. Analogously, in between, I can define the time-to-repair intervals, TTR1, TTR2, and so on.

So the mean time to failure is just the mean of these values. There's some duration of time, say three hours, that the system worked, and then it crashed. That's TTF1. Then somebody repaired it, and it worked for six hours. That's TTF2, and so on.
If you run your system for a long enough period of time, like a disk or anything else, and you observe all these time-to-fail samples and take their mean, that gives you the mean time to failure. The reason this is interesting is that you can run your system for a really long period of time and build up a metric called availability.

So, for example, if you're running a website, and the way this website works is that it runs for a while and then every once in a while it crashes, or its network crashes and people can't get to you, you could run it for months or years on end and observe these values. You could do this every month, compute the availability, and decide whether it's good enough or whether you want to make it higher. So you can define your availability to be the fraction of time that your system is up and running. And the fraction of time that the system is up and running is the fraction of time on this timeline covered by these shaded intervals. That's just equal to the sum of all the time-to-failure numbers divided by the total time, and the total time is just the sum of all the TTFs and the TTRs. That's what availability means: the fraction of time that your system is up.

Now, if you divide both the top and the bottom by N, this number works out to be the mean time to failure divided by the mean time to failure plus the mean time to repair: Availability = MTTF / (MTTF + MTTR).

This is a useful notion, because it tells you that you can run your system for a very long period of time, build up mean values of the time to failure and the time to repair, and come up with a notion of what the availability of the system is. And then, based on whether that's high enough or not, you can decide whether you want to improve some aspect of the system and whether it's worth doing.

It turns out that this mean time to failure, and therefore availability, is related, for components, to a notion called the failure rate.
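Before moving on to the failure rate, here is that availability calculation written out as a tiny Python sketch with made-up sample durations; it just checks that summing the intervals and using the mean values give the same answer.

```python
# Hypothetical observations, in hours: how long the system ran before
# each failure, and how long each repair took.
ttf = [3.0, 6.0, 4.5]   # time-to-failure samples
ttr = [0.5, 1.0, 0.5]   # time-to-repair samples

# Availability as the fraction of total time the system was up.
availability = sum(ttf) / (sum(ttf) + sum(ttr))

# The same number via the means: MTTF / (MTTF + MTTR).
mttf = sum(ttf) / len(ttf)
mttr = sum(ttr) / len(ttr)
availability_from_means = mttf / (mttf + mttr)

print(round(availability, 4), round(availability_from_means, 4))   # 0.871 0.871
```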
So let me define the failure rate. The failure rate, also called the hazard function h(t), is defined as the probability that a system or component fails in the interval from t to t plus delta t, given that it is working at time t. So it's a conditional probability: the probability that it fails in the next instant, given that it is correctly working, and has been correctly working, at time t.

If you look at this for a disk, most disks look like the picture shown up here. This is also called the bathtub curve, because it looks like a bathtub. The x axis shows time, and the y axis shows the failure rate. What you see at the left end are new disks. When you take a new component, like a new light bulb or a new disk or anything new, there is a pretty high chance that it will fail early, which is why manufacturers don't sell you things without burning them in first. For semiconductors, this is related to yield: they make a whole batch of chips, burn them in, some fail, and they only ship you the rest. The fraction that survives the burn-in is called the yield. So what you see on the left, the colorful term for it, is infant mortality: things that die when they are really, really young.

And then, once you get past the infant mortality, you end up with a flat conditional probability of failure. What this says, and I'll get back to this in a little bit, is that once you are in the flat region, the probability of failure is essentially independent of what has happened in the past. And you stay there for a while.
And then, if the system has been operating for a while, like a disk that has been running for, say, three or five years, the probability of failure starts going up again, because of wear and tear, which for hardware components is certainly a real effect.

There are a couple of interesting things about this curve that you should realize, particularly when you read specifications for things like disks. Disk manufacturers will report a number like the mean time to failure, and the mean time to failure they report for disks might be 200,000 hours or 300,000 hours. That's a really long period of time; it's on the order of 30 years. So when you look at a number like that, you have to ask whether it means that disks really survive 30 years. And anybody who works with computers knows that most disks don't survive 30 years. What they are actually reporting is the reciprocal of this curve in the flat region, the conditional failure rate during normal operation, when the only failures are completely random ones not related to wear and tear. So when disk manufacturers report a mean time to failure, they are not reporting how long your disk is likely to keep working. What that number really says is that during the period when the disk is operating normally, the probability of a random failure per unit time is one over the mean time to failure.

The other number that they also report, often in smaller print, is the expected operational lifetime. That's usually something like three years, or four years, or five years, whatever it is they report. And that's where this curve starts going up: the point beyond which the probability of failure is above some threshold is what they report as the expected operational lifetime.
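To see what a 300,000-hour MTTF does and does not promise, here is a quick back-of-envelope in Python; the vendor figure is illustrative, and the calculation simply treats one over the MTTF as the random failure rate during the flat region, which is what the quoted number actually means.

```python
mttf_hours = 300_000            # illustrative vendor figure, not a real spec
hours_per_year = 24 * 365

# During the flat region of the bathtub curve, 1/MTTF is the (roughly
# constant) random failure rate, so over one year of operation:
expected_failures_per_year = hours_per_year / mttf_hours
print(f"~{expected_failures_per_year:.1%} chance of a random failure per year")
# ~2.9%, even though the disk is only rated to operate for a few years
# before wear-out takes over.
```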
Now, for software, this curve doesn't really apply, or at least nobody really knows what the curve is for software. What is true for software, though, is infant mortality: the conditional probability of failure is high for new software, which is why you are well advised, the moment a new upgrade of something comes out, to wait a little bit and make sure the bugs are shaken out and things get a little more stable. Prudent people wait a few months and stay a couple of revisions behind. So I do believe that for software, the left side of the curve holds. It's totally unclear that there is a flat region, and it's totally unclear that things start rising again with age.

The reason this curve looks the way it does is largely that hardware is mechanical and suffers wear and tear. But the motivation for this kind of curve actually comes from demographics and human lifespans. This is a picture I got from a website called mortality.org, which is a research project run by demographers. They have amazing data; there's way more data available about human life expectancy and demographics than about software. What it shows is actually the same bathtub curve as in the previous chart. It just doesn't look like it, because the y axis is on a log scale: given that it rises between 0.001 and 0.01, on a linear scale that portion would look essentially flat. So for human beings, the probability of death at a certain age, given that you are alive at that age, follows essentially a bathtub curve. At the left end, of course, there is infant mortality. This is data for the US population in 1999: it starts off with infant mortality, then it's flat for a while, then it rises up.
748 00:37:07,114 --> 00:37:08,530 Now, there's a lot of controversy, 749 00:37:08,530 --> 00:37:12,230 it turns out, over whether the bathtub curve at the right end 750 00:37:12,230 --> 00:37:13,520 holds for human beings or not. 751 00:37:13,520 --> 00:37:14,895 Some people believe it does, 752 00:37:14,895 --> 00:37:16,490 and some people believe it doesn't. 753 00:37:16,490 --> 00:37:18,720 But the point here is that for human beings anyway, 754 00:37:18,720 --> 00:37:22,470 the rule of thumb that insurance companies use for determining 755 00:37:22,470 --> 00:37:27,200 insurance premiums is that the log of the death 756 00:37:27,200 --> 00:37:30,740 rate, the log of the probability of dying at a certain age, 757 00:37:30,740 --> 00:37:35,986 grows linearly with the time that somebody has been alive. 758 00:37:35,986 --> 00:37:37,360 And that's what this graph shows: 759 00:37:37,360 --> 00:37:40,710 on a log scale on the Y axis, you get a line. 760 00:37:40,710 --> 00:37:43,950 And that's what they use for determining insurance premiums. 761 00:37:47,310 --> 00:37:51,700 OK, so the reason this bathtub curve is actually useful 762 00:37:51,700 --> 00:37:56,180 is, well, let's go back here. 763 00:37:56,180 --> 00:37:57,890 The reason both these numbers are useful, 764 00:37:57,890 --> 00:37:59,520 the flat portion of the bathtub curve 765 00:37:59,520 --> 00:38:02,190 and the expected operational lifetime, is the following. 766 00:38:02,190 --> 00:38:04,090 It's not that this flat portion of the curve, 767 00:38:04,090 --> 00:38:06,360 where the disk manufacturer reports the mean time 768 00:38:06,360 --> 00:38:06,860 to failure 769 00:38:06,860 --> 00:38:08,000 of 30 years, 770 00:38:08,000 --> 00:38:10,384 is useless, even 771 00:38:10,384 --> 00:38:12,800 though your disk might only run for three to four years. 772 00:38:12,800 --> 00:38:17,390 The reason is that if you have 773 00:38:17,390 --> 00:38:19,850 a system where you are willing to upgrade 774 00:38:19,850 --> 00:38:22,020 your disks, where you've budgeted 775 00:38:22,020 --> 00:38:26,070 for upgrading your disks every, say, three years, then 776 00:38:26,070 --> 00:38:30,290 you might be better off buying a disk whose expected lifetime is 777 00:38:30,290 --> 00:38:36,230 only five years but whose flat portion is really low. 778 00:38:36,230 --> 00:38:40,050 So in particular, suppose you're given two disks, one of which 779 00:38:40,050 --> 00:38:41,970 has a curve that looks like this, 780 00:38:41,970 --> 00:38:49,350 and another that has a curve that looks like that, and let's 781 00:38:49,350 --> 00:38:55,410 say this one is five years, and this one is three years. 782 00:38:55,410 --> 00:38:57,290 If you're building a system and you've 783 00:38:57,290 --> 00:38:59,740 budgeted for upgrading your disks every four years, 784 00:38:59,740 --> 00:39:03,080 then you're probably better off using the one 785 00:39:03,080 --> 00:39:05,510 with the lower value of mean time 786 00:39:05,510 --> 00:39:07,822 to failure, because its expected lifetime is longer. 787 00:39:07,822 --> 00:39:10,030 But if you're willing to upgrade your disks every two 788 00:39:10,030 --> 00:39:12,730 years or one year, then you might 789 00:39:12,730 --> 00:39:15,225 be better off with the other one, the one with the higher mean time 790 00:39:15,225 --> 00:39:17,600 to failure, even though its expected operational lifetime 791 00:39:17,600 --> 00:39:18,240 is smaller. A small sketch of this tradeoff, with made-up numbers, follows.
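Here is that sketch. Both disks and all of their numbers are hypothetical, invented only to illustrate the decision, and the rule applied is deliberately crude: never keep a disk past its rated lifetime, and otherwise prefer the one with the lower random failure rate.

    # Hypothetical disks illustrating the MTTF vs. expected-lifetime tradeoff.
    # Neither set of numbers comes from a real datasheet.
    disks = {
        "A": {"mttf_hours": 300_000,   "rated_lifetime_years": 5},  # higher failure rate, longer lifetime
        "B": {"mttf_hours": 1_000_000, "rated_lifetime_years": 3},  # lower failure rate, shorter lifetime
    }

    def pick_disk(replacement_interval_years):
        """Crude rule of thumb: rule out any disk that would be kept past its
        rated lifetime (its wear-out region); among the rest, prefer the one
        with the higher MTTF, i.e. the lower random failure rate."""
        ok = {name: spec for name, spec in disks.items()
              if spec["rated_lifetime_years"] >= replacement_interval_years}
        if not ok:
            return None
        return max(ok, key=lambda name: ok[name]["mttf_hours"])

    print(pick_disk(4))   # "A": disk B would be deep into wear-out by year four
    print(pick_disk(2))   # "B": both survive two years, so take the lower failure rate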
792 00:39:18,240 --> 00:39:20,300 So both of these numbers are actually meaningful, 793 00:39:20,300 --> 00:39:24,111 and it depends a lot on how you're planning to use the disk. 794 00:39:24,111 --> 00:39:26,110 I mean, it's a lot like the spare tire on your car. 795 00:39:26,110 --> 00:39:28,310 The spare tire will run perfectly fine 796 00:39:28,310 --> 00:39:30,560 as long as you don't exceed 100 miles. 797 00:39:30,560 --> 00:39:32,280 And the moment you exceed 100 miles, then 798 00:39:32,280 --> 00:39:33,750 you don't want to use it at all. 799 00:39:33,750 --> 00:39:35,650 And it might be a lot cheaper to build a spare tire 800 00:39:35,650 --> 00:39:37,660 that runs just 100 miles, because as the user you 801 00:39:37,660 --> 00:39:45,910 are pretty much guaranteed that you will get to a repair shop 802 00:39:45,910 --> 00:39:53,160 within 100 miles. 803 00:39:53,160 --> 00:40:00,400 It's the same concept. 804 00:40:00,400 --> 00:40:02,024 OK. 805 00:40:02,024 --> 00:40:03,690 So one of the things that we can define, 806 00:40:03,690 --> 00:40:06,140 once we have this conditional failure rate, 807 00:40:06,140 --> 00:40:08,820 is the reliability of the system. 808 00:40:12,180 --> 00:40:16,597 We'll define that as R of t, 809 00:40:16,597 --> 00:40:18,430 the probability that the system is working 810 00:40:18,430 --> 00:40:22,520 at time t, given that it was working at time zero. 811 00:40:22,520 --> 00:40:24,840 Or, more simply, assuming that everything is always 812 00:40:24,840 --> 00:40:27,610 working at time zero, it's the probability 813 00:40:27,610 --> 00:40:37,220 that you're OK at time t. 814 00:40:37,220 --> 00:40:40,630 And it turns out that for components in the flat region 815 00:40:40,630 --> 00:40:45,160 of this curve, h of t, the conditional failure rate, is 816 00:40:45,160 --> 00:40:50,260 a constant. It turns out that for systems that satisfy that, 817 00:40:50,260 --> 00:40:53,020 and that also satisfy the property that failures 818 00:40:53,020 --> 00:40:56,070 form a memoryless process, where the probability 819 00:40:56,070 --> 00:40:58,980 of failure doesn't depend on how long the system's been running, 820 00:40:58,980 --> 00:41:00,563 conditions which disks 821 00:41:00,563 --> 00:41:03,480 apparently do satisfy 822 00:41:03,480 --> 00:41:06,900 during operation, when they're actually 823 00:41:06,900 --> 00:41:09,730 not at the right edge of the curve, 824 00:41:09,730 --> 00:41:13,170 the reliability goes 825 00:41:13,170 --> 00:41:15,520 as a very nice, simple function, which 826 00:41:15,520 --> 00:41:18,580 is an exponentially decaying function, R(t) = e to the minus 827 00:41:18,580 --> 00:41:20,200 t over MTTF. 828 00:41:20,200 --> 00:41:21,710 And this is under two conditions: 829 00:41:21,710 --> 00:41:24,340 h of t has to be flat, and the unconditional failure rate 830 00:41:24,340 --> 00:41:26,256 has to be something that doesn't depend on how 831 00:41:26,256 --> 00:41:27,880 long the system's been running. 832 00:41:27,880 --> 00:41:30,570 And for those systems, it's not hard to show 833 00:41:30,570 --> 00:41:33,110 that your reliability is just an exponentially decaying 834 00:41:33,110 --> 00:41:35,110 function, which means you can do a lot of things, 835 00:41:35,110 --> 00:41:37,530 like predict how long the system is likely to keep running, 836 00:41:37,530 --> 00:41:39,490 and so on. A small sketch of that calculation follows.
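Here is a minimal sketch of that prediction, reusing the illustrative 300,000-hour MTTF from above (not a real spec). It just evaluates R(t) = exp(-t / MTTF) at a few service ages, which is only meaningful in the flat region, before the disk reaches its rated operational lifetime.

    import math

    # Exponential reliability model: R(t) = exp(-t / MTTF).
    # Valid only in the flat region of the bathtub curve.
    HOURS_PER_YEAR = 24 * 365
    MTTF_HOURS = 300_000              # illustrative value, not a real datasheet number

    def reliability(t_hours, mttf_hours=MTTF_HOURS):
        """Probability that the component is still working at time t,
        given that it was working at time 0."""
        return math.exp(-t_hours / mttf_hours)

    for years in (1, 3, 5):
        print(f"R({years} yr) = {reliability(years * HOURS_PER_YEAR):.3f}")
    # Prints roughly 0.971, 0.916, and 0.864 for 1, 3, and 5 years of service.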
837 00:41:39,490 --> 00:41:43,760 And that will tell you when to upgrade things. 838 00:41:43,760 --> 00:41:48,090 OK, so given all of this stuff, we now 839 00:41:48,090 --> 00:41:50,610 want techniques to cope with failures, to cope with faults. 840 00:41:50,610 --> 00:41:52,318 And that's what we're going to be looking 841 00:41:52,318 --> 00:41:55,100 at for the next few lectures. Let's 842 00:41:55,100 --> 00:42:00,537 take one simple example of a system first. 843 00:42:00,537 --> 00:42:02,370 And like I said before, all of these systems 844 00:42:02,370 --> 00:42:04,390 use redundancy in some form. 845 00:42:04,390 --> 00:42:06,650 So the disk fails at a certain rate. 846 00:42:06,650 --> 00:42:09,510 Just put in multiple disks, replicate the data across them, 847 00:42:09,510 --> 00:42:10,940 and then hope that things survive. 848 00:42:14,280 --> 00:42:17,170 So the first kind of redundancy that you 849 00:42:17,170 --> 00:42:20,080 might have, in the example that I just 850 00:42:20,080 --> 00:42:24,130 talked about, is spatial redundancy, where the idea is 851 00:42:24,130 --> 00:42:27,270 that you have multiple copies of the same thing. 852 00:42:27,270 --> 00:42:29,790 And the games we're going to play all 853 00:42:29,790 --> 00:42:33,790 have to do with how we're going to manage all these copies. 854 00:42:33,790 --> 00:42:37,180 And actually, this will turn out to be quite complicated. 855 00:42:37,180 --> 00:42:40,360 We'll use these copies in a number of different ways. 856 00:42:40,360 --> 00:42:43,420 In some examples, we'll apply error-correcting codes 857 00:42:43,420 --> 00:42:48,790 to make copies of the data, or use other codes to replicate 858 00:42:48,790 --> 00:42:49,290 the data. 859 00:42:52,660 --> 00:42:54,970 We might replicate data and make copies 860 00:42:54,970 --> 00:42:57,219 of data in the form of logs, which keep track of things: 861 00:42:57,219 --> 00:42:58,510 you run an operation, 862 00:42:58,510 --> 00:42:59,780 you store some results, 863 00:42:59,780 --> 00:43:02,100 but at the same time, before you store those results, 864 00:43:02,100 --> 00:43:04,840 you also store something in a log, so that if 865 00:43:04,840 --> 00:43:07,200 the original data went away, your log can tell you what 866 00:43:07,200 --> 00:43:09,540 to do. 867 00:43:09,540 --> 00:43:13,300 Or you might just do plain and simple copies 868 00:43:13,300 --> 00:43:18,030 followed by voting. 869 00:43:18,030 --> 00:43:21,550 So the idea is that you have multiple copies of something, 870 00:43:21,550 --> 00:43:23,590 and then you write to all of them. 871 00:43:23,590 --> 00:43:25,350 In the simplest of schemes, you write to all of them, 872 00:43:25,350 --> 00:43:26,740 and then when you want to read something, 873 00:43:26,740 --> 00:43:28,110 you read from all of them. 874 00:43:28,110 --> 00:43:30,350 And then you just, what? 875 00:43:30,350 --> 00:43:31,690 Vote, and go with the majority. 876 00:43:31,690 --> 00:43:35,339 So intuitively that can tolerate a certain number of failures. 877 00:43:35,339 --> 00:43:37,130 And all of these approaches have been used. 878 00:43:37,130 --> 00:43:39,690 And people will continue to build systems 879 00:43:39,690 --> 00:43:41,930 along all of these ideas. 880 00:43:41,930 --> 00:43:44,270 But in addition, we're also going 881 00:43:44,270 --> 00:43:48,460 to look at temporal redundancy. 882 00:43:48,460 --> 00:43:51,850 And the idea here is: try it again. 883 00:43:51,850 --> 00:43:54,890 So this is different from copies.
884 00:43:54,890 --> 00:43:56,440 What it says is you try something. 885 00:43:56,440 --> 00:43:58,856 If it doesn't work, and you determine that it doesn't work, 886 00:43:58,856 --> 00:44:01,570 try it again. 887 00:44:01,570 --> 00:44:05,040 So retry is an example of a temporal trick. 888 00:44:05,040 --> 00:44:08,070 But it will turn out that we'll use not just moving forward 889 00:44:08,070 --> 00:44:10,840 and retrying something that we know should be retried; 890 00:44:10,840 --> 00:44:12,700 we'll also use the trick of undoing things 891 00:44:12,700 --> 00:44:14,130 that we have done. 892 00:44:14,130 --> 00:44:16,860 So we'll move in both directions on the time axis. 893 00:44:16,860 --> 00:44:19,170 We'll retry stuff, but at the same time 894 00:44:19,170 --> 00:44:22,630 we'll also undo things, because sometimes things have happened 895 00:44:22,630 --> 00:44:24,280 that shouldn't have happened. 896 00:44:24,280 --> 00:44:25,380 Things went halfway. 897 00:44:25,380 --> 00:44:28,170 And we really want to back things out. 898 00:44:28,170 --> 00:44:31,725 And we're going to use both of these techniques. 899 00:44:35,980 --> 00:44:41,830 So one example of spatial redundancy is a voting scheme. 900 00:44:46,360 --> 00:44:50,730 And you can apply this to many different kinds of systems. 901 00:44:50,730 --> 00:44:53,800 But let's just apply it to a simple example where 902 00:44:53,800 --> 00:44:55,970 data is stored in multiple locations. 903 00:44:55,970 --> 00:44:57,624 And then whenever data is written, 904 00:44:57,624 --> 00:44:58,790 it's written to all of them. 905 00:44:58,790 --> 00:45:01,160 And then when you read it, you read from all of them, 906 00:45:01,160 --> 00:45:02,530 and then you vote. 907 00:45:05,700 --> 00:45:07,960 And in a simple model where these components are 908 00:45:07,960 --> 00:45:10,660 fail-stop, which means that when they fail, they just fail. 909 00:45:13,210 --> 00:45:17,910 Excuse me, in a simple model where things are not 910 00:45:17,910 --> 00:45:21,550 fail-stop or fail-fast, but just report their data back to you, 911 00:45:21,550 --> 00:45:24,035 so you are voting on the results that come back. 912 00:45:24,035 --> 00:45:26,660 You've written something to them, and when you read things back, 913 00:45:26,660 --> 00:45:29,180 arbitrary values might get returned if there's a failure. 914 00:45:29,180 --> 00:45:31,830 And if there's no failure, correct values get returned. 915 00:45:31,830 --> 00:45:34,572 Then as long as two of these copies are correctly working, 916 00:45:34,572 --> 00:45:36,530 or two of these versions are correctly working, 917 00:45:36,530 --> 00:45:38,238 the vote will actually return to you 918 00:45:38,238 --> 00:45:40,034 the correct output. 919 00:45:40,034 --> 00:45:41,450 And that's the idea behind voting. 920 00:45:41,450 --> 00:45:46,500 So suppose the reliability of each of these components is some R; 921 00:45:46,500 --> 00:45:49,070 that's the probability that the component is working at time t, 922 00:45:49,070 --> 00:45:51,814 according to that definition of reliability. 923 00:45:51,814 --> 00:45:53,980 Then, under the assumption that these are completely 924 00:45:53,980 --> 00:45:55,810 independent of each other, which is a big assumption, 925 00:45:55,810 --> 00:45:57,250 particularly for software.
926 00:45:57,250 --> 00:45:59,590 But it might be a reasonable assumption for something 927 00:45:59,590 --> 00:46:01,430 like a disk. Under the assumption that these 928 00:46:01,430 --> 00:46:04,080 are completely independent, you can write out 929 00:46:04,080 --> 00:46:07,180 the reliability of this triple-voting scheme, 930 00:46:07,180 --> 00:46:11,212 this thing where you are voting on three outputs. 931 00:46:11,212 --> 00:46:12,920 You know that the system is correctly 932 00:46:12,920 --> 00:46:16,230 working if any two of these are correctly working. 933 00:46:16,230 --> 00:46:18,660 So that happens under two conditions. 934 00:46:18,660 --> 00:46:22,930 Either all three are correctly working, which happens with probability R cubed, 935 00:46:22,930 --> 00:46:25,050 or some two of the three are correctly working 936 00:46:25,050 --> 00:46:27,466 and one of them is not, and there are three ways in which you could choose the two 937 00:46:27,466 --> 00:46:29,410 out of the three, 938 00:46:29,410 --> 00:46:33,020 so that contributes 3 R squared times (1 minus R). 939 00:46:33,020 --> 00:46:34,870 And it turns out that this number 940 00:46:34,870 --> 00:46:39,870 is much larger than R when R is close to one. 941 00:46:39,870 --> 00:46:44,140 And, in general, this is bigger than R 942 00:46:44,140 --> 00:46:47,170 when each of the components has high enough reliability, 943 00:46:47,170 --> 00:46:48,290 namely, bigger than one half. 944 00:46:54,650 --> 00:46:57,390 And so, let's say that each of these components 945 00:46:57,390 --> 00:47:00,670 has a reliability of 95%. 946 00:47:00,670 --> 00:47:02,830 If you work this number out, it turns out 947 00:47:02,830 --> 00:47:04,830 to be much higher than 95%, 948 00:47:04,830 --> 00:47:06,810 much closer to one; a quick numeric check of this appears below. 949 00:47:06,810 --> 00:47:08,884 And, of course, this kind of voting 950 00:47:08,884 --> 00:47:11,050 is a bad idea if the reliability of these components 951 00:47:11,050 --> 00:47:11,840 is really low. 952 00:47:11,840 --> 00:47:14,492 I mean, if it's below one half, then chances are that 953 00:47:14,492 --> 00:47:16,700 two of them are just wrong 954 00:47:16,700 --> 00:47:18,170 and agree on that wrong result. 955 00:47:18,170 --> 00:47:22,780 And then voting turns out to reduce the reliability of the system. 956 00:47:22,780 --> 00:47:26,170 Now, in general, you might think that you can build systems out 957 00:47:26,170 --> 00:47:30,010 of this basic voting idea, but for various reasons 958 00:47:30,010 --> 00:47:32,310 it turns out that this idea has limited applicability 959 00:47:32,310 --> 00:47:34,840 for the kinds of things we want to do. 960 00:47:34,840 --> 00:47:36,950 And a lot of that stems from the fact 961 00:47:36,950 --> 00:47:39,649 that, in general, in computer systems, 962 00:47:39,649 --> 00:47:41,940 it's very hard to design components that are completely 963 00:47:41,940 --> 00:47:43,770 independent of each other. 964 00:47:43,770 --> 00:47:47,430 It might work out OK for certain hardware components, where 965 00:47:47,430 --> 00:47:49,480 you might do this voting or other forms 966 00:47:49,480 --> 00:47:51,980 of spatial redundancy that give you 967 00:47:51,980 --> 00:47:55,130 these impressive reliability numbers. 968 00:47:55,130 --> 00:47:57,490 But for software, this independence assumption 969 00:47:57,490 --> 00:47:59,230 turns out to be really hard to meet.
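Here is that numeric check, a minimal sketch assuming three completely independent replicas, each with reliability R; the majority helper and the 0.95 and 0.40 figures are just illustrative. The voted system works whenever at least two replicas work, so its reliability is R^3 + 3 R^2 (1 - R), which is about 0.993 at R = 0.95 and falls below R once R drops under one half.

    from collections import Counter

    def majority(values):
        """Return the value reported by a strict majority of replicas,
        or None if no value has a majority."""
        value, count = Counter(values).most_common(1)[0]
        return value if count > len(values) / 2 else None

    def triple_vote_reliability(r):
        """Reliability of voting over three independent replicas, each with
        reliability r: either all three work, or exactly two of the three do."""
        return r**3 + 3 * r**2 * (1 - r)

    print(majority([42, 42, 17]))                    # 42: one faulty replica is outvoted
    print(round(triple_vote_reliability(0.95), 4))   # 0.9928, noticeably better than 0.95
    print(round(triple_vote_reliability(0.40), 4))   # 0.352, worse than 0.40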
970 00:47:59,230 --> 00:48:01,230 There is, though, an approach to building software like this. 971 00:48:01,230 --> 00:48:02,646 It's called N-version programming. 972 00:48:02,646 --> 00:48:04,550 And it's still a topic of research, 973 00:48:04,550 --> 00:48:08,681 where people are trying to build software systems out of voting. 974 00:48:08,681 --> 00:48:11,180 But you have to pay a lot of attention and care to make sure 975 00:48:11,180 --> 00:48:13,270 that these software components that 976 00:48:13,270 --> 00:48:16,030 are doing the same function are actually independent, 977 00:48:16,030 --> 00:48:18,400 maybe written by different people, running 978 00:48:18,400 --> 00:48:20,320 on different operating systems, and so on. 979 00:48:20,320 --> 00:48:23,800 And that turns out to be a pretty expensive undertaking. 980 00:48:23,800 --> 00:48:25,510 It's still sometimes necessary if you 981 00:48:25,510 --> 00:48:27,400 want to build something highly reliable. 982 00:48:27,400 --> 00:48:31,620 But because of its cost, it's not something 983 00:48:31,620 --> 00:48:34,190 that serves as the sort of cookie-cutter technique 984 00:48:34,190 --> 00:48:36,862 for achieving highly reliable software systems. 985 00:48:36,862 --> 00:48:39,070 And so what we're going to see starting next time 986 00:48:39,070 --> 00:48:40,986 is a somewhat different approach to achieving 987 00:48:40,986 --> 00:48:43,742 software reliability that doesn't rely on voting. 988 00:48:43,742 --> 00:48:45,950 It won't actually achieve the same degree of reliability 989 00:48:45,950 --> 00:48:49,642 as these kinds of systems, but it will achieve 990 00:48:49,642 --> 00:48:51,600 a different kind of reliability that we'll talk 991 00:48:51,600 --> 00:48:54,310 about starting from next time.