The next group of topics in 6.033 is called fault tolerance. And the goal here is to learn how to build reliable systems. An extreme case, or at least our ideal goal, is to try to build systems that will never fail. And what we'll find is that we really can't do that, but what we'll try to do is to build systems which fail less often than if you built them without the principles that we're going to talk about. So the idea is how to build reliable systems.

So in order to understand how to build reliable systems, we need to understand what makes systems unreliable. And that has to do with understanding what faults are. What problems occur in systems that cause systems to fail? And you've actually seen many examples of faults already. Informally, a fault is just some kind of a flaw or a mistake that causes a component or a module not to perform the way it's supposed to perform. And we'll formalize this notion a little bit today as we go along.

So there are many examples of faults, several of which you've already seen. A system could fail because it has a software fault, a bug in a piece of software, so when you run it, it doesn't work the way you expect, and that causes something bad to happen. You might have hardware faults: you store some data on a disk, you go back and read it, and it isn't there, or it's been corrupted. That's an example of a fault that might cause bad things to happen if you build a system that relies on a disk storing data persistently.

You might have design faults. A design fault might be something where you try to, let's say, figure out how much buffering to put into a network switch, and you put in too little buffering, so what ends up happening is too many packets get dropped. So you might just have some bad logic in there, and that causes you to design something that isn't quite going to work out. And you might, of course, have implementation faults, where you have a design, and then you implement it, and you made a mistake in how it's implemented.
And that could cause faults as well. Another example of this kind of fault is an operational fault, sometimes called a human error, where a user actually does something that you didn't anticipate or was told not to do, and that caused bad things to happen.

For all of these faults, there are really two categories, regardless of what kind of fault it is. The first category is latent faults. So an example of a latent fault is, let's say you have a bug in a program where instead of testing if A is less than B, you test if A is greater than B. That's a bug in the program. But until it actually runs, until that line of code runs, this fault isn't actually going to do anything bad. It isn't going to have any adverse effect. And therefore, this fault is an example of a latent fault. Nothing happens until it gets triggered. And when it gets triggered, that latent fault might become an active fault.

Now, the problem when a latent fault becomes an active fault is that when you run that line of code, you might have a mistake coming out at the output, which we're going to call an error. So when an active fault is exercised, it leads to an error. And the problem with errors is that if you're not careful about how you deal with them, and most of what we're going to talk about is how to deal with errors, if you're not careful, an error leads to a failure.

So somewhat more formally, a fault is just any flaw in an underlying component or subsystem that your system is using. Now, if the fault turns out not to be exercised, then there's no error that results, and there's no failure that results. It's only when you have an active fault that you might have an error. And when you have an error, you might have a failure.
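To make the latent-versus-active distinction concrete, here is a minimal Python sketch; the function name and the inputs are made up for illustration, not something from the lecture. The reversed comparison is a latent fault that sits harmlessly in the code until some input exercises that branch; at that point the fault is active, and the wrong output is the error.

```python
def below_limit(a, b):
    """Intended specification: return True exactly when a is less than b."""
    # Latent fault: the comparison is reversed (a > b instead of a < b).
    # Nothing bad happens until this line runs on an input that exposes it.
    return a > b

# Fault stays latent: for equal inputs the buggy test happens to agree
# with the spec, so no error is produced.
print(below_limit(3, 3))   # False, which matches the spec

# Fault becomes active: the spec says True, but the output is False.
# That wrong output is the error; if nothing catches it, a failure follows.
print(below_limit(2, 5))   # False, violating the spec
```

The same code can sit in a deployed system for years as a latent fault; whether it ever turns into a failure depends on whether anything above it detects and masks the error.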
And what we're going to try to do is to understand how to deal with these errors, so that when errors occur we try to hide them, or mask them, or do something such that these errors don't propagate and cause failures.

So the general goal, as I mentioned before, is to build systems that don't fail. And in order to build systems that don't fail, there are two approaches at a 50,000-foot level. Every system is going to be built out of components or modules, and those modules are going to be built out of modules themselves. So one approach might be to make sure that no module ever fails, that no component that you use to build your bigger system ever fails. And it'll turn out, for reasons that will become clear based on an understanding of the techniques we're going to employ, that this is extremely expensive. It's just not going to work out for us to make sure that our disks never fail, and memory never fails, and our networks never fail, and so on. It's just too expensive and nearly impossible.

So what we're going to do instead is to start with unreliable components. And we're going to build reliable systems out of unreliable components, or modules more generally. What this means is that the system that you build had better be tolerant of the faults that these underlying components have, which is why the design of systems that don't fail, or rarely fail, is essentially the same as designing systems that are tolerant of faults, hence fault tolerance. So that's the reason why we care about fault tolerance.

So let's take an example, just to crystallize these notions of faults and failures a little bit more. Let's say you have a big system that has a module. Let's call it M1. And this module uses a couple of other modules, M2 and M3. And let's say M2 uses another module, M4, where "uses" might mean an invocation.
Or imagine this is an RPC call, for example. And let's say that M4 has some component inside it, like a disk or some piece of software, and that component fails. So it has a fault, the fault gets triggered, it becomes active, it leads to an error, and that little component actually fails.

So when this fault becomes a failure, a couple of things could happen. M4, which is the module to which this little failure belongs, can do one of two things. One possibility is that the fault that caused the failure gets exposed to the caller. So M4 hasn't managed to figure out a way to hide this failure from M2, which means that the fault propagates up. The failure becomes visible, and the fault propagates up to M2. And now M2 actually sees the underlying component's failure. So the point here is that this little component fault caused a failure, which caused M4 itself to fail, because M4 couldn't hide the underlying failure and reported something that didn't conform to the specification of M4 out to M2.

Now, as far as M2 is concerned, all that has happened so far is that the failure of this module, M4, has shown up as a fault to M2, because an underlying module has failed. It doesn't mean that M2 has failed. It just means that M2 has now seen a fault. And M2 might manage to hide this fault, which would mean that M1 doesn't actually see anything. It doesn't see the underlying fault that caused the failure at all. But of course, if M2 couldn't hide or couldn't mask this failure, then it would propagate an erroneous output out to M1, an output that didn't conform to the specification of M2, leading M1 to observe this as a fault, and so on. So the general idea is that failures of sub-modules tend to show up as faults in the higher-level module.
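Here is a minimal Python sketch of that M1/M2/M4 picture, assuming the "uses" relationship is an ordinary function call and using a made-up cache as M2's way of masking the fault. It isn't the bank example introduced later in the lecture, just the propagation pattern: M4's failure surfaces in M2 as a fault (an exception), and M2 either masks it or lets it propagate up to M1.

```python
class ComponentFault(Exception):
    """Raised when an underlying component fails to meet its specification."""

CACHE = {"alice": 100}   # hypothetical redundant copy that M2 happens to hold

def m4_read(key):
    # Inside M4, some component (say, a disk) fails. M4 cannot hide it,
    # so M4's output does not conform to its spec: M4 has failed.
    raise ComponentFault("disk sector unreadable")

def m2_lookup(key):
    # M4's failure shows up here as a fault that M2 must deal with.
    try:
        return m4_read(key)
    except ComponentFault:
        if key in CACHE:
            return CACHE[key]   # fault masked: M1 never notices anything
        raise                   # cannot mask: the failure propagates upward

def m1_handle_request(key):
    # From M1's point of view, a failure of M2 is, again, just a fault.
    try:
        return m2_lookup(key)
    except ComponentFault:
        return None

print(m1_handle_request("alice"))   # 100: M2 masked the fault
print(m1_handle_request("bob"))     # None: the failure reached M1
```

The same pattern repeats at every level: each module either masks the faults of the modules below it, or becomes a failed module itself from the perspective of its caller.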
And our goal is to try to design these systems, which use lots of modules and components, so that at the top level we avoid failing overall. Inside, we won't be able to go about making everything failure-free; there might be failures inside sub-modules. But the idea is to ensure, or try to ensure, that M1 itself, the highest-level system, doesn't fail.

So let's start with a few examples. In fact, these are all examples of things that we've already seen. Even though we haven't discussed it as such, we've seen a lot of examples of fault tolerance in the class so far. For example, if you have bad synchronization code, like you didn't use the locking discipline properly, or didn't use any of the other synchronization primitives properly, you might have a software fault that leads to the failure of a module.

Another example that we saw when we talked about networking is routing, where we talked about routing protocols that could handle failures of links. Certain links could fail, leading to certain paths not being usable. But the routing system managed to find other paths around the network. And that was because other paths were available, because the network itself was built with some degree of redundancy underneath, and the routing protocol was able to exploit that.

Another example that we saw, again from networks, is packet loss. We had best-effort networks that would lose packets. And that didn't mean that your actual transfer of a file between the end points would be missing data. We came up with retransmissions as a mechanism, again another form of redundancy, where you try the same thing again to get your data through.

Another example of a failure that we saw was congestion collapse, where there was too much data being sent out into the network too fast, and the network would collapse.
And our solution to this problem was really to shed load, to run the system slower than it otherwise would, by having the people sending data send it slower in order to alleviate the problem. Another example, which we saw briefly last time, was the Domain Name System, where the domain name servers are replicated. So if you couldn't reach one to resolve your domain name, you could go to another one.

And all of these, or most of these, actually use the same techniques that we're going to talk about. All of these techniques are built around some form of redundancy or another, except probably the locking one; all of the others are built around some form of redundancy. And we'll understand this more systematically today and in the next couple of classes.

So our goal here is to develop a systematic approach to building systems that are fault tolerant. And the general approach for all fault-tolerant systems is to use three techniques. The first one we've already seen: don't build a monolithic system. Always build it around modules. And the reason is that it will be easier for us to isolate these modules from one another, and then, when modules fail, it will be easier for us to treat those failures as faults, try to hide those faults, and apply the same technique at every level. Which brings us to the second step: when failures occur, causing errors, we need a plan for the higher-level module to detect errors.

So a failure results in an error. We have to know that it's happened, which means we need techniques to detect it. And, of course, once we detect an error, we have a bunch of things we could do with it. But ideally, if you want to prevent the failure of a system that has observed errors, you need a way to hide these errors. The jargon for this is to mask errors. And if we build systems that do this, then it's possible for us to build systems that conform to spec.
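As a small illustration of the detect-then-mask pattern (a sketch only, not the design this lecture builds up to), suppose a value is stored on three redundant replicas: comparing the copies is the detection step, and majority voting is the masking step. The replica functions here are stand-ins invented for the example.

```python
from collections import Counter

def read_with_voting(replicas):
    """Read one logical value from redundant copies.

    Detection: the copies are compared against one another.
    Masking: if a majority agree, the bad copy is outvoted and the caller
    sees a value that still conforms to the specification.
    """
    votes = Counter(read() for read in replicas)
    value, count = votes.most_common(1)[0]
    if count * 2 <= len(replicas):
        # Error detected but not maskable: report it rather than guess.
        raise RuntimeError("replicas disagree beyond repair")
    return value

# One replica returns a corrupted value; the other two outvote it.
replicas = [lambda: 42, lambda: 42, lambda: 7]
print(read_with_voting(replicas))   # 42
```

All three techniques show up even in this toy: the replicas are separate modules, disagreement is the detection plan, and voting masks the error so the caller never sees it.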
So the goal here is to try to make sure that systems conform to some specification. And if things don't conform to the specification, that's when we call it a failure. Sometimes we play some tricks where, in order to build systems that "never fail", we'll scale back the specification to allow for things that would otherwise be considered failures, but that still conform to the relaxed spec. So we relax the specification to make sure that we can still meet the notion of a failure-free system, or a fault-tolerant system. And we'll see some examples of that in the next lecture.

And as I've already mentioned, the general trick in all of the systems and examples that we're going to study is to use some form of redundancy. That's the way in which we're going to mask errors. Almost every system, in fact every system that I know of that's fault tolerant, uses redundancy in some form or another. Often it's not obvious how it uses it, but it does use redundancy.

So I'm now going to give an example that will turn out to be the same example we'll use for the next three or four lectures. You should probably get familiar with it, because we're going to see it over and over again. It's a really simple example, but it's complicated enough that everything we want to learn about fault tolerance will be visible in it.

It starts with a person who wants to do a bank transaction at an ATM, or a PC, or on some computer. You want to do a bank transaction. And the way this works, as you probably know, is that it goes over some kind of a network. And then, if you want to do this bank transaction, it goes to a server, which is run by your bank. And the way this normally works is that the server has a module that uses a database system, which deals with managing
your account information. And because you don't want to forget, and the bank shouldn't forget, how much money you have, there is data that's stored on disk. And we're going to be doing actions of the following form: we're going to be asking to transfer some amount of money from one account to another account.

Now, of course, anything could fail in between. For example, there could be a problem in the network, and the network could fail. Or the software running on the server could fail. Or the software running the database system could crash, or report bad values, or something. The disk could fail. And we want systematic techniques by which this transfer, and all the calls that look a lot like transfer, do the right thing. "Doing the right thing" is an informal way of saying "meet a specification". So we first have to decide what we want for a specification, something that has to hold true no matter what happens, no matter what failures occur.

One example of a specification might be to say: no matter what happens, if I invoke this transfer and it returns, then this amount of money has to be transferred from this account to that one. That could be a specification you might expect. It turns out this specification is extremely hard to meet, and we're not even going to try. This is the weasel wording I mentioned before about modifying the specification. So we'll change the specification, going forward for this example, to mean: if this call returns, then no matter what failures occur, either the transfer has happened exactly once, or the state of the system is as if the transfer didn't even get started. Which is reasonable. And then, if you really care about moving the money, and you determine that it hasn't been moved, you or some program might try it again, which actually is another form of using redundancy, where you just try the same thing over again.
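Here is a toy in-memory sketch of that relaxed specification, with made-up account names and a single-threaded server assumed; it only states the contract, namely that a normal return means the money moved exactly once, and an exception means the balances are as if the call never started. Making that guarantee hold across crashes, lost messages, and retries is exactly what the next few lectures are about.

```python
accounts = {"A": 100, "B": 50}   # hypothetical balances for illustration

def transfer(src, dst, amount):
    """Relaxed spec: on a normal return the amount has moved exactly once;
    on any exception the state is as if the transfer never started."""
    if accounts[src] < amount:
        raise ValueError("insufficient funds")   # nothing has changed yet
    accounts[src] -= amount
    try:
        accounts[dst] += amount
    except Exception:
        accounts[src] += amount   # undo the partial effect before reporting
        raise

transfer("A", "B", 30)
print(accounts)   # {'A': 70, 'B': 80}
```

In a real system the hard part is that the server can crash between the two updates, or the reply can be lost so the client retries a transfer that already happened; the toy above silently assumes neither of those can occur.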
And we won't completely understand yet why a specification that says the transfer must happen exactly once if the call returns is hard to implement, hard to achieve; we'll see that in the next couple of classes. So, for now, just realize that the specification here is that the transfer should happen exactly once, or it should be as if no partial action corresponding to this transfer ever happened. The state of the system must be as if the system never saw the transfer request.

Now, any module could fail. So let's take some examples of failures, in order to get some terminology that will help us understand faults. One thing that could happen is that you could have a disk failure. The disk could just fail. And one example of a disk failure is that the disk fails and then just stops working, and it tells the database system that's trying to read and write data from it that it isn't working. If that kind of failure happens, where the component just completely stops and tells the higher-level module that it stopped, that's an example of a failure that's called a fail-stop failure. And more generally, any module that tells the higher-level module that it has just stopped working, without reporting anything else, no outputs, that's fail-stop.

Of course, you might have failures that aren't fail-stop. You might have something where there is some kind of error checking associated with every sector on your disk, and the disk might start reporting errors that say this is a bad sector. So it doesn't fail-stop, but it tells the higher level, the database system in this case, that for some data that's been read or written there's a bad sector, which means the checksum doesn't match the data. When you have an error like that, where the component doesn't stop working but tells you that something bad is going on, that's an example of a failure.
That's called a fail-fast failure. I actually don't think most of these terms are particularly important. Fail-stop is usually important and worth knowing, but the reason to go through these terms is more to understand that there are various kinds of failures possible. In one case the component stops working. In another case, it tells you that it's not working correctly but continues operating: it tells you that certain operations haven't been done correctly.

Now, another thing that could happen when, for example, the disk has failed fast is that the database system might decide that write operations are no longer allowed, you're not allowed to write things to disk, because the disk has either failed completely or is fail-fast. But it might still allow actions or requests that are read-only. So, for example, it might allow users to come up to an ATM and just read how much money they have in their account, because there might be a cache of the data in memory in the database. So it might allow read-only actions, in which case the system is functioning with only a subset of the actions that it's supposed to be providing. And if that happens, that kind of failure is called a fail-soft failure, where not all of the interfaces are available, but a subset of the interfaces are available and correctly working.

And the last kind of failure that could happen is, in this example, let's say that failures occur when there's a large number of people trying to make these requests at ATMs. Some problems have arisen, and somebody determines that the problem arises when too many people access the system at the same time. The system might now move to a mode where it allows only a small number of actions at a time, a small number of concurrent actions, or maybe one action at a time.
So one user comes to the system at a time. There has been a failure, but the way the system deals with it is to determine that the failure doesn't get triggered when the load is low. So it might function at low performance. It still provides all of the interfaces, but at lower performance. And that kind of behavior is called fail-safe. The system has moved to a mode where it has scaled back how much work it's willing to do, and does it at degraded performance.

OK, so the plan now, for the rest of today and then from the next lecture on, is to understand algorithms for how you go about building systems that actually do one or all of these things in order to meet the specification we want. But before we do that, you have to understand a little bit about models for faults. In order to build fault-tolerant systems, it's usually a good idea to understand, a little more quantitatively, the models of faults that occur in systems. And primarily this discussion is going to be focused on hardware faults, because most people don't understand how software faults should be modeled. But since all our systems are going to be built on hardware, for example disks are going to be really common and network links are going to be common, and all of those conform nicely to models, it's worth understanding how that works.

So, for example, a disk manufacturer might report the rate of undetected errors. Disks usually have a fair amount of error detection in them, but the manufacturer might report that the rate of undetected errors is, say, ten to the minus 12 or ten to the minus 13. And that number looks really small. It says that out of that many bits, maybe one bit is corrupted and you can't detect it.
But you have to realize that, given modern workloads, take Google as an example from the last recitation, the amount of data being stored in a system like that, or in the world in general, is so huge that a ten to the minus 13 error rate means you're probably seeing some bad data in a file that you can never fix or never detect every couple of days. Network people will tell you that fiber optic links have an error rate of one error in ten to the 12th bits. But these links are sending so many gigabits per second that one error in ten to the 12th means something like an error that you can't detect every couple of hours.

What that really means is that at the higher layers you need to do more work to make sure that your data is protected. You can't simply rely on your underlying components having these amazingly low error rates, because there's so much data being sent or stored on these systems that you need other techniques at a higher layer, if you really care about the integrity of your data.

In addition to these raw numbers, there are two or three other metrics that people use to understand faults and failures. The first one is the number of tolerated failures. For example, if you build a system to store data and you're worried about disks failing or disks returning erroneous values, you might replicate that data across many, many disks. And then when you design your system, one of the things you would want to analyze and report is the number of tolerated disk failures. For example, if you build a system out of seven disks, you might say that you can handle up to two failed disks, or something like that, depending on how you've designed your system. And that's usually a good thing to report, because then people who use your system know how to provision or engineer it.
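The arithmetic behind those error-rate claims is worth making concrete. Here is a small Python sketch; the storage volume and link throughput are made-up assumptions chosen only for illustration, and the point is simply that a tiny per-bit error rate multiplied by a huge number of bits gives an uncomfortably short time between undetected errors.

```python
def hours_between_undetected_errors(bit_error_rate, bits_per_hour):
    """Expected hours between undetected bit errors, assuming errors are
    independent and occur at the quoted per-bit rate."""
    return 1.0 / (bit_error_rate * bits_per_hour)

# Illustrative assumptions, not figures from the lecture:
# a storage system reading about 1 terabyte per day, and a link carrying
# a sustained 100 megabits per second.
storage_bits_per_hour = 1e12 * 8 / 24
link_bits_per_hour = 100e6 * 3600

print(hours_between_undetected_errors(1e-13, storage_bits_per_hour))  # ~30 hours
print(hours_between_undetected_errors(1e-12, link_bits_per_hour))     # ~2.8 hours
```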
The second metric, which we're going to spend a few more minutes on, is something called the mean time to failure. What this says is: take a model where you have a system that starts at time zero, and it's running fine. At some point in time, it fails. When it fails, that failure is made known to an operator, or to some higher level that has a plan to work around it or repair it. It takes some time for the failure to get repaired, and once it's repaired, the system starts running again. And then it fails at some other point in the future. As the system goes through this cycle of failures and repairs, you end up with a timeline that looks like this. You start at time zero, and the system is working fine. Then a failure happens here. Then the system is down for a certain period of time. Then somebody repairs the system, and it continues to work. Then it fails again, and so on.

And each of the durations of time that the system is working, starting at zero, defines a period of time that I'm going to call TTF, or time to fail. So this is the first time-to-fail interval, this is the second time-to-fail interval, this is the third, and so on. Analogously, in between, I can define the time-to-repair intervals, TTR1, TTR2, and so on.

So the mean time to failure is just the mean of these values. There's some duration of time, say three hours, that the system worked, and then it crashed. That's TTF1. Then somebody repaired it, and it worked for six hours. That's TTF2, and so on.
If you run your system for a long enough period of time, like a disk or anything else, and you observe all these time-to-fail samples and take their mean, that gives you the mean time to failure. The reason this is interesting is that you can run your system for a really long period of time and build up a metric called availability.

So, for example, if you're running a website, and the way this website works is that it runs for a while and then every once in a while it crashes, or its network crashes and people can't get to you, you could run it for months or years on end and observe these values. You could do this every month, compute the availability, and decide whether it's good enough or whether you want to make it higher. So you can define your availability to be the fraction of time that your system is up and running. And the fraction of time that the system is up and running is the fraction of time on this timeline covered by these shaded intervals. That's just equal to the sum of all the time-to-failure numbers divided by the total time, and the total time is just the sum of all the TTFs and the TTRs. That's what availability means: the fraction of time that your system is up.

Now, if you divide both the top and the bottom by N, this number works out to be the mean time to failure divided by the mean time to failure plus the mean time to repair: Availability = MTTF / (MTTF + MTTR).

This is a useful notion, because it tells you that you can run your system for a very long period of time, build up mean values of the time to failure and the time to repair, and come up with a notion of what the availability of the system is. And then, based on whether that's high enough or not, you can decide whether you want to improve some aspect of the system and whether it's worth doing.

It turns out that this mean time to failure, and therefore availability, is related, for components, to a notion called the failure rate.
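Before moving on to the failure rate, here is that availability calculation written out as a tiny Python sketch with made-up sample durations; it just checks that summing the intervals and using the mean values give the same answer.

```python
# Hypothetical observations, in hours: how long the system ran before
# each failure, and how long each repair took.
ttf = [3.0, 6.0, 4.5]   # time-to-failure samples
ttr = [0.5, 1.0, 0.5]   # time-to-repair samples

# Availability as the fraction of total time the system was up.
availability = sum(ttf) / (sum(ttf) + sum(ttr))

# The same number via the means: MTTF / (MTTF + MTTR).
mttf = sum(ttf) / len(ttf)
mttr = sum(ttr) / len(ttr)
availability_from_means = mttf / (mttf + mttr)

print(round(availability, 4), round(availability_from_means, 4))   # 0.871 0.871
```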
So let me define the failure rate. The failure rate, also called the hazard function h(t), is defined as the probability that a system or component fails in the interval from t to t plus delta t, given that it is working at time t. So it's a conditional probability: the probability that it fails in the next instant, given that it is correctly working, and has been correctly working, at time t.

If you look at this for a disk, most disks look like the picture shown up here. This is also called the bathtub curve, because it looks like a bathtub. The x axis shows time, and the y axis shows the failure rate. What you see at the left end are new disks. When you take a new component, like a new light bulb or a new disk or anything new, there is a pretty high chance that it will fail early, which is why manufacturers don't sell you things without burning them in first. For semiconductors, this is related to yield: they make a whole batch of chips, burn them in, some fail, and they only ship you the rest. The fraction that survives the burn-in is called the yield. So what you see on the left, the colorful term for it, is infant mortality: things that die when they are really, really young.

And then, once you get past the infant mortality, you end up with a flat conditional probability of failure. What this says, and I'll get back to this in a little bit, is that once you are in the flat region, the probability of failure is essentially independent of what has happened in the past. And you stay there for a while.
And then, if the system has been operating for a while, like a disk that has been running for, say, three or five years, the probability of failure starts going up again, because of wear and tear, which for hardware components is certainly a real effect.

There are a couple of interesting things about this curve that you should realize, particularly when you read specifications for things like disks. Disk manufacturers will report a number like the mean time to failure, and the mean time to failure they report for disks might be 200,000 hours or 300,000 hours. That's a really long period of time; it's on the order of 30 years. So when you look at a number like that, you have to ask whether it means that disks really survive 30 years. And anybody who works with computers knows that most disks don't survive 30 years. What they are actually reporting is the reciprocal of this curve in the flat region, the conditional failure rate during normal operation, when the only failures are completely random ones not related to wear and tear. So when disk manufacturers report a mean time to failure, they are not reporting how long your disk is likely to keep working. What that number really says is that during the period when the disk is operating normally, the probability of a random failure per unit time is one over the mean time to failure.

The other number that they also report, often in smaller print, is the expected operational lifetime. That's usually something like three years, or four years, or five years, whatever it is they report. And that's where this curve starts going up: the point beyond which the probability of failure is above some threshold is what they report as the expected operational lifetime.
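To see what a 300,000-hour MTTF does and does not promise, here is a quick back-of-envelope in Python; the vendor figure is illustrative, and the calculation simply treats one over the MTTF as the random failure rate during the flat region, which is what the quoted number actually means.

```python
mttf_hours = 300_000            # illustrative vendor figure, not a real spec
hours_per_year = 24 * 365

# During the flat region of the bathtub curve, 1/MTTF is the (roughly
# constant) random failure rate, so over one year of operation:
expected_failures_per_year = hours_per_year / mttf_hours
print(f"~{expected_failures_per_year:.1%} chance of a random failure per year")
# ~2.9%, even though the disk is only rated to operate for a few years
# before wear-out takes over.
```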
Now, for software, this curve doesn't really apply, or at least nobody really knows what the curve is for software. What is true for software, though, is infant mortality: the conditional probability of failure is high for new software, which is why you are well advised, the moment a new upgrade of something comes out, to wait a little bit and make sure the bugs are shaken out and things get a little more stable. Prudent people wait a few months and stay a couple of revisions behind. So I do believe that for software, the left side of the curve holds. It's totally unclear that there is a flat region, and it's totally unclear that things start rising again with age.

The reason this curve looks the way it does is largely that hardware is mechanical and suffers wear and tear. But the motivation for this kind of curve actually comes from demographics and human lifespans. This is a picture I got from a website called mortality.org, which is a research project run by demographers. They have amazing data; there's way more data available about human life expectancy and demographics than about software. What it shows is actually the same bathtub curve as in the previous chart. It just doesn't look like it, because the y axis is on a log scale: given that it rises between 0.001 and 0.01, on a linear scale that portion would look essentially flat. So for human beings, the probability of death at a certain age, given that you are alive at that age, follows essentially a bathtub curve. At the left end, of course, there is infant mortality. This is data for the US population in 1999: it starts off with infant mortality, then it's flat for a while, then it rises up.
748 00:37:07,114 --> 00:37:08,530 Now, there's a lot of controversy, 749 00:37:08,530 --> 00:37:12,230 it turns out, over whether the bathtub curve at the right end 750 00:37:12,230 --> 00:37:13,520 holds for human beings or not. 751 00:37:13,520 --> 00:37:14,895 Some people believe it does, 752 00:37:14,895 --> 00:37:16,490 and some people believe it doesn't. 753 00:37:16,490 --> 00:37:18,720 But the point here is that for human beings anyway, 754 00:37:18,720 --> 00:37:22,470 the rule of thumb that insurance companies use for determining 755 00:37:22,470 --> 00:37:27,200 insurance premiums is that the log of the death 756 00:37:27,200 --> 00:37:30,740 rate, the log of the probability of dying at a certain age, 757 00:37:30,740 --> 00:37:35,986 grows linearly with the time that somebody has been alive. 758 00:37:35,986 --> 00:37:37,360 And that's what this graph shows: 759 00:37:37,360 --> 00:37:40,710 on a log scale on the Y axis, you get a line. 760 00:37:40,710 --> 00:37:43,950 And that's what they use for determining insurance premiums. 761 00:37:47,310 --> 00:37:51,700 OK, so the reason this bathtub curve is actually useful 762 00:37:51,700 --> 00:37:56,180 is, well, let's go back here. 763 00:37:56,180 --> 00:37:57,890 The reason both these numbers are useful, 764 00:37:57,890 --> 00:37:59,520 the flat portion of the bathtub curve 765 00:37:59,520 --> 00:38:02,190 and the expected operational lifetime, is the following. 766 00:38:02,190 --> 00:38:04,090 It's not that this flat portion of the curve, 767 00:38:04,090 --> 00:38:06,360 where the disk manufacturer reports the mean time 768 00:38:06,360 --> 00:38:06,860 to failure 769 00:38:06,860 --> 00:38:08,000 of 30 years, 770 00:38:08,000 --> 00:38:10,384 is useless, even 771 00:38:10,384 --> 00:38:12,800 though your disk might only run for three to four years. 772 00:38:12,800 --> 00:38:17,390 The reason is that if you have 773 00:38:17,390 --> 00:38:19,850 a system where you are willing to upgrade 774 00:38:19,850 --> 00:38:22,020 your disks, where you've budgeted 775 00:38:22,020 --> 00:38:26,070 for upgrading your disks every, say, three years, then 776 00:38:26,070 --> 00:38:30,290 you might be better off buying a disk whose expected lifetime is 777 00:38:30,290 --> 00:38:36,230 only five years but whose flat portion is really low. 778 00:38:36,230 --> 00:38:40,050 So in particular, suppose you're given two disks, one of which 779 00:38:40,050 --> 00:38:41,970 has a curve that looks like this, 780 00:38:41,970 --> 00:38:49,350 and another that has a curve that looks like that, and let's 781 00:38:49,350 --> 00:38:55,410 say this one is five years, and this one is three years. 782 00:38:55,410 --> 00:38:57,290 If you're building a system and you've 783 00:38:57,290 --> 00:38:59,740 budgeted for upgrading your disks every four years, 784 00:38:59,740 --> 00:39:03,080 then you're probably better off using the one 785 00:39:03,080 --> 00:39:05,510 with the lower value of mean time 786 00:39:05,510 --> 00:39:07,822 to failure, because its expected lifetime is longer. 787 00:39:07,822 --> 00:39:10,030 But if you're willing to upgrade your disks every two 788 00:39:10,030 --> 00:39:12,730 years or one year, then you might 789 00:39:12,730 --> 00:39:15,225 be better off with the other one, the one with the higher mean time 790 00:39:15,225 --> 00:39:17,600 to failure, even though its expected operational lifetime 791 00:39:17,600 --> 00:39:18,240 is smaller. A small sketch of this tradeoff, with made-up numbers, follows.
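Here is that sketch. Both disks and all of their numbers are hypothetical, invented only to illustrate the decision, and the rule applied is deliberately crude: never keep a disk past its rated lifetime, and otherwise prefer the one with the lower random failure rate.

    # Hypothetical disks illustrating the MTTF vs. expected-lifetime tradeoff.
    # Neither set of numbers comes from a real datasheet.
    disks = {
        "A": {"mttf_hours": 300_000,   "rated_lifetime_years": 5},  # higher failure rate, longer lifetime
        "B": {"mttf_hours": 1_000_000, "rated_lifetime_years": 3},  # lower failure rate, shorter lifetime
    }

    def pick_disk(replacement_interval_years):
        """Crude rule of thumb: rule out any disk that would be kept past its
        rated lifetime (its wear-out region); among the rest, prefer the one
        with the higher MTTF, i.e. the lower random failure rate."""
        ok = {name: spec for name, spec in disks.items()
              if spec["rated_lifetime_years"] >= replacement_interval_years}
        if not ok:
            return None
        return max(ok, key=lambda name: ok[name]["mttf_hours"])

    print(pick_disk(4))   # "A": disk B would be deep into wear-out by year four
    print(pick_disk(2))   # "B": both survive two years, so take the lower failure rate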
792 00:39:18,240 --> 00:39:20,300 So both of these numbers are actually meaningful, 793 00:39:20,300 --> 00:39:24,111 and it depends a lot on how you're planning to use the disk. 794 00:39:24,111 --> 00:39:26,110 I mean, it's a lot like the spare tire on your car. 795 00:39:26,110 --> 00:39:28,310 The spare tire will run perfectly fine 796 00:39:28,310 --> 00:39:30,560 as long as you don't exceed 100 miles. 797 00:39:30,560 --> 00:39:32,280 And the moment you exceed 100 miles, then 798 00:39:32,280 --> 00:39:33,750 you don't want to use it at all. 799 00:39:33,750 --> 00:39:35,650 And it might be a lot cheaper to build a spare tire 800 00:39:35,650 --> 00:39:37,660 that runs just 100 miles, because as the user you 801 00:39:37,660 --> 00:39:45,910 are pretty much guaranteed that you will get to a repair shop 802 00:39:45,910 --> 00:39:53,160 within 100 miles. 803 00:39:53,160 --> 00:40:00,400 It's the same concept. 804 00:40:00,400 --> 00:40:02,024 OK. 805 00:40:02,024 --> 00:40:03,690 So one of the things that we can define, 806 00:40:03,690 --> 00:40:06,140 once we have this conditional failure rate, 807 00:40:06,140 --> 00:40:08,820 is the reliability of the system. 808 00:40:12,180 --> 00:40:16,597 We'll define that as R of t, 809 00:40:16,597 --> 00:40:18,430 the probability that the system is working 810 00:40:18,430 --> 00:40:22,520 at time t, given that it was working at time zero. 811 00:40:22,520 --> 00:40:24,840 Or, more simply, assuming that everything is always 812 00:40:24,840 --> 00:40:27,610 working at time zero, it's the probability 813 00:40:27,610 --> 00:40:37,220 that you're OK at time t. 814 00:40:37,220 --> 00:40:40,630 And it turns out that for components in the flat region 815 00:40:40,630 --> 00:40:45,160 of this curve, h of t, the conditional failure rate, is 816 00:40:45,160 --> 00:40:50,260 a constant. It turns out that for systems that satisfy that, 817 00:40:50,260 --> 00:40:53,020 and that also satisfy the property that failures 818 00:40:53,020 --> 00:40:56,070 form a memoryless process, where the probability 819 00:40:56,070 --> 00:40:58,980 of failure doesn't depend on how long the system's been running, 820 00:40:58,980 --> 00:41:00,563 conditions which disks 821 00:41:00,563 --> 00:41:03,480 apparently do satisfy 822 00:41:03,480 --> 00:41:06,900 during operation, when they're actually 823 00:41:06,900 --> 00:41:09,730 not at the right edge of the curve, 824 00:41:09,730 --> 00:41:13,170 the reliability goes 825 00:41:13,170 --> 00:41:15,520 as a very nice, simple function, which 826 00:41:15,520 --> 00:41:18,580 is an exponentially decaying function, R(t) = e to the minus 827 00:41:18,580 --> 00:41:20,200 t over MTTF. 828 00:41:20,200 --> 00:41:21,710 And this is under two conditions: 829 00:41:21,710 --> 00:41:24,340 h of t has to be flat, and the unconditional failure rate 830 00:41:24,340 --> 00:41:26,256 has to be something that doesn't depend on how 831 00:41:26,256 --> 00:41:27,880 long the system's been running. 832 00:41:27,880 --> 00:41:30,570 And for those systems, it's not hard to show 833 00:41:30,570 --> 00:41:33,110 that your reliability is just an exponentially decaying 834 00:41:33,110 --> 00:41:35,110 function, which means you can do a lot of things, 835 00:41:35,110 --> 00:41:37,530 like predict how long the system is likely to keep running, 836 00:41:37,530 --> 00:41:39,490 and so on. A small sketch of that calculation follows.
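Here is a minimal sketch of that prediction, reusing the illustrative 300,000-hour MTTF from above (not a real spec). It just evaluates R(t) = exp(-t / MTTF) at a few service ages, which is only meaningful in the flat region, before the disk reaches its rated operational lifetime.

    import math

    # Exponential reliability model: R(t) = exp(-t / MTTF).
    # Valid only in the flat region of the bathtub curve.
    HOURS_PER_YEAR = 24 * 365
    MTTF_HOURS = 300_000              # illustrative value, not a real datasheet number

    def reliability(t_hours, mttf_hours=MTTF_HOURS):
        """Probability that the component is still working at time t,
        given that it was working at time 0."""
        return math.exp(-t_hours / mttf_hours)

    for years in (1, 3, 5):
        print(f"R({years} yr) = {reliability(years * HOURS_PER_YEAR):.3f}")
    # Prints roughly 0.971, 0.916, and 0.864 for 1, 3, and 5 years of service.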
837 00:41:39,490 --> 00:41:43,760 And that will tell you when to upgrade things. 838 00:41:43,760 --> 00:41:48,090 OK, so given all of this stuff, we now 839 00:41:48,090 --> 00:41:50,610 want techniques to cope with failures, to cope with faults. 840 00:41:50,610 --> 00:41:52,318 And that's what we're going to be looking 841 00:41:52,318 --> 00:41:55,100 at for the next few lectures. Let's 842 00:41:55,100 --> 00:42:00,537 take one simple example of a system first. 843 00:42:00,537 --> 00:42:02,370 And like I said before, all of these systems 844 00:42:02,370 --> 00:42:04,390 use redundancy in some form. 845 00:42:04,390 --> 00:42:06,650 So the disk fails at a certain rate. 846 00:42:06,650 --> 00:42:09,510 Just put in multiple disks, replicate the data across them, 847 00:42:09,510 --> 00:42:10,940 and then hope that things survive. 848 00:42:14,280 --> 00:42:17,170 So the first kind of redundancy that you 849 00:42:17,170 --> 00:42:20,080 might have, in the example that I just 850 00:42:20,080 --> 00:42:24,130 talked about, is spatial redundancy, where the idea is 851 00:42:24,130 --> 00:42:27,270 that you have multiple copies of the same thing. 852 00:42:27,270 --> 00:42:29,790 And the games we're going to play all 853 00:42:29,790 --> 00:42:33,790 have to do with how we're going to manage all these copies. 854 00:42:33,790 --> 00:42:37,180 And actually, this will turn out to be quite complicated. 855 00:42:37,180 --> 00:42:40,360 We'll use these copies in a number of different ways. 856 00:42:40,360 --> 00:42:43,420 In some examples, we'll apply error-correcting codes 857 00:42:43,420 --> 00:42:48,790 to make copies of the data, or use other codes to replicate 858 00:42:48,790 --> 00:42:49,290 the data. 859 00:42:52,660 --> 00:42:54,970 We might replicate data and make copies 860 00:42:54,970 --> 00:42:57,219 of data in the form of logs, which keep track of things: 861 00:42:57,219 --> 00:42:58,510 you run an operation, 862 00:42:58,510 --> 00:42:59,780 you store some results, 863 00:42:59,780 --> 00:43:02,100 but at the same time, before you store those results, 864 00:43:02,100 --> 00:43:04,840 you also store something in a log, so that if 865 00:43:04,840 --> 00:43:07,200 the original data went away, your log can tell you what 866 00:43:07,200 --> 00:43:09,540 to do. 867 00:43:09,540 --> 00:43:13,300 Or you might just do plain and simple copies 868 00:43:13,300 --> 00:43:18,030 followed by voting. 869 00:43:18,030 --> 00:43:21,550 So the idea is that you have multiple copies of something, 870 00:43:21,550 --> 00:43:23,590 and then you write to all of them. 871 00:43:23,590 --> 00:43:25,350 In the simplest of schemes, you write to all of them, 872 00:43:25,350 --> 00:43:26,740 and then when you want to read something, 873 00:43:26,740 --> 00:43:28,110 you read from all of them. 874 00:43:28,110 --> 00:43:30,350 And then you just, what? 875 00:43:30,350 --> 00:43:31,690 Vote, and go with the majority. 876 00:43:31,690 --> 00:43:35,339 So intuitively that can tolerate a certain number of failures. 877 00:43:35,339 --> 00:43:37,130 And all of these approaches have been used. 878 00:43:37,130 --> 00:43:39,690 And people will continue to build systems 879 00:43:39,690 --> 00:43:41,930 along all of these ideas. 880 00:43:41,930 --> 00:43:44,270 But in addition, we're also going 881 00:43:44,270 --> 00:43:48,460 to look at temporal redundancy. 882 00:43:48,460 --> 00:43:51,850 And the idea here is: try it again. 883 00:43:51,850 --> 00:43:54,890 So this is different from copies.
884 00:43:54,890 --> 00:43:56,440 What it says is you try something. 885 00:43:56,440 --> 00:43:58,856 If it doesn't work, and you determine that it doesn't work, 886 00:43:58,856 --> 00:44:01,570 try it again. 887 00:44:01,570 --> 00:44:05,040 So retry is an example of a temporal trick. 888 00:44:05,040 --> 00:44:08,070 But it will turn out that we'll use not just moving forward 889 00:44:08,070 --> 00:44:10,840 and retrying something that we know should be retried; 890 00:44:10,840 --> 00:44:12,700 we'll also use the trick of undoing things 891 00:44:12,700 --> 00:44:14,130 that we have done. 892 00:44:14,130 --> 00:44:16,860 So we'll move in both directions on the time axis. 893 00:44:16,860 --> 00:44:19,170 We'll retry stuff, but at the same time 894 00:44:19,170 --> 00:44:22,630 we'll also undo things, because sometimes things have happened 895 00:44:22,630 --> 00:44:24,280 that shouldn't have happened. 896 00:44:24,280 --> 00:44:25,380 Things went halfway. 897 00:44:25,380 --> 00:44:28,170 And we really want to back things out. 898 00:44:28,170 --> 00:44:31,725 And we're going to use both of these techniques. 899 00:44:35,980 --> 00:44:41,830 So one example of spatial redundancy is a voting scheme. 900 00:44:46,360 --> 00:44:50,730 And you can apply this to many different kinds of systems. 901 00:44:50,730 --> 00:44:53,800 But let's just apply it to a simple example where 902 00:44:53,800 --> 00:44:55,970 data is stored in multiple locations. 903 00:44:55,970 --> 00:44:57,624 And then whenever data is written, 904 00:44:57,624 --> 00:44:58,790 it's written to all of them. 905 00:44:58,790 --> 00:45:01,160 And then when you read it, you read from all of them, 906 00:45:01,160 --> 00:45:02,530 and then you vote. 907 00:45:05,700 --> 00:45:07,960 And in a simple model where these components are 908 00:45:07,960 --> 00:45:10,660 fail-stop, which means that when they fail, they just fail. 909 00:45:13,210 --> 00:45:17,910 Excuse me, in a simple model where things are not 910 00:45:17,910 --> 00:45:21,550 fail-stop or fail-fast, but just report their data back to you, 911 00:45:21,550 --> 00:45:24,035 so you are voting on the results that come back. 912 00:45:24,035 --> 00:45:26,660 You've written something to them, and when you read things back, 913 00:45:26,660 --> 00:45:29,180 arbitrary values might get returned if there's a failure. 914 00:45:29,180 --> 00:45:31,830 And if there's no failure, correct values get returned. 915 00:45:31,830 --> 00:45:34,572 Then as long as two of these copies are correctly working, 916 00:45:34,572 --> 00:45:36,530 or two of these versions are correctly working, 917 00:45:36,530 --> 00:45:38,238 the vote will actually return to you 918 00:45:38,238 --> 00:45:40,034 the correct output. 919 00:45:40,034 --> 00:45:41,450 And that's the idea behind voting. 920 00:45:41,450 --> 00:45:46,500 So suppose the reliability of each of these components is some R; 921 00:45:46,500 --> 00:45:49,070 that's the probability that the component is working at time t, 922 00:45:49,070 --> 00:45:51,814 according to that definition of reliability. 923 00:45:51,814 --> 00:45:53,980 Then, under the assumption that these are completely 924 00:45:53,980 --> 00:45:55,810 independent of each other, which is a big assumption, 925 00:45:55,810 --> 00:45:57,250 particularly for software.
926 00:45:57,250 --> 00:45:59,590 But it might be a reasonable assumption for something 927 00:45:59,590 --> 00:46:01,430 like a disk. Under the assumption that these 928 00:46:01,430 --> 00:46:04,080 are completely independent, you can write out 929 00:46:04,080 --> 00:46:07,180 the reliability of this triple-voting scheme, 930 00:46:07,180 --> 00:46:11,212 this thing where you are voting on three outputs. 931 00:46:11,212 --> 00:46:12,920 You know that the system is correctly 932 00:46:12,920 --> 00:46:16,230 working if any two of these are correctly working. 933 00:46:16,230 --> 00:46:18,660 So that happens under two conditions. 934 00:46:18,660 --> 00:46:22,930 Either all three are correctly working, which happens with probability R cubed, 935 00:46:22,930 --> 00:46:25,050 or some two of the three are correctly working 936 00:46:25,050 --> 00:46:27,466 and one of them is not, and there are three ways in which you could choose the two 937 00:46:27,466 --> 00:46:29,410 out of the three, 938 00:46:29,410 --> 00:46:33,020 so that contributes 3 R squared times (1 minus R). 939 00:46:33,020 --> 00:46:34,870 And it turns out that this number 940 00:46:34,870 --> 00:46:39,870 is much larger than R when R is close to one. 941 00:46:39,870 --> 00:46:44,140 And, in general, this is bigger than R 942 00:46:44,140 --> 00:46:47,170 when each of the components has high enough reliability, 943 00:46:47,170 --> 00:46:48,290 namely, bigger than one half. 944 00:46:54,650 --> 00:46:57,390 And so, let's say that each of these components 945 00:46:57,390 --> 00:47:00,670 has a reliability of 95%. 946 00:47:00,670 --> 00:47:02,830 If you work this number out, it turns out 947 00:47:02,830 --> 00:47:04,830 to be much higher than 95%, 948 00:47:04,830 --> 00:47:06,810 much closer to one; a quick numeric check of this appears below. 949 00:47:06,810 --> 00:47:08,884 And, of course, this kind of voting 950 00:47:08,884 --> 00:47:11,050 is a bad idea if the reliability of these components 951 00:47:11,050 --> 00:47:11,840 is really low. 952 00:47:11,840 --> 00:47:14,492 I mean, if it's below one half, then chances are that 953 00:47:14,492 --> 00:47:16,700 two of them are just wrong 954 00:47:16,700 --> 00:47:18,170 and agree on that wrong result. 955 00:47:18,170 --> 00:47:22,780 And then voting turns out to reduce the reliability of the system. 956 00:47:22,780 --> 00:47:26,170 Now, in general, you might think that you can build systems out 957 00:47:26,170 --> 00:47:30,010 of this basic voting idea, but for various reasons 958 00:47:30,010 --> 00:47:32,310 it turns out that this idea has limited applicability 959 00:47:32,310 --> 00:47:34,840 for the kinds of things we want to do. 960 00:47:34,840 --> 00:47:36,950 And a lot of that stems from the fact 961 00:47:36,950 --> 00:47:39,649 that, in general, in computer systems, 962 00:47:39,649 --> 00:47:41,940 it's very hard to design components that are completely 963 00:47:41,940 --> 00:47:43,770 independent of each other. 964 00:47:43,770 --> 00:47:47,430 It might work out OK for certain hardware components, where 965 00:47:47,430 --> 00:47:49,480 you might do this voting or other forms 966 00:47:49,480 --> 00:47:51,980 of spatial redundancy that give you 967 00:47:51,980 --> 00:47:55,130 these impressive reliability numbers. 968 00:47:55,130 --> 00:47:57,490 But for software, this independence assumption 969 00:47:57,490 --> 00:47:59,230 turns out to be really hard to meet.
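Here is that numeric check, a minimal sketch assuming three completely independent replicas, each with reliability R; the majority helper and the 0.95 and 0.40 figures are just illustrative. The voted system works whenever at least two replicas work, so its reliability is R^3 + 3 R^2 (1 - R), which is about 0.993 at R = 0.95 and falls below R once R drops under one half.

    from collections import Counter

    def majority(values):
        """Return the value reported by a strict majority of replicas,
        or None if no value has a majority."""
        value, count = Counter(values).most_common(1)[0]
        return value if count > len(values) / 2 else None

    def triple_vote_reliability(r):
        """Reliability of voting over three independent replicas, each with
        reliability r: either all three work, or exactly two of the three do."""
        return r**3 + 3 * r**2 * (1 - r)

    print(majority([42, 42, 17]))                    # 42: one faulty replica is outvoted
    print(round(triple_vote_reliability(0.95), 4))   # 0.9928, noticeably better than 0.95
    print(round(triple_vote_reliability(0.40), 4))   # 0.352, worse than 0.40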
970 00:47:59,230 --> 00:48:01,230 There is, though, an approach to building software like this. 971 00:48:01,230 --> 00:48:02,646 It's called N-version programming. 972 00:48:02,646 --> 00:48:04,550 And it's still a topic of research, 973 00:48:04,550 --> 00:48:08,681 where people are trying to build software systems out of voting. 974 00:48:08,681 --> 00:48:11,180 But you have to pay a lot of attention and care to make sure 975 00:48:11,180 --> 00:48:13,270 that these software components that 976 00:48:13,270 --> 00:48:16,030 are doing the same function are actually independent, 977 00:48:16,030 --> 00:48:18,400 maybe written by different people, running 978 00:48:18,400 --> 00:48:20,320 on different operating systems, and so on. 979 00:48:20,320 --> 00:48:23,800 And that turns out to be a pretty expensive undertaking. 980 00:48:23,800 --> 00:48:25,510 It's still sometimes necessary if you 981 00:48:25,510 --> 00:48:27,400 want to build something highly reliable. 982 00:48:27,400 --> 00:48:31,620 But because of its cost, it's not something 983 00:48:31,620 --> 00:48:34,190 that serves as the sort of cookie-cutter technique 984 00:48:34,190 --> 00:48:36,862 for achieving highly reliable software systems. 985 00:48:36,862 --> 00:48:39,070 And so what we're going to see starting next time 986 00:48:39,070 --> 00:48:40,986 is a somewhat different approach to achieving 987 00:48:40,986 --> 00:48:43,742 software reliability that doesn't rely on voting. 988 00:48:43,742 --> 00:48:45,950 It won't actually achieve the same degree of reliability 989 00:48:45,950 --> 00:48:49,642 as these kinds of systems, but it will achieve 990 00:48:49,642 --> 00:48:51,600 a different kind of reliability that we'll talk 991 00:48:51,600 --> 00:48:54,310 about starting from next time.