1
00:00:00,780 --> 00:00:03,350
Good afternoon.

2
00:00:03,350 --> 00:00:06,302
So we're going to continue
our discussion about atomicity

3
00:00:06,302 --> 00:00:07,510
and how to achieve atomicity.

4
00:00:07,510 --> 00:00:09,030
And today the focus
is going to be

5
00:00:09,030 --> 00:00:11,860
on implementing this idea
called recoverability,

6
00:00:11,860 --> 00:00:15,440
which we just described
and defined the last time.

7
00:00:15,440 --> 00:00:18,710
So if you recall from
last time, the idea

8
00:00:18,710 --> 00:00:23,550
is that when you have modules
that interact with one another,

9
00:00:23,550 --> 00:00:27,590
and in this example M1 calls
M2, and M2 fails somewhere

10
00:00:27,590 --> 00:00:30,800
in the middle of this
invocation and it recovers,

11
00:00:30,800 --> 00:00:32,670
the goal here is
to try to make sure

12
00:00:32,670 --> 00:00:36,420
that the invoker of this
module, in this case M1,

13
00:00:36,420 --> 00:00:38,950
or all subsequent
invokers of M1,

14
00:00:38,950 --> 00:00:41,260
don't see any
partial results that

15
00:00:41,260 --> 00:00:45,670
were computed during this
execution when M2 failed.

16
00:00:45,670 --> 00:00:49,320
And this was the idea that
we called recoverability.

17
00:00:49,320 --> 00:00:50,900
And the definition
of recoverability

18
00:00:50,900 --> 00:00:54,100
was that an action,
which is made up

19
00:00:54,100 --> 00:00:56,340
of a composite
sequence of steps is

20
00:00:56,340 --> 00:00:59,380
recoverable from the point
of view of its invoker,

21
00:00:59,380 --> 00:01:03,480
if it looks to the invoker and
to all subsequent invokers as

22
00:01:03,480 --> 00:01:05,900
if this action either
completely occurred,

23
00:01:05,900 --> 00:01:08,770
or if it didn't completely
occur and aborted, it

24
00:01:08,770 --> 00:01:12,360
aborted in such a way that all
partial effects of that action

25
00:01:12,360 --> 00:01:14,770
were undone or backed out.

26
00:01:14,770 --> 00:01:16,270
So in other words,
recoverability is

27
00:01:16,270 --> 00:01:19,110
this idea that you
either do it all,

28
00:01:19,110 --> 00:01:21,730
either complete the action,
or do none of the action.

29
00:01:21,730 --> 00:01:23,560
But the effects
are as if you were

30
00:01:23,560 --> 00:01:26,850
able to back out of the action.

31
00:01:26,850 --> 00:01:29,510
And we use this
idea to then talk

32
00:01:29,510 --> 00:01:33,800
about a particular special
case of [NOISE OBSCURES]

33
00:01:33,800 --> 00:01:36,720
to implement a
recoverable sector, which

34
00:01:36,720 --> 00:01:39,680
is a single sector
of a disk where

35
00:01:39,680 --> 00:01:43,820
what we were able to do was to
ensure that everybody reading,

36
00:01:43,820 --> 00:01:46,710
we defined a put procedure
and a get procedure.

37
00:01:46,710 --> 00:01:48,820
So, readers wouldn't
[UNINTELLIGIBLE].

38
00:01:48,820 --> 00:01:51,750
And we ensure that
everybody doing a get

39
00:01:51,750 --> 00:01:54,470
would never see the
partial results of any put.

40
00:01:54,470 --> 00:01:57,020
So, if a failure were to
happen in the middle of a put,

41
00:01:57,020 --> 00:02:01,060
people doing a get wouldn't
see these partial results.

42
00:02:01,060 --> 00:02:05,000
And, the main idea here
was to actually maintain

43
00:02:05,000 --> 00:02:08,509
what is more generally known as
a shadow version, or a shadow

44
00:02:08,509 --> 00:02:11,690
copy, or a shadow
object of the data,

45
00:02:11,690 --> 00:02:14,620
and we maintained two
versions of the data

46
00:02:14,620 --> 00:02:16,520
that we call D0 and D1.

47
00:02:16,520 --> 00:02:21,550
And, we maintain
a sector that we

48
00:02:21,550 --> 00:02:24,410
call the chooser sector to
choose between the two shadows.

49
00:02:24,410 --> 00:02:29,250
And, what we were able to argue
was that this chooser always

50
00:02:29,250 --> 00:02:32,090
points to the version that
you want people to get from

51
00:02:32,090 --> 00:02:35,510
to read from, and so
when someone does a put,

52
00:02:35,510 --> 00:02:38,000
the idea is first to write
to the version that's

53
00:02:38,000 --> 00:02:39,540
not currently being read from.

54
00:02:39,540 --> 00:02:40,800
So if the chooser points to zero,

55
00:02:40,800 --> 00:02:45,927
then the putter would put
data, write data into one.

56
00:02:45,927 --> 00:02:48,260
And if the failure happened
in the middle of that write,

57
00:02:48,260 --> 00:02:50,010
there's no problem
because people who read

58
00:02:50,010 --> 00:02:52,050
would still read from zero.

59
00:02:52,050 --> 00:02:53,630
And we reduce this
case of proving

60
00:02:53,630 --> 00:02:55,430
this algorithm
correct to the case

61
00:02:55,430 --> 00:02:58,201
when a failure happened in the
middle of writing the chooser

62
00:02:58,201 --> 00:02:58,700
sector.

63
00:02:58,700 --> 00:03:01,820
And we were able to argue
that as long as people,

64
00:03:01,820 --> 00:03:04,070
if a failure happened in
the middle of writing here,

65
00:03:04,070 --> 00:03:05,910
either of these
versions is correct

66
00:03:05,910 --> 00:03:07,730
because a failure
by definition didn't

67
00:03:07,730 --> 00:03:10,429
happen in the middle of writing
either of these two sectors.

68
00:03:10,429 --> 00:03:12,220
And therefore you could
pick either of them

69
00:03:12,220 --> 00:03:15,250
and read from it.

70
00:03:15,250 --> 00:03:17,740
And during this
process, we came up

71
00:03:17,740 --> 00:03:20,630
with this notion which
we're going to generalize

72
00:03:20,630 --> 00:03:21,990
today called a commit point.

73
00:03:27,220 --> 00:03:30,030
The commit point is the point
at which for any action,

74
00:03:30,030 --> 00:03:33,160
the results are visible
to subsequent actions.

75
00:03:33,160 --> 00:03:35,800
And if a failure happens
before the commit point,

76
00:03:35,800 --> 00:03:37,650
then the idea is,
in general, you

77
00:03:37,650 --> 00:03:41,190
would want people not to
see the partial results that

78
00:03:41,190 --> 00:03:44,335
might have accumulated
before the failure occurred.

79
00:03:44,335 --> 00:03:46,210
And in this particular
case, the commit point

80
00:03:46,210 --> 00:03:49,770
is when the chooser sector gets
written to point to the current version

81
00:03:49,770 --> 00:03:50,420
of the data.

82
00:03:50,420 --> 00:03:54,050
And that call to writing
the chooser sector returns.

83
00:03:54,050 --> 00:03:58,250
And if it returns, then you
know that people doing a get

84
00:03:58,250 --> 00:04:00,360
will get from the version
that just got written.

85
00:04:00,360 --> 00:04:02,590
So, in the implementation
of recoverable put,

86
00:04:02,590 --> 00:04:05,960
the commit point was
when this call returned.
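
As a rough sketch of the recoverable sector just recapped, the put and get logic might look like the Python below. The class name, the in-memory list standing in for the disk sectors D0 and D1, and the `fail_before_commit` flag are all assumptions made for illustration; the sketch also does not model a failure in the middle of writing the chooser sector itself.

```python
class RecoverableSector:
    """Two shadow copies (D0, D1) plus a chooser naming the readable one."""

    def __init__(self, initial):
        self.copies = [initial, None]  # stand-ins for sectors D0 and D1
        self.chooser = 0               # index of the copy that get() reads

    def put(self, data, fail_before_commit=False):
        spare = 1 - self.chooser
        self.copies[spare] = data      # write the copy nobody reads from
        if fail_before_commit:
            # Hypothetical hook: simulate a crash before the chooser write.
            raise RuntimeError("simulated failure before commit point")
        self.chooser = spare           # commit point: chooser write returns

    def get(self):
        return self.copies[self.chooser]
```

If put fails before the chooser is updated, get still returns the old copy; once the chooser write returns, get returns the new one.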

87
00:04:08,940 --> 00:04:13,750
So now, the question for today
is how we deal with larger

88
00:04:13,750 --> 00:04:16,910
actions --

89
00:04:16,910 --> 00:04:23,140
-- because this is a plan that
works pretty well for single

90
00:04:23,140 --> 00:04:24,660
sector puts and gets.

91
00:04:24,660 --> 00:04:27,680
So, we were able to make
individual sector reads

92
00:04:27,680 --> 00:04:29,450
and writes recoverable.

93
00:04:29,450 --> 00:04:32,760
But if you think about any
serious application or even any

94
00:04:32,760 --> 00:04:35,600
[toy?] application, in most
cases you end up having more

95
00:04:35,600 --> 00:04:38,700
data than what fits
into one single sector.

96
00:04:38,700 --> 00:04:41,060
And, you might have
things touching data all

97
00:04:41,060 --> 00:04:41,920
over the place.

98
00:04:44,750 --> 00:04:50,360
And, our approach to doing this
is to actually first define

99
00:04:50,360 --> 00:04:53,120
what a programmer
must do, what somebody

100
00:04:53,120 --> 00:04:56,740
wishing to write a program
that is a recoverable action

101
00:04:56,740 --> 00:04:57,480
must do.

102
00:04:57,480 --> 00:04:59,859
And then we're going to
implement that underneath

103
00:04:59,859 --> 00:05:01,400
in a system so the
programmer doesn't

104
00:05:01,400 --> 00:05:04,760
have to worry about
implementing recoverability.

105
00:05:04,760 --> 00:05:07,940
So the idea here is
for the programmer

106
00:05:07,940 --> 00:05:11,410
of a recoverable action, to
start writing that action using

107
00:05:11,410 --> 00:05:15,660
a system call, a call that they
call begin recoverable action,

108
00:05:15,660 --> 00:05:19,230
and then discipline
herself or himself

109
00:05:19,230 --> 00:05:21,920
to write some software which
has a small number of rules

110
00:05:21,920 --> 00:05:25,120
as to what can go in here.

111
00:05:25,120 --> 00:05:27,190
And then, explicitly,
when they want

112
00:05:27,190 --> 00:05:29,680
to commit that
recoverable action,

113
00:05:29,680 --> 00:05:32,100
make its results visible
to subsequent actions,

114
00:05:32,100 --> 00:05:33,060
invoke commit.

115
00:05:36,870 --> 00:05:39,570
And then, they are allowed
to do a little bit more work,

116
00:05:39,570 --> 00:05:41,090
or a lot of work here.

117
00:05:41,090 --> 00:05:44,260
But, there are very
strict restrictions

118
00:05:44,260 --> 00:05:46,370
on what they can
do after a commit.

119
00:05:46,370 --> 00:05:52,289
And then, they can end using
end recoverable action.

120
00:05:52,289 --> 00:05:53,830
And this phase here
before the commit

121
00:05:53,830 --> 00:05:55,121
is called the pre-commit phase.

122
00:05:55,121 --> 00:05:56,680
This is the post-commit phase.

123
00:05:56,680 --> 00:05:59,200
And the idea here is
if a failure occurred

124
00:05:59,200 --> 00:06:04,730
here or an abort occurred before
the commit and this action

125
00:06:04,730 --> 00:06:06,920
was made to abort,
then the system

126
00:06:06,920 --> 00:06:09,980
must restore the state
of all of the variables,

127
00:06:09,980 --> 00:06:13,710
and all of the data that was
touched here to the same state

128
00:06:13,710 --> 00:06:15,770
before this action
even got invoked.

129
00:06:15,770 --> 00:06:17,270
OK, it's as if none
of this happened.

130
00:06:17,270 --> 00:06:20,970
So this is the none at all
part of this definition

131
00:06:20,970 --> 00:06:22,391
of recoverability.

132
00:06:22,391 --> 00:06:24,390
Once you reach this point
and the commit returns,

133
00:06:24,390 --> 00:06:26,056
the only thing you're
allowed to do here

134
00:06:26,056 --> 00:06:28,046
are things that cause
you to complete.

135
00:06:28,046 --> 00:06:29,420
You're not allowed
to abort here.

136
00:06:29,420 --> 00:06:30,920
You're not allowed
to back out here.

137
00:06:30,920 --> 00:06:33,420
So once you reach
the point, it means

138
00:06:33,420 --> 00:06:37,930
you're in the do it all part
of do it all or none at all.

139
00:06:37,930 --> 00:06:41,110
So you have to complete
all the way to the end.

140
00:06:41,110 --> 00:06:43,100
And what this really
means is that all

141
00:06:43,100 --> 00:06:45,342
of the data that you
want to manipulate,

142
00:06:45,342 --> 00:06:47,550
and all of the resources
that you want to accumulate,

143
00:06:47,550 --> 00:06:50,049
and we'll look at locks as a
resource that you would like to

144
00:06:50,049 --> 00:06:52,710
accumulate in order to
enforce isolation, which

145
00:06:52,710 --> 00:06:54,590
is a topic for
next time, all that

146
00:06:54,590 --> 00:06:57,050
has to happen here so that
once you reach this point

147
00:06:57,050 --> 00:06:59,540
and it ends, then even if
a failure occurs when it

148
00:06:59,540 --> 00:07:01,980
restarts, you just
have to crunch through

149
00:07:01,980 --> 00:07:03,900
and finish what
was going on here.

150
00:07:03,900 --> 00:07:04,970
And that can just happen.

151
00:07:04,970 --> 00:07:06,966
There's nothing to
acquire, no resources

152
00:07:06,966 --> 00:07:08,840
to get; all of the data
variables have already

153
00:07:08,840 --> 00:07:13,500
been put in their correct
situation in the correct state.

154
00:07:13,500 --> 00:07:14,960
So the interesting
part really is

155
00:07:14,960 --> 00:07:17,260
what happens between the
begin recoverable action

156
00:07:17,260 --> 00:07:19,050
and until the commit finishes.

157
00:07:19,050 --> 00:07:22,610
And that's really what
we're going to focus on.

158
00:07:22,610 --> 00:07:25,370
Now in addition to commit,
there is another call

159
00:07:25,370 --> 00:07:27,950
that we have to explicitly
think about, and that's abort.

160
00:07:31,356 --> 00:07:32,980
And there's two or
three different ways

161
00:07:32,980 --> 00:07:34,700
in which abort may be invoked.

162
00:07:34,700 --> 00:07:38,590
The first is a programmer who
might herself or himself have

163
00:07:38,590 --> 00:07:40,030
abort in their code.

164
00:07:40,030 --> 00:07:42,360
For example, in that bank
transfer application,

165
00:07:42,360 --> 00:07:45,030
if you discover that your
savings account doesn't

166
00:07:45,030 --> 00:07:48,140
have enough funds to cover
a transfer, you read it,

167
00:07:48,140 --> 00:07:49,770
and then you maybe
write something,

168
00:07:49,770 --> 00:07:52,329
and then you discover that you
don't have the funds to cover

169
00:07:52,329 --> 00:07:52,870
the transfer.

170
00:07:52,870 --> 00:07:54,380
You might just abort.

171
00:07:54,380 --> 00:07:57,460
And the semantics of
abort are that once abort

172
00:07:57,460 --> 00:07:59,420
is called by the
programmer, they

173
00:07:59,420 --> 00:08:02,720
can be guaranteed that when
the next person invokes

174
00:08:02,720 --> 00:08:07,040
a recoverable action that
involves the same data items,

175
00:08:07,040 --> 00:08:11,470
those readers will see the same
state as if this action never

176
00:08:11,470 --> 00:08:12,431
started.

177
00:08:12,431 --> 00:08:14,180
So what this means is
that the system must

178
00:08:14,180 --> 00:08:16,510
have a plan of undoing
and backing out

179
00:08:16,510 --> 00:08:18,240
of any changes that
might have occurred

180
00:08:18,240 --> 00:08:21,162
before this abort is called.

181
00:08:21,162 --> 00:08:22,620
Another reason an
abort might occur

182
00:08:22,620 --> 00:08:26,650
is that you're in a, for
example, database application,

183
00:08:26,650 --> 00:08:28,490
and you're booking
all sorts of things

184
00:08:28,490 --> 00:08:32,240
like plane tickets, and
air tickets, and hotel

185
00:08:32,240 --> 00:08:33,640
reservations, and so on.

186
00:08:33,640 --> 00:08:35,970
And you book a few
of them and then

187
00:08:35,970 --> 00:08:38,809
you discover you can't get
one of the reservations

188
00:08:38,809 --> 00:08:39,490
that you want.

189
00:08:39,490 --> 00:08:42,480
You as a user might
abort the whole transaction.

190
00:08:42,480 --> 00:08:45,970
And that causes all the
individual things that

191
00:08:45,970 --> 00:08:48,999
are in partial state to abort.

192
00:08:48,999 --> 00:08:50,540
Another reason why
abort might happen

193
00:08:50,540 --> 00:08:52,590
is that, and we'll see
this the next time when

194
00:08:52,590 --> 00:08:55,440
we talk about locking,
anytime you have locks,

195
00:08:55,440 --> 00:08:58,250
we already saw that
anytime you have locks you

196
00:08:58,250 --> 00:09:00,120
have the danger of deadlock.

197
00:09:00,120 --> 00:09:02,080
And one way in which
the system implementing

198
00:09:02,080 --> 00:09:06,780
these atomic actions, for
isolation in particular, deals

199
00:09:06,780 --> 00:09:09,950
with deadlocks is when
two or more actions are

200
00:09:09,950 --> 00:09:12,940
waiting for each other, waiting
on locks that the others hold,

201
00:09:12,940 --> 00:09:15,430
you just abort one of them,
or abort as many of them

202
00:09:15,430 --> 00:09:17,160
as needed for
progress to happen.

203
00:09:17,160 --> 00:09:19,130
So the system might
unilaterally decide

204
00:09:19,130 --> 00:09:20,404
to abort certain actions.

205
00:09:20,404 --> 00:09:22,820
And, what that means is that
the system's abort had better

206
00:09:22,820 --> 00:09:25,410
have a plan to undo all
partial changes that

207
00:09:25,410 --> 00:09:29,360
might have occurred before
it returns from abort.

208
00:09:32,090 --> 00:09:34,274
OK, so that's the general model.

209
00:09:34,274 --> 00:09:35,690
So what we're going
to do today is

210
00:09:35,690 --> 00:09:41,360
to understand what happens
when data variables are written

211
00:09:41,360 --> 00:09:44,270
inside one of these
recoverable actions:

212
00:09:44,270 --> 00:09:47,070
how commit is implemented,
and how abort is implemented.

213
00:09:47,070 --> 00:09:48,160
And that's the plan.

214
00:09:48,160 --> 00:09:49,636
And, once we do
that, we will have

215
00:09:49,636 --> 00:09:50,760
implemented recoverability.

216
00:09:50,760 --> 00:09:56,420
So we're going to study two
solutions to this problem.

217
00:09:56,420 --> 00:10:05,390
And the first solution uses an
idea called version histories.

218
00:10:05,390 --> 00:10:07,940
And version histories
really build on an idea

219
00:10:07,940 --> 00:10:11,970
that we did see last
time when we talked

220
00:10:11,970 --> 00:10:14,810
about recoverable sector,
which is this rule that we call

221
00:10:14,810 --> 00:10:17,250
the golden rule of
recoverability, which

222
00:10:17,250 --> 00:10:20,340
says never modify the only
copy because if you modify

223
00:10:20,340 --> 00:10:23,002
the only copy of something and
a failure occurs, then you don't

224
00:10:23,002 --> 00:10:24,460
really have a way
of backing it out

225
00:10:24,460 --> 00:10:27,920
because you don't know what
the original value was.

226
00:10:27,920 --> 00:10:29,560
Version histories
generalize the idea

227
00:10:29,560 --> 00:10:32,880
to say, never modify anything.

228
00:10:32,880 --> 00:10:35,140
So the idea is anytime you
want to write a variable,

229
00:10:35,140 --> 00:10:37,090
you don't actually
overwrite anything.

230
00:10:37,090 --> 00:10:38,870
You create another
version of the variable

231
00:10:38,870 --> 00:10:41,860
and somehow arrange
for the set of pointers

232
00:10:41,860 --> 00:10:44,740
that, for a variable
to point to all

233
00:10:44,740 --> 00:10:47,130
of the versions of
any given variable.

234
00:10:47,130 --> 00:10:50,480
And to understand that, we need
to understand the difference

235
00:10:50,480 --> 00:10:54,220
between conventional storage,
like a conventional variable

236
00:10:54,220 --> 00:10:59,380
that is also called a cell
store or a cell storage item,

237
00:10:59,380 --> 00:11:03,400
and a variable that allows you
to implement versions which

238
00:11:03,400 --> 00:11:08,710
we're going to call
journal-based storage.

239
00:11:08,710 --> 00:11:10,750
So, cell storage is
traditional storage.

240
00:11:10,750 --> 00:11:13,080
So if you have a variable,
X, that's cell storage

241
00:11:13,080 --> 00:11:18,250
and you set X to some value,
V, what ends up happening

242
00:11:18,250 --> 00:11:22,860
is that the cell that
contains X is you

243
00:11:22,860 --> 00:11:25,101
write the value, V, into X.

244
00:11:25,101 --> 00:11:27,600
In other words, you overwrite
whatever is there, you know,

245
00:11:27,600 --> 00:11:28,957
and replace it with V.

246
00:11:28,957 --> 00:11:31,540
And, this overwriting really is
what causes the problem if you

247
00:11:31,540 --> 00:11:35,740
don't have another copy of this
variable somehow maintained,

248
00:11:35,740 --> 00:11:38,590
overwriting means that this
rule of recoverability

249
00:11:38,590 --> 00:11:41,570
is being violated.

250
00:11:41,570 --> 00:11:45,260
We're going to use the word
install for these writes.

251
00:11:45,260 --> 00:11:48,500
So we'll be installing
items into cell stores.

252
00:11:48,500 --> 00:11:52,370
So what that means is assigning
a value to a cell store

253
00:11:52,370 --> 00:11:52,870
variable.

254
00:11:56,730 --> 00:11:59,520
And the problem is this gets
in the way of the golden rule.

255
00:11:59,520 --> 00:12:06,620
So what we're going to do is
use these cell storage items

256
00:12:06,620 --> 00:12:09,170
that we know how to build
that's the memory abstraction

257
00:12:09,170 --> 00:12:12,060
to build an expanded
version called a journal

258
00:12:12,060 --> 00:12:15,217
storage of generalized
storage in which nothing

259
00:12:15,217 --> 00:12:16,050
is ever overwritten.

260
00:12:18,615 --> 00:12:19,990
The way this works
is that if you

261
00:12:19,990 --> 00:12:26,960
have X, the very first time
you set X to some value,

262
00:12:26,960 --> 00:12:30,880
you end up creating a data
structure in cell storage

263
00:12:30,880 --> 00:12:33,550
that looks like this.

264
00:12:33,550 --> 00:12:35,480
You have a value of V1.

265
00:12:35,480 --> 00:12:38,830
And you also keep track of the
identifier of the action that

266
00:12:38,830 --> 00:12:39,460
created that.

267
00:12:39,460 --> 00:12:40,520
And, that'll turn
out to be useful

268
00:12:40,520 --> 00:12:42,810
for us to know the identifiers
of the actions that

269
00:12:42,810 --> 00:12:46,044
created any given variable.

270
00:12:46,044 --> 00:12:47,460
And how do you get
these identifiers?

271
00:12:47,460 --> 00:12:51,180
When begin RA is called,
it returns an ID, OK,

272
00:12:51,180 --> 00:12:53,170
and the system knows that.

273
00:12:53,170 --> 00:12:57,630
And this ID is available
to the program as well.

274
00:12:57,630 --> 00:13:00,880
Then the next version, if
X gets set by any action

275
00:13:00,880 --> 00:13:05,260
to a different value, what you
do is you create that as V2.

276
00:13:05,260 --> 00:13:08,740
And, you keep track of the
identifier of the action that wrote that.

277
00:13:08,740 --> 00:13:10,939
And then you got V3,
and so on, all the way.

278
00:13:10,939 --> 00:13:12,730
And the current version,
the latest version

279
00:13:12,730 --> 00:13:16,920
might be VN that
was written by IDN.

280
00:13:16,920 --> 00:13:18,570
Now if the same
action repeatedly

281
00:13:18,570 --> 00:13:21,160
writes the same variable,
you just create new versions.

282
00:13:21,160 --> 00:13:23,440
So it isn't like there's
one version per action.

283
00:13:23,440 --> 00:13:25,640
It's just that there's
one version every time

284
00:13:25,640 --> 00:13:26,580
you write something.

285
00:13:26,580 --> 00:13:30,182
So literally, nothing
is overwritten.

286
00:13:30,182 --> 00:13:30,890
And so, that's X.

287
00:13:30,890 --> 00:13:35,580
So, X itself points to the
head version, the very latest

288
00:13:35,580 --> 00:13:36,700
version that was written.

289
00:13:36,700 --> 00:13:38,130
And, you could
imagine that there

290
00:13:38,130 --> 00:13:42,075
are these pointers pulling
you back like a linked list.

291
00:13:42,075 --> 00:13:44,450
But the nice thing about it
is this is the journal store.

292
00:13:44,450 --> 00:13:46,550
So, X itself is
this whole thing.

293
00:13:50,580 --> 00:13:53,570
And, we'll implement two
calls that when you have,

294
00:13:53,570 --> 00:13:56,130
this is basically a
memory abstraction.

295
00:13:56,130 --> 00:13:59,020
So, you need to read
and you need to write.

296
00:13:59,020 --> 00:14:01,410
So, for write, we're
going to come up

297
00:14:01,410 --> 00:14:06,200
with a call called write
journal, which in the notes

298
00:14:06,200 --> 00:14:08,740
I think has a slightly
different name.

299
00:14:08,740 --> 00:14:10,640
I think they call
it write new value.

300
00:14:10,640 --> 00:14:14,410
But write journal makes it clear
that it's for journal store.

301
00:14:14,410 --> 00:14:16,280
And, this is easy.

302
00:14:16,280 --> 00:14:18,650
It's some data item, X.

303
00:14:18,650 --> 00:14:20,470
It's some value, V.

304
00:14:20,470 --> 00:14:24,620
And, it's the ID of the
action that's doing the write.

305
00:14:24,620 --> 00:14:26,080
And this is very
easy to implement.

306
00:14:26,080 --> 00:14:28,110
All you do is you
create a new version.

307
00:14:28,110 --> 00:14:32,250
And then you take the current
thing that X is pointing to,

308
00:14:32,250 --> 00:14:35,950
and make the new version's
next pointer point to that.

309
00:14:35,950 --> 00:14:38,210
And then you make X
point to the new version.

310
00:14:38,210 --> 00:14:42,080
So, it's just a
linked list thing, OK?
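
A minimal sketch of write journal as just described, assuming a hypothetical `Version` record and `JournalVar` wrapper (neither name comes from the lecture or the notes). The field called `prev` here plays the role of the pointer that leads from a newer version back to the older one.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Version:
    value: Any
    action_id: int              # ID of the action that wrote this version
    prev: Optional["Version"]   # older version, or None for the first write


class JournalVar:
    """A journal-storage variable: points at its most recent version."""

    def __init__(self):
        self.head: Optional[Version] = None


def write_journal(x: JournalVar, value: Any, action_id: int) -> None:
    # Never overwrite: create a new version that links back to the
    # current head, then make x point to the new version.
    x.head = Version(value, action_id, x.head)
```

Every write adds a version, even when the same action writes the same variable repeatedly, so nothing is ever overwritten.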

311
00:14:42,080 --> 00:14:45,000
And, in addition
to write journal,

312
00:14:45,000 --> 00:14:48,750
we obviously need to
implement read journal.

313
00:14:55,430 --> 00:14:59,270
And read journal is going to
take a data item that you wish

314
00:14:59,270 --> 00:15:02,130
to read, X, and for reasons
that will become clearer

315
00:15:02,130 --> 00:15:04,810
in a minute, it also takes
the ID of the action that

316
00:15:04,810 --> 00:15:08,220
wants to do the read, OK?

317
00:15:08,220 --> 00:15:09,980
So if you want to
read something,

318
00:15:09,980 --> 00:15:13,250
the idea is going to be the
following: the idea is going

319
00:15:13,250 --> 00:15:17,340
to be that some of these
actions are actions;

320
00:15:17,340 --> 00:15:20,550
some of these versions are
going to have been written

321
00:15:20,550 --> 00:15:23,330
by actions that were committed.

322
00:15:23,330 --> 00:15:25,540
OK, and some of
these versions are

323
00:15:25,540 --> 00:15:27,570
going to have been
written by actions

324
00:15:27,570 --> 00:15:32,080
that started writing things and
then maybe failed or aborted.

325
00:15:32,080 --> 00:15:34,887
So they never committed.

326
00:15:34,887 --> 00:15:36,470
Now, clearly when
you do read journal,

327
00:15:36,470 --> 00:15:38,761
you don't want to see the
results of those actions that

328
00:15:38,761 --> 00:15:42,050
were never committed
because what you want

329
00:15:42,050 --> 00:15:44,280
to see from the definition
that we laid out

330
00:15:44,280 --> 00:15:46,370
are once you reach
the commit point,

331
00:15:46,370 --> 00:15:48,130
you want to see the
changes visible.

332
00:15:48,130 --> 00:15:50,832
Before that, you don't
want anything visible.

333
00:15:50,832 --> 00:15:52,540
So as long as you can
keep track of which

334
00:15:52,540 --> 00:15:55,270
of these actions committed, and
which of these didn't commit,

335
00:15:55,270 --> 00:15:58,480
you can implement read journal
by starting at the most

336
00:15:58,480 --> 00:16:01,660
recent version,
and going backwards

337
00:16:01,660 --> 00:16:05,340
until you find the first
version that corresponds

338
00:16:05,340 --> 00:16:10,400
to a value that was written by
an action that was committed.

339
00:16:10,400 --> 00:16:13,670
So what you need to do is start
from here and look at IDN.

340
00:16:13,670 --> 00:16:17,200
If IDN, you need to maintain
another table that tells you

341
00:16:17,200 --> 00:16:19,664
whether IDN committed or not.

342
00:16:19,664 --> 00:16:21,330
If it committed, then
return that value.

343
00:16:21,330 --> 00:16:23,140
If not, go back one.

344
00:16:23,140 --> 00:16:25,850
And, keep going until you
find the most recent version

345
00:16:25,850 --> 00:16:29,700
that was written by
a committed action.

346
00:16:29,700 --> 00:16:31,880
If you do that, then
read journal clearly

347
00:16:31,880 --> 00:16:35,220
returns to you what
you would want,

348
00:16:35,220 --> 00:16:36,860
which is the value
that was written

349
00:16:36,860 --> 00:16:39,832
by the last committed action.

350
00:16:39,832 --> 00:16:41,540
The only other tweak
that you want to do,

351
00:16:41,540 --> 00:16:44,040
and the reason why ID is
passed as an argument to read

352
00:16:44,040 --> 00:16:46,629
journal is if the current
action has already written,

353
00:16:46,629 --> 00:16:48,420
so let's say you are
implementing an action

354
00:16:48,420 --> 00:16:51,750
and you set the
value of X to 17,

355
00:16:51,750 --> 00:16:53,750
then when you read
the value of X,

356
00:16:53,750 --> 00:16:55,369
you would want the
value that you set.

357
00:16:55,369 --> 00:16:57,660
I mean, you wouldn't want
the previous committed action's value;

358
00:16:57,660 --> 00:17:00,800
that's one way of
defining read journal.

359
00:17:00,800 --> 00:17:05,180
So as you go from the most
recent version to the oldest

360
00:17:05,180 --> 00:17:08,510
version, you look to see
whether the value that you

361
00:17:08,510 --> 00:17:11,569
are reading now is a value that
you set, your own action set.

362
00:17:11,569 --> 00:17:13,079
And if it was, just return that.

363
00:17:13,079 --> 00:17:14,912
And then, it'll return
to you the last value

364
00:17:14,912 --> 00:17:16,592
that this action set.

365
00:17:16,592 --> 00:17:18,050
Otherwise, you keep
going until you

366
00:17:18,050 --> 00:17:22,300
find the value set by the
most recent committed action.

367
00:17:22,300 --> 00:17:25,550
And since we aren't dealing here
with concurrent actions at all,

368
00:17:25,550 --> 00:17:31,056
right, we've already said last
time that, until next Monday,

369
00:17:31,056 --> 00:17:33,430
we're only going to be dealing
with one action at a time.

370
00:17:33,430 --> 00:17:35,030
There's no concurrent actions.

371
00:17:35,030 --> 00:17:37,856
Clearly, this algorithm
will be correct.

372
00:17:37,856 --> 00:17:39,480
You start from the
most recent version,

373
00:17:39,480 --> 00:17:42,530
keep going until you find the
first version that was either

374
00:17:42,530 --> 00:17:44,740
[done?] by this action
that's doing the read,

375
00:17:44,740 --> 00:17:49,820
or the first version that
was written by an action that

376
00:17:49,820 --> 00:17:51,740
committed.

377
00:17:51,740 --> 00:17:55,540
So, clearly what this means
is that you need a table

378
00:17:55,540 --> 00:17:59,010
that you have to maintain
that stores the status

379
00:17:59,010 --> 00:18:00,140
of these different actions.

380
00:18:00,140 --> 00:18:02,410
It needs to store which
actions committed,

381
00:18:02,410 --> 00:18:04,355
and which actions didn't commit.

382
00:18:04,355 --> 00:18:06,730
And that's going to be done
using a data structure called

383
00:18:06,730 --> 00:18:07,730
the commit record table.

384
00:18:12,000 --> 00:18:13,980
And this is a very simple table.

385
00:18:13,980 --> 00:18:16,990
It just has ID1,
ID2, all the way down

386
00:18:16,990 --> 00:18:18,300
to whatever ID's you have.

387
00:18:18,300 --> 00:18:22,130
Every time somebody calls begin
RA, you return them an ID,

388
00:18:22,130 --> 00:18:25,222
and then you create
this table that as soon

389
00:18:25,222 --> 00:18:27,680
as they create this action,
you set their state to pending,

390
00:18:27,680 --> 00:18:31,090
which I'll call P, OK?

391
00:18:31,090 --> 00:18:35,190
And, any time an action
commits, you replace this P

392
00:18:35,190 --> 00:18:38,160
with a C, which is
a commit record.

393
00:18:38,160 --> 00:18:41,960
OK, and once it's replaced
with a C for an action,

394
00:18:41,960 --> 00:18:47,130
this item is called the
commit record for an action.

395
00:18:47,130 --> 00:18:49,650
So now, when you want
to do read journal

396
00:18:49,650 --> 00:18:52,442
and you're looking to see
whether for any given action,

397
00:18:52,442 --> 00:18:54,400
things were committed,
the corresponding action

398
00:18:54,400 --> 00:18:56,460
is committed or not,
you look at this.

399
00:18:56,460 --> 00:18:57,360
You see its IDN.

400
00:18:57,360 --> 00:19:00,300
You look for IDN in this table,
see if it's committed or not.
C, if it's committed or not.

401
00:19:00,300 --> 00:19:02,980
If it's not committed, then
you go to the previous version

402
00:19:02,980 --> 00:19:04,310
and you do the same thing.

403
00:19:04,310 --> 00:19:10,400
If it's committed,
then you return it.
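
The read journal scan can be sketched roughly as below, under a few stated assumptions: the version history is held as a plain Python list of `(value, writer_id)` pairs, oldest first, rather than a linked list; the commit record table is a dict from action ID to `"P"` or `"C"`; and only one action runs at a time, as in this lecture. Returning `None` when nothing qualifies is also an assumption, since the lecture does not say what happens in that case.

```python
PENDING, COMMITTED = "P", "C"


def read_journal(versions, action_id, commit_records):
    """Return the value the given action should see for one variable.

    versions: list of (value, writer_id) pairs, oldest first.
    commit_records: dict mapping action ID to PENDING or COMMITTED.
    """
    # Scan from the most recent version back toward the oldest.
    for value, writer_id in reversed(versions):
        if writer_id == action_id:
            return value   # the reader's own most recent write
        if commit_records.get(writer_id) == COMMITTED:
            return value   # most recent write by a committed action
    return None            # no visible version (assumed behaviour)
```

For example, with versions written by actions 1, 2, and 3, where only action 1 has committed, action 3 reads its own value, while any other reader sees the value written by action 1.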

404
00:19:10,400 --> 00:19:12,910
Now, it's not actually clear
why you need this pending thing

405
00:19:12,910 --> 00:19:13,410
here.

406
00:19:13,410 --> 00:19:16,740
But it'll turn out that you will
require the pending thing when

407
00:19:16,740 --> 00:19:18,350
you deal with
isolation on Monday.

408
00:19:18,350 --> 00:19:20,530
So for now, you don't have
to worry about the fact

409
00:19:20,530 --> 00:19:24,350
that these pending
things are there, OK?

410
00:19:24,350 --> 00:19:28,990
Now, suppose an action
starts, and then it aborts.

411
00:19:28,990 --> 00:19:31,780
So I mentioned here that when
an action starts and it aborts,

412
00:19:31,780 --> 00:19:34,890
the system has to do some kind
of undoing of data in order

413
00:19:34,890 --> 00:19:36,560
for abort to be
correctly implemented.

414
00:19:36,560 --> 00:19:38,910
So, the state of the system's
restored to the state

415
00:19:38,910 --> 00:19:42,250
before the action even started.

416
00:19:42,250 --> 00:19:44,875
The nice thing about this way of
implementing version histories

417
00:19:44,875 --> 00:19:46,291
and read journal
is you don't have

418
00:19:46,291 --> 00:19:47,440
to do anything on an abort.

419
00:19:50,080 --> 00:19:53,250
If the application or
the system called abort,

420
00:19:53,250 --> 00:19:56,900
nothing has to be done because
read journal basically is just

421
00:19:56,900 --> 00:19:59,260
scanning this
backward, looking

422
00:19:59,260 --> 00:20:01,770
for whether the version
was written by itself,

423
00:20:01,770 --> 00:20:05,080
that same action or looking for
whether the version was written

424
00:20:05,080 --> 00:20:06,750
by a committed action.

425
00:20:06,750 --> 00:20:09,350
So as long as you can
find for any given ID

426
00:20:09,350 --> 00:20:12,960
whether it was committed or
not, that's all you need.

427
00:20:12,960 --> 00:20:16,880
OK, but just for completeness,
and this will become useful

428
00:20:16,880 --> 00:20:22,960
the next time, all we'll do when
abort is called on an action,

429
00:20:22,960 --> 00:20:25,760
so abort takes the ID of
the action as an argument,

430
00:20:25,760 --> 00:20:29,940
all we'll do is we'll
replace, if ID7 aborts,

431
00:20:29,940 --> 00:20:32,340
we'll just replace the pending.

432
00:20:32,340 --> 00:20:35,510
We'll replace that
with an abort, OK?

433
00:20:35,510 --> 00:20:38,440
So, this commit
record table contains

434
00:20:38,440 --> 00:20:40,100
the status of the actions.

435
00:20:40,100 --> 00:20:44,990
And that status could either be
committed, pending, or aborted.

436
00:20:44,990 --> 00:20:46,270
When it starts, it's pending.

437
00:20:46,270 --> 00:20:51,350
And then it stays pending until
either it aborts, in which case

438
00:20:51,350 --> 00:20:54,010
it's aborted, or it commits.

439
00:20:54,010 --> 00:20:56,832
Now, if it just fails and you
don't do anything about it,

440
00:20:56,832 --> 00:20:58,540
and there's no abort
call, it'll continue

441
00:20:58,540 --> 00:21:00,750
to remain in the pending state.

442
00:21:00,750 --> 00:21:02,870
But that's OK because
we're never really going

443
00:21:02,870 --> 00:21:04,896
to read the value
of anything that's

444
00:21:04,896 --> 00:21:07,270
in the pending state that
was set by an action that's

445
00:21:07,270 --> 00:21:08,145
in the pending state.
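
[Editor's sketch] The commit record table and the backward scan in read journal could be sketched roughly as below. This is a minimal in-memory illustration, not the lecture's implementation: the class name JournalStore, the method names, and the use of Python dictionaries are assumptions made for the example. It ignores crashes during the writes themselves.

```python
# Hypothetical in-memory sketch of journal storage with a commit record table.
PENDING, COMMITTED, ABORTED = "P", "C", "A"


class JournalStore:
    def __init__(self):
        self.status = {}    # commit record table: action id -> "P", "C", or "A"
        self.versions = {}  # variable name -> list of (action id, value), oldest first
        self.next_id = 1

    def begin_ra(self):
        # Hand out a fresh ID and mark the new action as pending.
        action_id = self.next_id
        self.next_id += 1
        self.status[action_id] = PENDING
        return action_id

    def write_journal(self, action_id, name, value):
        # Never overwrite: append a new version tagged with the writer's ID.
        self.versions.setdefault(name, []).append((action_id, value))

    def read_journal(self, action_id, name):
        # Scan from the newest version backward; accept a version written by
        # this same action or by any committed action.
        for writer, value in reversed(self.versions.get(name, [])):
            if writer == action_id or self.status[writer] == COMMITTED:
                return value
        return None

    def commit(self, action_id):
        self.status[action_id] = COMMITTED

    def abort(self, action_id):
        # Nothing is undone; only the status changes from pending to aborted.
        self.status[action_id] = ABORTED


store = JournalStore()
a = store.begin_ra()
store.write_journal(a, "x", 17)
store.commit(a)
b = store.begin_ra()
store.write_journal(b, "x", 25)  # b is still pending
c = store.begin_ra()
```

In this sketch, action c reading x skips the pending version written by b and falls back to the committed value, while b itself sees its own write.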

446
00:21:12,210 --> 00:21:13,105
So is this clear?

447
00:21:16,280 --> 00:21:18,430
OK, this approach is
actually quite reasonable

448
00:21:18,430 --> 00:21:20,520
except that it has
a few problems.

449
00:21:20,520 --> 00:21:25,720
The first problem
it has is, well, it

450
00:21:25,720 --> 00:21:26,910
has two related problems.

451
00:21:26,910 --> 00:21:29,370
And that's the first class
of problems that it has is

452
00:21:29,370 --> 00:21:31,210
that although it looks
like we've really

453
00:21:31,210 --> 00:21:35,882
nailed this problem of achieving
recoverable storage using

454
00:21:35,882 --> 00:21:37,340
this journal storage
idea, building

455
00:21:37,340 --> 00:21:40,340
general recoverable actions so
that for any variable that's

456
00:21:40,340 --> 00:21:45,550
read inside here or read
inside a recoverable action,

457
00:21:45,550 --> 00:21:47,970
you use this journal
storage idea.

458
00:21:47,970 --> 00:21:50,470
It's not quite correct
because you have to ask,

459
00:21:50,470 --> 00:21:55,840
what happens if the system
fails while the system is

460
00:21:55,840 --> 00:21:57,610
writing this commit record?

461
00:21:57,610 --> 00:21:59,260
So, the application
calls commit.

462
00:21:59,260 --> 00:22:01,910
The system's starting to
write this commit record

463
00:22:01,910 --> 00:22:04,620
and it fails.

464
00:22:04,620 --> 00:22:06,350
Or you might more
generally ask, what

465
00:22:06,350 --> 00:22:11,300
happens if I create this new
version in write journal,

466
00:22:11,300 --> 00:22:14,280
and as I'm creating a new
version of a variable,

467
00:22:14,280 --> 00:22:15,180
the system crashes.

468
00:22:15,180 --> 00:22:17,580
So some garbage
got written here.

469
00:22:17,580 --> 00:22:21,360
Or more likely, some garbage
got written not in here

470
00:22:21,360 --> 00:22:23,791
but as I was changing
this pointer for X

471
00:22:23,791 --> 00:22:25,290
to point to the
most recent version,

472
00:22:25,290 --> 00:22:26,190
some garbage got written.

473
00:22:26,190 --> 00:22:28,230
So, all subsequent reads
of X don't quite work.

474
00:22:31,730 --> 00:22:34,000
The answer to this
question is that we

475
00:22:34,000 --> 00:22:37,257
know how to solve this problem
because that question is

476
00:22:37,257 --> 00:22:38,090
basically identical.

477
00:22:38,090 --> 00:22:39,674
Both of these are identical.

478
00:22:39,674 --> 00:22:41,590
If we know how to solve
the problem of writing

479
00:22:41,590 --> 00:22:44,640
a single, recoverable sector,
a single, small item of data,

480
00:22:44,640 --> 00:22:47,070
then we know how to solve
these two problems because both

481
00:22:47,070 --> 00:22:50,520
of these are writing recoverably
a small amount of data.

482
00:22:50,520 --> 00:22:52,530
In one case, a
pointer that takes

483
00:22:52,530 --> 00:22:55,930
X to point to the most recent
version, in another case

484
00:22:55,930 --> 00:22:59,740
it's a single data
item that corresponds

485
00:22:59,740 --> 00:23:03,240
to the commit record in
this commit record table.

486
00:23:03,240 --> 00:23:08,140
And so this shows this
idea of bootstrap,

487
00:23:08,140 --> 00:23:11,610
that in order to build
this atomic action,

488
00:23:11,610 --> 00:23:14,975
this recoverable action, we end
up [SOUND OFF/THEN ON] and then

489
00:23:14,975 --> 00:23:17,350
you bootstrap on something
that we know already how to do

490
00:23:17,350 --> 00:23:17,460
because there are these cases
where you have to make sure

491
00:23:17,460 --> 00:23:17,520
that it writes to
certain pointers,

492
00:23:17,520 --> 00:23:17,600
and some table items
are done [commonly?].

493
00:23:17,600 --> 00:23:17,720
And we know how to do that
because we just told you

494
00:23:17,720 --> 00:23:17,780
how to do recoverable sectors.

495
00:23:17,780 --> 00:23:17,990
And you could just take
[UNINTELLIGIBLE] objects

496
00:23:17,990 --> 00:23:18,050
for these items, and
[UNINTELLIGIBLE PHRASE]

497
00:23:18,050 --> 00:23:18,470
to get this bootstrap.

498
00:23:18,470 --> 00:23:18,560
So that's the first thing,
the first [step problem?].

499
00:23:18,560 --> 00:23:18,650
There's another problem, not
so much a correctness problem,

500
00:23:18,650 --> 00:23:18,730
but a problem in general
using these version

501
00:23:18,730 --> 00:23:18,800
histories in order to
build recoverable actions.

502
00:23:18,800 --> 00:23:21,220
Any ideas on what that might be?

503
00:23:21,220 --> 00:23:25,490
Like, why wouldn't we
want to use this?

504
00:23:25,490 --> 00:23:27,630
Is it space?

505
00:23:27,630 --> 00:23:32,440
Well, you kind of can't
really get around that.

506
00:23:32,440 --> 00:23:37,250
I mean, it's true that
there are these older

507
00:23:37,250 --> 00:23:39,920
versions that you keep forever.

508
00:23:39,920 --> 00:23:43,130
But, there are
organizations you can

509
00:23:43,130 --> 00:23:45,800
bring to bear that's
[UNINTELLIGIBLE]

510
00:23:45,800 --> 00:23:51,140
beneath these old versions that
you don't really care about

511
00:23:51,140 --> 00:23:53,820
anymore because really
the [UNINTELLIGIBLE]

512
00:23:53,820 --> 00:23:57,020
requires, at least for
[UNINTELLIGIBLE PHRASE]

513
00:23:57,020 --> 00:24:01,300
about this when we talk
about isolation tomorrow.

514
00:24:01,300 --> 00:24:03,970
But really, the
[UNINTELLIGIBLE] only

515
00:24:03,970 --> 00:24:07,180
requires for a
single action case

516
00:24:07,180 --> 00:24:09,310
the last committed version.

517
00:24:09,310 --> 00:24:14,660
So, you could garbage collect
this stuff if you want.

518
00:24:14,660 --> 00:24:16,800
Yeah, it's really slow.

519
00:24:16,800 --> 00:24:20,000
So, for applications
where you care

520
00:24:20,000 --> 00:24:22,670
about performance, a
reasonable performance,

521
00:24:22,670 --> 00:24:25,880
[UNINTELLIGIBLE PHRASE]
this is really slow.

522
00:24:25,880 --> 00:24:30,160
And naturally, it's
not to say that this

523
00:24:30,160 --> 00:24:36,570
is a bad idea, an idea that
shouldn't be used at all.

524
00:24:36,570 --> 00:24:41,910
In fact, it's a perfectly
good idea for many cases

525
00:24:41,910 --> 00:24:45,120
where you might,
for various reasons,

526
00:24:45,120 --> 00:24:49,390
want to store restorative
records of old data

527
00:24:49,390 --> 00:24:54,740
and you don't care about fast
read or write performance.

528
00:24:54,740 --> 00:24:58,480
So it's perfectly good
for certain applications.

529
00:24:58,480 --> 00:25:02,654
But it's not good for
applications that want

530
00:25:02,654 --> 00:25:03,820
reasonably high-performance.

531
00:25:03,820 --> 00:25:08,100
And the reason that
this thing is slow

532
00:25:08,100 --> 00:25:11,840
is because if you
think about it,

533
00:25:11,840 --> 00:25:17,720
it actually optimizes what you
might think of as the uncommon case

534
00:25:17,720 --> 00:25:24,130
because what it ensures is that
when you fail and you recover,

535
00:25:24,130 --> 00:25:27,340
you have to do no work.

536
00:25:27,340 --> 00:25:32,150
So crash recovery is really
fast in this approach

537
00:25:32,150 --> 00:25:35,830
because there's nothing to
be done for crash recovery.

538
00:25:35,830 --> 00:25:39,482
But reads and writes are
slow because a read involves

539
00:25:39,482 --> 00:25:40,440
[traversing?] the list.

540
00:25:40,440 --> 00:25:44,070
A write involves
[UNINTELLIGIBLE  PHRASE].

541
00:25:44,070 --> 00:25:46,770
And so, it almost
optimizes the opposite

542
00:25:46,770 --> 00:25:47,730
of what you would want.

543
00:25:47,730 --> 00:25:49,105
If you want good
performance,

544
00:25:49,105 --> 00:25:51,740
you want to follow the principle
of optimizing the common case.

545
00:25:51,740 --> 00:25:59,480
And in order to optimize the
common case, what it means,

546
00:25:59,480 --> 00:26:08,760
what you want to do here is
to make the reads and writes

547
00:26:08,760 --> 00:26:13,760
really fast, and
maybe pay the penalty

548
00:26:13,760 --> 00:26:20,180
of a little bit of
extra time in doing

549
00:26:20,180 --> 00:26:21,610
[UNINTELLIGIBLE PHRASE].

550
00:26:21,610 --> 00:26:23,750
It's working now?

551
00:26:23,750 --> 00:26:25,180
[LAUGHTER] Hello?

552
00:26:25,180 --> 00:26:27,320
All right, thanks.

553
00:26:27,320 --> 00:26:32,910
OK, so what you want to do
is optimize, whoa, it's loud.

554
00:26:32,910 --> 00:26:36,100
The integral of the volume
over time is correct.

555
00:26:43,670 --> 00:26:47,450
OK, so the solution
to this problem

556
00:26:47,450 --> 00:26:49,920
where we want to optimize
the common case of reads

557
00:26:49,920 --> 00:26:54,010
and writes, but we are
OK taking a bunch of time

558
00:26:54,010 --> 00:26:56,870
to do crash recovery is
an idea called logging.

559
00:27:04,890 --> 00:27:09,360
So the way to think of a log
is it's like a version history

560
00:27:09,360 --> 00:27:13,860
except you don't have a
version for each variable.

561
00:27:13,860 --> 00:27:17,930
You think of it as an interleaved
version data structure

562
00:27:17,930 --> 00:27:21,520
that interleaves all the version
histories for all of the data

563
00:27:21,520 --> 00:27:24,860
that was ever written
during an action,

564
00:27:24,860 --> 00:27:27,100
during all of the
actions that ran.

565
00:27:27,100 --> 00:27:30,042
So what this means is that you
can write the log sequentially.

566
00:27:30,042 --> 00:27:31,750
And you've seen this
in yesterday's paper

567
00:27:31,750 --> 00:27:34,120
where they use logs for
a different application

568
00:27:34,120 --> 00:27:37,930
for high performance in a
file system for a system

569
00:27:37,930 --> 00:27:43,081
where writes normally
would incur a lot of seeks.

570
00:27:43,081 --> 00:27:44,330
But you can use the same idea.

571
00:27:44,330 --> 00:27:48,070
In this case, we're going to
use a log for crash recovery.

572
00:27:48,070 --> 00:27:50,670
But the fundamental property
of a log data structure

573
00:27:50,670 --> 00:27:53,760
is that it needs to be
written only sequentially.

574
00:27:53,760 --> 00:27:56,090
And we know that disks
do that pretty fast.

575
00:27:56,090 --> 00:27:58,170
It's only when you have
to seek and read

576
00:27:58,170 --> 00:28:00,600
small chunks of data with
seeks that you end up

577
00:28:00,600 --> 00:28:03,580
being really slow.

578
00:28:03,580 --> 00:28:08,360
So we're going to use cell
storage to satisfy our reads

579
00:28:08,360 --> 00:28:10,912
and writes.

580
00:28:10,912 --> 00:28:12,870
So all of those are going
to go to cell stores.

581
00:28:12,870 --> 00:28:14,911
A read means
you just read a variable.

582
00:28:14,911 --> 00:28:16,930
You don't traverse any
linked lists. And for writes,

583
00:28:16,930 --> 00:28:18,350
You don't create
any new versions.

584
00:28:18,350 --> 00:28:22,720
You just write into cell store.

585
00:28:22,720 --> 00:28:32,540
But then the log is going to be
stored on a nonvolatile medium

586
00:28:32,540 --> 00:28:33,620
such as a disk.

587
00:28:33,620 --> 00:28:36,690
And it's written sequentially.
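
[Editor's sketch] An append-only log on nonvolatile storage might look roughly like this. It is a sketch under stated assumptions: the class name AppendOnlyLog, the JSON-lines encoding, and the record fields are all invented for illustration; the lecture does not prescribe a file format. The only calls used are standard Python file operations plus os.fsync to push the bytes to the device.

```python
import json
import os
import tempfile


class AppendOnlyLog:
    """Hypothetical log file that is only ever appended to, one JSON record per line."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # Open in append mode so every write lands at the end of the file.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to force the data to disk

    def records(self):
        # Read the log back in the order it was written.
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            return [json.loads(line) for line in f if line.strip()]


log_path = os.path.join(tempfile.mkdtemp(), "actions.log")
log = AppendOnlyLog(log_path)
log.append({"action": 1, "var": "x", "new": 25})
log.append({"action": 2, "var": "y", "new": 3})
```

Records from different actions and different variables end up interleaved in one sequential file, which is what distinguishes the log from a per-variable version history.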

588
00:28:45,500 --> 00:28:52,390
So once we have those two, our
plan is going to be as follows.

589
00:28:52,390 --> 00:28:55,440
And this plan is the
same plan that's adopted.

590
00:28:55,440 --> 00:28:58,450
Although there are
dozens of ways of doing

591
00:28:58,450 --> 00:29:01,340
log-based crash recovery,
they all essentially follow

592
00:29:01,340 --> 00:29:04,740
the same basic plan.

593
00:29:04,740 --> 00:29:07,180
You read and write
normally to cell storage.

594
00:29:07,180 --> 00:29:09,865
And you also write
a copy of what

595
00:29:09,865 --> 00:29:10,990
you're reading and writing.

596
00:29:10,990 --> 00:29:13,180
You write an encoding
of what you're writing,

597
00:29:13,180 --> 00:29:15,970
any updates that you
make into the log.

598
00:29:15,970 --> 00:29:18,130
OK, and we'll talk
in more detail

599
00:29:18,130 --> 00:29:20,050
about what exactly you
write into the log

600
00:29:20,050 --> 00:29:22,530
and when you write
into the log, OK?

601
00:29:22,530 --> 00:29:25,460
So that allows us to follow this
golden rule of recoverability.

602
00:29:25,460 --> 00:29:28,050
It'll turn out that the
log is a copy of the data.

603
00:29:28,050 --> 00:29:30,800
So you always have two copies of
the data: one in cell storage,

604
00:29:30,800 --> 00:29:31,690
one on the log.

605
00:29:36,170 --> 00:29:39,340
So what happens when you fail?

606
00:29:39,340 --> 00:29:42,080
Well, when you fail, unlike in
the version history case where

607
00:29:42,080 --> 00:29:45,900
you could fail and restart, and
you don't have to do anything,

608
00:29:45,900 --> 00:29:52,750
here when you fail, the system
runs a recovery procedure.

609
00:29:52,750 --> 00:29:55,480
And that recovery procedure
recovers from the log

610
00:29:55,480 --> 00:29:57,384
that we have conveniently
arranged to write

611
00:29:57,384 --> 00:29:58,550
in the non-volatile storage.

612
00:29:58,550 --> 00:30:01,070
So, it remains
even after a crash,

613
00:30:01,070 --> 00:30:05,129
and it remains after
a crash recovers.

614
00:30:05,129 --> 00:30:07,670
And there are two things to do
while recovering from the log.

615
00:30:10,460 --> 00:30:15,819
For actions that didn't get to
finish the commit, for actions

616
00:30:15,819 --> 00:30:17,860
that were uncommitted,
which is when commit never

617
00:30:17,860 --> 00:30:21,230
returned, what we have to
do is to look carefully

618
00:30:21,230 --> 00:30:26,100
to see whether the corresponding
cell store had any updates that

619
00:30:26,100 --> 00:30:27,360
were made to it.

620
00:30:27,360 --> 00:30:28,900
And it'll turn out
that the log is

621
00:30:28,900 --> 00:30:31,660
going to help us keep track
of what items were updated

622
00:30:31,660 --> 00:30:33,160
by any given action.

623
00:30:33,160 --> 00:30:35,220
And what we're going
to end up doing

624
00:30:35,220 --> 00:30:40,150
is for uncommitted actions,
we're going to back out.

625
00:30:43,969 --> 00:30:46,510
In other words, we're going to
undo any changes that it made,

626
00:30:46,510 --> 00:30:48,176
and the log is going
to help us do that.

627
00:30:51,470 --> 00:30:54,330
And conversely, for
committed actions,

628
00:30:54,330 --> 00:30:57,320
because the semantics we
want are that once committed,

629
00:30:57,320 --> 00:31:01,490
you would like the changes to
be visible to other people.

630
00:31:01,490 --> 00:31:03,610
For committed actions,
what you would like to do

631
00:31:03,610 --> 00:31:05,630
is to make sure
that the changes made

632
00:31:05,630 --> 00:31:07,820
by all committed
actions are in fact

633
00:31:07,820 --> 00:31:10,880
installed in the cell store.

634
00:31:10,880 --> 00:31:12,860
And what this means is
that if they turn out

635
00:31:12,860 --> 00:31:14,937
to not have been
installed, and we're

636
00:31:14,937 --> 00:31:17,520
going to use the log to tell us
whether they've been installed

637
00:31:17,520 --> 00:31:19,640
or not, we will
redo those actions.

638
00:31:25,720 --> 00:31:27,820
And, the second
thing we need to do

639
00:31:27,820 --> 00:31:31,720
is what happens if
an abort is called

640
00:31:31,720 --> 00:31:34,880
either by the application
or by the system.

641
00:31:34,880 --> 00:31:40,310
Well, in this case, what we
have to do is to use the log,

642
00:31:40,310 --> 00:31:41,780
and to keep track,
the log is going

643
00:31:41,780 --> 00:31:44,860
to help us keep track of the
changes made by this action

644
00:31:44,860 --> 00:31:46,760
to the cell store.

645
00:31:46,760 --> 00:31:49,459
The cell store itself doesn't
have a notion of old or new

646
00:31:49,459 --> 00:31:50,500
because it's overwritten.

647
00:31:50,500 --> 00:31:52,270
So the log is going
to tell us that.

648
00:31:52,270 --> 00:31:53,890
And when abort is
called, we just

649
00:31:53,890 --> 00:31:58,120
want to back out by undoing the
changes of the current action.

650
00:32:04,447 --> 00:32:05,280
And that's the plan.
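
[Editor's sketch] The abort part of the plan, backing out the current action's changes using the log, could be illustrated as follows. This assumes each log entry records the action ID, the variable, the old value, and the new value; that layout and the function names write and abort are assumptions for the example, and only a single running action is considered.

```python
# Hypothetical cell store and in-memory log for a single action.
cells = {"x": 17, "y": 1}
log = []  # entries: (action id, variable, old value, new value)


def write(action_id, name, value):
    # Record the old value in the log, then overwrite the cell in place.
    log.append((action_id, name, cells.get(name), value))
    cells[name] = value


def abort(action_id):
    # Walk the log from newest to oldest, restoring the old value of every
    # cell this action touched. Going backward matters when the action wrote
    # the same cell more than once.
    for writer, name, old, _new in reversed(log):
        if writer == action_id:
            cells[name] = old


write(7, "x", 25)
write(7, "y", 2)
write(7, "x", 30)
abort(7)
```

After the abort, the cell store holds the values it had before action 7 started, even though x was overwritten twice.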

651
00:32:09,154 --> 00:32:10,820
So the first thing
we need to figure out

652
00:32:10,820 --> 00:32:12,070
is what this log looks like.

653
00:32:16,310 --> 00:32:18,180
So as we saw from
this discussion,

654
00:32:18,180 --> 00:32:21,120
the log is going to be required
for us to do two things.

655
00:32:21,120 --> 00:32:23,520
We're going to be undoing
things from the log,

656
00:32:23,520 --> 00:32:28,260
and we're going to be
redoing things from the log.

657
00:32:28,260 --> 00:32:32,440
So what that suggests is that
any time you update cell store,

658
00:32:32,440 --> 00:32:34,930
you change X from 17 to 25.

659
00:32:34,930 --> 00:32:36,980
What you'd really
like to maintain

660
00:32:36,980 --> 00:32:40,520
is what the value was before the
change was made so that you can

661
00:32:40,520 --> 00:32:43,280
undo if you need to,
and what the value

662
00:32:43,280 --> 00:32:46,940
is after the change was made so
that you can redo if you have

663
00:32:46,940 --> 00:32:50,230
to if by chance the
actual cell store didn't

664
00:32:50,230 --> 00:32:51,850
get written at the right time.

665
00:32:51,850 --> 00:32:54,450
So really the way to think
about log-based crash

666
00:32:54,450 --> 00:32:56,420
recovery is that
the log is really

667
00:32:56,420 --> 00:32:59,100
the authoritative
version of the data.

668
00:32:59,100 --> 00:33:01,970
The cell store itself you
should think of as a cache.

669
00:33:01,970 --> 00:33:03,424
And we've seen this idea before.

670
00:33:03,424 --> 00:33:05,340
The cell store you should
think of as a cache.

671
00:33:05,340 --> 00:33:07,630
If a failure happens,
you really have

672
00:33:07,630 --> 00:33:09,730
to be careful about
trusting the cell store.

673
00:33:09,730 --> 00:33:12,280
And, you don't trust
what's in the cell store.

674
00:33:12,280 --> 00:33:15,050
You start with a log,
and by selectively

675
00:33:15,050 --> 00:33:16,890
undoing certain
changes that were made

676
00:33:16,890 --> 00:33:19,090
and redoing certain
changes, you produce

677
00:33:19,090 --> 00:33:22,970
a more pristine, correct version
of the data, which corresponds

678
00:33:22,970 --> 00:33:25,430
to the changes made by all
the committed actions being

679
00:33:25,430 --> 00:33:29,120
visible, and the changes made
by all the uncommitted actions

680
00:33:29,120 --> 00:33:35,290
being wiped away to
the previous version.

681
00:33:35,290 --> 00:33:37,040
OK, so what does
the log look like?

682
00:33:37,040 --> 00:33:40,200
Well, as I've already said, a
log is like a version history

683
00:33:40,200 --> 00:33:41,900
except it interleaves
everything,

684
00:33:41,900 --> 00:33:42,960
and it's sequential.

685
00:33:42,960 --> 00:33:44,870
So it's really an
append-only data structure.

686
00:33:48,410 --> 00:33:58,240
And there's a few
different kinds of records

687
00:33:58,240 --> 00:34:00,110
that the log maintains.

688
00:34:00,110 --> 00:34:03,480
In particular, two are going
to be interesting to us.

689
00:34:03,480 --> 00:34:10,550
So there are two types of
records that we care about.

690
00:34:10,550 --> 00:34:14,860
The first type are
update records,

691
00:34:14,860 --> 00:34:18,750
which are written
to the log whenever

692
00:34:18,750 --> 00:34:22,050
a cell store item changes.

693
00:34:22,050 --> 00:34:25,860
So, if X goes from 17 to 25,
what you would write

694
00:34:25,860 --> 00:34:27,850
is an update record
that looks like this.

695
00:34:27,850 --> 00:34:31,989
You store the ID
of the transaction,

696
00:34:31,989 --> 00:34:35,560
sorry, ID of the recoverable
action that did the update.

697
00:34:35,560 --> 00:34:38,850
And then, you store two items.

698
00:34:38,850 --> 00:34:42,960
One of them is an undo item
or an undo action, actually.

699
00:34:42,960 --> 00:34:49,140
And, an undo that might
[save/say?], and a redo action.

700
00:34:54,610 --> 00:34:57,070
So what this means
here is that let's say

701
00:34:57,070 --> 00:34:59,550
that the actual
step of this action

702
00:34:59,550 --> 00:35:04,900
said X is assigned
to some value, new.

703
00:35:04,900 --> 00:35:06,500
In the log, what
you would write is

704
00:35:06,500 --> 00:35:09,220
keep track of old value,
the current value of X,

705
00:35:09,220 --> 00:35:12,021
and make that the undo step.

706
00:35:12,021 --> 00:35:14,020
And then, keep track of
the change that was made

707
00:35:14,020 --> 00:35:17,860
and make that the redo step.
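
[Editor's sketch] An update record carrying both an undo value and a redo value might be modeled like this. The class name UpdateRecord, its field names, and the action ID 172 are illustrative assumptions, and the sketch stores plain values where the lecture speaks more generally of undo and redo actions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpdateRecord:
    action_id: int   # ID of the recoverable action that made the update
    name: str        # which cell store item changed
    undo_value: int  # value before the change
    redo_value: int  # value after the change

    def undo(self, cells):
        # Put back the value the cell had before this update.
        cells[self.name] = self.undo_value

    def redo(self, cells):
        # Install the value this update intended to write.
        cells[self.name] = self.redo_value


cells = {"x": 17}
# X goes from 17 to 25: capture the current value as undo, the new one as redo.
rec = UpdateRecord(action_id=172, name="x", undo_value=cells["x"], redo_value=25)

rec.redo(cells)
after_redo = dict(cells)
rec.undo(cells)
after_undo = dict(cells)
```

Because both steps simply install a stored value, applying either one twice leaves the same result, which is convenient if recovery itself is interrupted and rerun.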

708
00:35:17,860 --> 00:35:21,600
So now, after doing this,
if the system were to fail,

709
00:35:21,600 --> 00:35:26,070
and this action 172 were
to never commit then

710
00:35:26,070 --> 00:35:28,250
you can systematically
start with the log,

711
00:35:28,250 --> 00:35:29,950
start with the latest
item in the log

712
00:35:29,950 --> 00:35:34,670
and go backwards, and
undo any changes made

713
00:35:34,670 --> 00:35:37,110
by actions that didn't commit.

714
00:35:37,110 --> 00:35:39,970
And conversely, and you might
need to do this as well,

715
00:35:39,970 --> 00:35:42,780
you might want to look at all
the actions that committed,

716
00:35:42,780 --> 00:35:45,480
and make sure that all those
actions, those individual steps

717
00:35:45,480 --> 00:35:48,735
in those actions are redone so
that once the crash recovers,

718
00:35:48,735 --> 00:35:50,360
you have a correct
version of the data.

719
00:35:53,240 --> 00:35:55,170
Now the other thing
that you will need,

720
00:35:55,170 --> 00:36:00,970
and you'll see why in a moment,
is another kind of record

721
00:36:00,970 --> 00:36:07,214
in the log, which we're going
to call the outcome record.

722
00:36:07,214 --> 00:36:08,630
And this outcome
is the thing that

723
00:36:08,630 --> 00:36:11,554
keeps track of whether an
action committed or not.

724
00:36:11,554 --> 00:36:13,720
Remember I said you're going
to look through the log

725
00:36:13,720 --> 00:36:15,390
and figure out which
actions committed,

726
00:36:15,390 --> 00:36:16,580
and which didn't commit.

727
00:36:16,580 --> 00:36:18,220
You need to store
that somewhere.

728
00:36:18,220 --> 00:36:21,011
In particular, what that means
is that when an action commits,

729
00:36:21,011 --> 00:36:23,510
you had better make sure that
there is an item in the log

730
00:36:23,510 --> 00:36:25,660
because the log really is
the only correct version

731
00:36:25,660 --> 00:36:26,900
of the data.

732
00:36:26,900 --> 00:36:29,100
So you have an outcome
record, and this

733
00:36:29,100 --> 00:36:31,490
has an ID of the action.

734
00:36:31,490 --> 00:36:34,470
It might be 174.

735
00:36:34,470 --> 00:36:39,550
And, there's a status
that might say committed.

736
00:36:42,450 --> 00:36:45,230
And other values for the
status might be aborted

737
00:36:45,230 --> 00:36:49,250
is a possible value
of the status.

738
00:36:49,250 --> 00:36:50,770
Another is pending.

739
00:36:50,770 --> 00:36:56,770
So for various
reasons, what we will

740
00:36:56,770 --> 00:37:00,380
have is when begin recoverable
action returns with an ID,

741
00:37:00,380 --> 00:37:03,360
we will create a
log entry that says

742
00:37:03,360 --> 00:37:05,556
that this action has begun.

743
00:37:05,556 --> 00:37:06,930
So you might have
a begin record.

744
00:37:06,930 --> 00:37:10,200
It's not that important
to worry about for now.

745
00:37:10,200 --> 00:37:13,540
But the status of a committed
record and an aborted,

746
00:37:13,540 --> 00:37:17,300
and the update type are
important to understand.
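
[Editor's sketch] With update and outcome records in one log, a simple recovery pass could look like the following. This is one possible ordering chosen for illustration (undo backward, then redo forward), not necessarily the procedure the lecture goes on to develop; the tuple layouts and the function name recover are assumptions, and concurrency between actions is ignored.

```python
def recover(log, cells):
    """Hypothetical recovery: log entries are either
    ("update", action id, variable, old value, new value) or
    ("outcome", action id, status)."""
    committed = {
        entry[1]
        for entry in log
        if entry[0] == "outcome" and entry[2] == "committed"
    }
    # Backward pass: undo every update made by an action with no commit record.
    for entry in reversed(log):
        if entry[0] == "update" and entry[1] not in committed:
            cells[entry[2]] = entry[3]
    # Forward pass: redo every update made by a committed action, in case the
    # cell store never received it.
    for entry in log:
        if entry[0] == "update" and entry[1] in committed:
            cells[entry[2]] = entry[4]
    return cells


log = [
    ("update", 172, "x", 17, 25),   # action 172 never committed
    ("update", 174, "y", 1, 2),
    ("outcome", 174, "committed"),  # action 174 committed
]
# Assumed state after the crash: x was installed, y was not.
cells_after_crash = {"x": 25, "y": 1}
recovered = recover(log, dict(cells_after_crash))
```

Here the uncommitted change to x is backed out and the committed change to y is installed, and running the same pass again on the recovered state changes nothing.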

747
00:37:20,220 --> 00:37:23,800
So once you have this
log structure understood,

748
00:37:23,800 --> 00:37:26,390
or the log data
structure understood,

749
00:37:26,390 --> 00:37:28,780
what you have to
think about is that there

750
00:37:28,780 --> 00:37:31,520
are two questions that you
end up spending a lot of time

751
00:37:31,520 --> 00:37:35,430
thinking about in designing
these log-based protocols.

752
00:37:35,430 --> 00:37:37,830
The first one is when
to write the log.

753
00:37:45,040 --> 00:37:47,570
And the second one is,
you know, I sort of

754
00:37:47,570 --> 00:37:49,030
said you just look
through the log

755
00:37:49,030 --> 00:37:50,980
and undo the guys
who didn't commit,

756
00:37:50,980 --> 00:37:52,990
and redo the people
who committed.

757
00:37:52,990 --> 00:37:55,150
But you have to be very
careful about doing that.

758
00:37:55,150 --> 00:37:58,020
And that corresponds
to this question

759
00:37:58,020 --> 00:38:03,150
of exactly how to recover,
how to systematically recover

760
00:38:03,150 --> 00:38:05,920
so the state of the system is
as I have described before.

761
00:38:08,054 --> 00:38:09,970
So those are the questions
we're going to deal

762
00:38:09,970 --> 00:38:11,220
with for the next few minutes.

763
00:38:15,890 --> 00:38:18,580
Let's do this with
a specific example.

764
00:38:18,580 --> 00:38:20,190
And it will turn
out that the answer

765
00:38:20,190 --> 00:38:21,770
doesn't really depend
on the example.

766
00:38:21,770 --> 00:38:25,680
But the example is good to
give you the right intuition.

767
00:38:25,680 --> 00:38:28,690
And this example is actually
a pretty common example

768
00:38:28,690 --> 00:38:30,140
of a disk-bound database.

769
00:38:34,800 --> 00:38:40,270
So a disk bound
database is one where

770
00:38:40,270 --> 00:38:43,930
you have applications
writing to a database, which

771
00:38:43,930 --> 00:38:47,000
is where the cell
storage is implemented.

772
00:38:47,000 --> 00:38:48,560
And the cell storage is on disk.

773
00:38:52,020 --> 00:38:57,680
So, you might have
writes of cell items, X,

774
00:38:57,680 --> 00:39:00,240
and they go to a database.

775
00:39:00,240 --> 00:39:03,510
And similarly, in any
disk bound database

776
00:39:03,510 --> 00:39:05,240
that you want
crash recovery for,

777
00:39:05,240 --> 00:39:06,600
you need to maintain a log.

778
00:39:06,600 --> 00:39:09,699
And for various reasons
having to do primarily

779
00:39:09,699 --> 00:39:11,990
with dealing with failures
of the disk hardware itself,

780
00:39:11,990 --> 00:39:15,020
it's very often useful
in practice

781
00:39:15,020 --> 00:39:18,360
to maintain the log
on a different disk.

782
00:39:18,360 --> 00:39:20,230
So we'll maintain
for this example

783
00:39:20,230 --> 00:39:22,330
the log on a different disk.

784
00:39:22,330 --> 00:39:26,970
So whenever write X is done,
just looking at the log data

785
00:39:26,970 --> 00:39:31,030
structure, you need to
write an update record

786
00:39:31,030 --> 00:39:33,160
and append that to the log.

787
00:39:33,160 --> 00:39:36,530
So at some point you would
need to write this to the log.

788
00:39:36,530 --> 00:39:40,700
You need to log the update --

789
00:39:40,700 --> 00:39:47,650
-- that says that X changed from
something to something else.

790
00:39:47,650 --> 00:39:52,049
So the question is, when
do you write both of these?

791
00:39:52,049 --> 00:39:54,340
So one approach might be that
it really doesn't matter.

792
00:39:54,340 --> 00:39:56,990
As long as the log gets
the data, you're fine.

793
00:39:56,990 --> 00:39:59,180
But that has a
couple of problems.

794
00:39:59,180 --> 00:40:02,140
In particular,
suppose you write X

795
00:40:02,140 --> 00:40:04,064
without writing the log entry.

796
00:40:04,064 --> 00:40:06,230
And as soon as you write
X, before you have a chance

797
00:40:06,230 --> 00:40:10,140
to write to the log,
you crash, or the system

798
00:40:10,140 --> 00:40:14,420
causes this program to abort,
or the program itself aborts.

799
00:40:14,420 --> 00:40:17,250
It writes X and then it
does some calculation

800
00:40:17,250 --> 00:40:20,300
and then it decides to abort.

801
00:40:20,300 --> 00:40:25,260
Now you are in trouble because
the log hasn't kept track yet;

802
00:40:25,260 --> 00:40:27,360
the log hasn't had
a chance of keeping

803
00:40:27,360 --> 00:40:31,240
track of what the
old value was, which

804
00:40:31,240 --> 00:40:32,930
means that if you
really want to restore

805
00:40:32,930 --> 00:40:36,730
this database by
undoing this write to X,

806
00:40:36,730 --> 00:40:38,380
you have to do a
whole lot of work.

807
00:40:38,380 --> 00:40:40,110
And it might be
impossible to do it.

808
00:40:40,110 --> 00:40:42,900
If you didn't know, for example,
what the current value was,

809
00:40:42,900 --> 00:40:44,470
there was absolutely
no way for you

810
00:40:44,470 --> 00:40:48,890
to restore to the old value.

811
00:40:48,890 --> 00:40:52,910
So what this suggests is
that you better not write

812
00:40:52,910 --> 00:40:55,550
to the cell store before
you write to the log

813
00:40:55,550 --> 00:40:59,350
because if you wrote to
the cell store before the log write,

814
00:40:59,350 --> 00:41:03,190
and the system crashed right
after, or it failed or aborted,

815
00:41:03,190 --> 00:41:05,370
you won't really
have a way in general

816
00:41:05,370 --> 00:41:09,010
of reverting to the
version of the data item

817
00:41:09,010 --> 00:41:10,429
before this write.

818
00:41:10,429 --> 00:41:12,470
And you do need to revert
because it just aborted

819
00:41:12,470 --> 00:41:12,970
or failed.

820
00:41:12,970 --> 00:41:18,280
So you need to back out of
all changes that were made.

821
00:41:18,280 --> 00:41:21,630
So that suggests the
first part of our protocol

822
00:41:21,630 --> 00:41:23,615
which we are going to
call the WAL protocol.

823
00:41:26,809 --> 00:41:29,100
Actually, that is the WAL,
I mean, not the first part.

824
00:41:29,100 --> 00:41:30,474
This suggests this
WAL protocol.

825
00:41:30,474 --> 00:41:32,090
WAL stands for
write-ahead logging.

826
00:41:38,940 --> 00:41:46,930
And the protocol says update
the log or append to the log

827
00:41:46,930 --> 00:41:50,570
before you write
to the cell store.

828
00:41:50,570 --> 00:41:51,630
It's what it says.

829
00:41:51,630 --> 00:41:58,440
Write-ahead logging says write the
log before you write the cell

830
00:41:58,440 --> 00:42:00,500
store.

831
00:42:00,500 --> 00:42:03,630
The advantage of writing the
log before you write to the cell

832
00:42:03,630 --> 00:42:09,930
store is that suppose now
you set X to some value

833
00:42:09,930 --> 00:42:12,110
and then you crashed.

834
00:42:12,110 --> 00:42:15,400
Then you're guaranteed that
if the cell store got written,

835
00:42:15,400 --> 00:42:21,080
the log got written, which
means that if this action didn't

836
00:42:21,080 --> 00:42:22,960
commit, you can
go through the log

837
00:42:22,960 --> 00:42:26,570
and undo that action because
you know that the log entry got

838
00:42:26,570 --> 00:42:29,550
written correctly before
the cell store got written.

839
00:42:29,550 --> 00:42:31,300
And if the log entry
didn't get written,

840
00:42:31,300 --> 00:42:32,520
then you know the
cell store didn't

841
00:42:32,520 --> 00:42:34,853
get written, which means you
don't have to undo anything

842
00:42:34,853 --> 00:42:36,810
for that particular data item.

843
00:42:36,810 --> 00:42:39,040
So either way you're fine.
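
A minimal sketch of the write-ahead rule just described, in Python. This is an illustration under assumptions, not the lecture's implementation: the `WalStore` class, the tuple layout of the update record, and the use of an in-memory list and dict in place of an on-disk log and cell store are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class WalStore:
    """Toy store: an append-only log plus a dict standing in for the cell store."""
    log: list = field(default_factory=list)
    cells: dict = field(default_factory=dict)

    def write(self, action_id, name, new_value):
        # Assumes the cell already exists, so there is an old value to record.
        old_value = self.cells[name]
        # Write-ahead rule: append the update record (old and new value) first.
        self.log.append(("update", action_id, name, old_value, new_value))
        # Only after the log append do we install the value in the cell store.
        self.cells[name] = new_value

    def undo_from_log(self, action_id):
        # Walk the log backwards and restore the old value of each update
        # made by this action.
        for record in reversed(self.log):
            if record[0] == "update" and record[1] == action_id:
                _, _, name, old_value, _ = record
                self.cells[name] = old_value
```

Because the log entry always exists by the time the cell changes, an uncommitted write to X can be backed out from the log alone.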

844
00:42:39,040 --> 00:42:44,380
There is another
part of this protocol

845
00:42:44,380 --> 00:42:46,300
that we're going to need
to meet the semantics

846
00:42:46,300 --> 00:42:48,870
of a recoverable
action that we wanted,

847
00:42:48,870 --> 00:42:51,550
which is that once
you reach commit,

848
00:42:51,550 --> 00:42:54,110
you want the changes
made by that action

849
00:42:54,110 --> 00:42:55,990
to be visible to all
the other people,

850
00:42:55,990 --> 00:42:59,590
all of the other actions
that are subsequent actions.

851
00:42:59,590 --> 00:43:02,990
And what that means is
that before you return

852
00:43:02,990 --> 00:43:05,860
from the commit, you
had better make sure

853
00:43:05,860 --> 00:43:09,740
that the commit record for this
action is logged to the disk,

854
00:43:09,740 --> 00:43:15,950
is logged, because if you didn't
do that, and you just returned,

855
00:43:15,950 --> 00:43:23,150
then you can't be guaranteed
that all of the writes that

856
00:43:23,150 --> 00:43:25,270
were done to the cell
item were actually

857
00:43:25,270 --> 00:43:26,780
put on to the cell store.

858
00:43:26,780 --> 00:43:28,530
There's no guarantee
that these writes

859
00:43:28,530 --> 00:43:30,640
to the cell store actually
got written to the cell

860
00:43:30,640 --> 00:43:32,640
store because all you are
doing in this protocol

861
00:43:32,640 --> 00:43:34,600
is ensuring that the
writes to the log

862
00:43:34,600 --> 00:43:36,150
are being written before
the writes to the data.

863
00:43:36,150 --> 00:43:38,233
Nobody is saying when the
writes of the cell store

864
00:43:38,233 --> 00:43:39,880
really are happening
and finishing,

865
00:43:39,880 --> 00:43:45,080
which means if the
action commits,

866
00:43:45,080 --> 00:43:48,180
and you return committed to
the user, to the application,

867
00:43:48,180 --> 00:43:50,280
then you had better have
a way of making sure

868
00:43:50,280 --> 00:43:51,660
that if the failure
now happened,

869
00:43:51,660 --> 00:43:54,290
the system when
it recovers knows

870
00:43:54,290 --> 00:43:58,070
that this action committed,
which means it follows, then,

871
00:43:58,070 --> 00:44:01,100
that if you want those
semantics that you'd better

872
00:44:01,100 --> 00:44:05,490
write the commit record, the
fact that this action committed,

873
00:44:05,490 --> 00:44:08,200
to the log before
the commit returns.

874
00:44:08,200 --> 00:44:10,950
And really the only reason
you need that is that

875
00:44:10,950 --> 00:44:13,520
we've established; we've decided
that we wanted the semantics

876
00:44:13,520 --> 00:44:15,978
that when an action commits,
you want the results to be

877
00:44:15,978 --> 00:44:17,150
visible to everybody else.

878
00:44:17,150 --> 00:44:20,640
And later on, we'll see
that this is related

879
00:44:20,640 --> 00:44:23,730
to this notion of durability.

880
00:44:23,730 --> 00:44:30,620
So write commit record before --

881
00:44:35,120 --> 00:44:46,880
returning from commit.

882
00:44:46,880 --> 00:44:49,890
So two main ideas: write-ahead
logging means

883
00:44:49,890 --> 00:44:52,060
make sure that you write
the log, append to the log

884
00:44:52,060 --> 00:44:54,070
before you write
to the cell store.

885
00:44:54,070 --> 00:44:57,400
And in order to make sure that
committed actions, the results

886
00:44:57,400 --> 00:45:00,010
of committed actions are
visible even after failure

887
00:45:00,010 --> 00:45:02,254
to subsequent actions,
log the commit record

888
00:45:02,254 --> 00:45:03,670
before you return
from the commit.
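
The second rule, logging the commit record before commit returns, might look like the following Python sketch. The `commit` function name, the JSON-lines record format, and the `log_path` parameter are assumptions made for illustration; the only library calls used are the standard `open`, `flush`, and `os.fsync`.

```python
import json
import os


def commit(log_path, action_id):
    """Append a commit record and force it out before reporting success."""
    record = {"type": "outcome", "action": action_id, "status": "committed"}
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(record) + "\n")
        log_file.flush()             # push Python's buffer to the operating system
        os.fsync(log_file.fileno())  # ask the OS to push the data to the device
    # Only now is it safe to tell the caller the action committed.
    return "committed"
```

If the system fails right after this function returns, recovery can find the commit record in the log and knows the action's results must be made visible.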

889
00:45:11,070 --> 00:45:12,570
So now we are
actually in good shape

890
00:45:12,570 --> 00:45:17,950
to specify this
recovery procedure

891
00:45:17,950 --> 00:45:20,870
that I've alluded to
before because the log is

892
00:45:20,870 --> 00:45:23,940
going to contain these update
records and these outcome

893
00:45:23,940 --> 00:45:25,390
records.

894
00:45:25,390 --> 00:45:27,830
And that's going to
allow us to decide

895
00:45:27,830 --> 00:45:30,496
what to do upon crash recovery.

896
00:45:30,496 --> 00:45:31,870
And actually the
only other piece

897
00:45:31,870 --> 00:45:35,299
we need is to decide
what happens on an abort.

898
00:45:35,299 --> 00:45:37,090
And that's actually
pretty straightforward.

899
00:45:37,090 --> 00:45:39,390
If the system calls abort,
or if the user application

900
00:45:39,390 --> 00:45:42,840
calls abort on an action,
what abort has to do

901
00:45:42,840 --> 00:45:44,500
is to look through the log.

902
00:45:44,500 --> 00:45:47,420
Remember that all of the
writes have been written.

903
00:45:47,420 --> 00:45:49,314
Any time a write happens,
you don't actually

904
00:45:49,314 --> 00:45:50,730
care about when
the write actually

905
00:45:50,730 --> 00:45:52,180
happens at the cell store.

906
00:45:52,180 --> 00:45:56,120
What you care about is that
the write happens to the log

907
00:45:56,120 --> 00:45:58,600
before the write happens
to the cell store.

908
00:45:58,600 --> 00:46:01,160
So, if an abort were
called, all you have to do

909
00:46:01,160 --> 00:46:03,890
is to ensure that
before abort returns,

910
00:46:03,890 --> 00:46:08,720
all of the actions done
by, all of the steps taken

911
00:46:08,720 --> 00:46:12,250
by this action are undone, and
the corresponding cell values

912
00:46:12,250 --> 00:46:12,920
are undone.

913
00:46:15,860 --> 00:46:19,010
And that's all you have to
do when you implement abort.
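
As a rough Python sketch of that abort procedure: the `abort` function, the tuple layout `("update", action_id, name, old_value, new_value)`, and the in-memory `log` list and `cells` dict are hypothetical stand-ins chosen for illustration.

```python
def abort(log, cells, action_id):
    """Undo every step of action_id, then record that it aborted."""
    # Scan backwards so the most recent write to a cell is undone first and
    # the cell ends at the value it had before the action started.
    for record in reversed(log):
        if record[0] == "update" and record[1] == action_id:
            _, _, name, old_value, _ = record
            cells[name] = old_value
    # The abort outcome record is appended only after the undo is done.
    log.append(("outcome", action_id, "aborted"))
```

Appending the outcome record last is what lets recovery skip aborted actions: by the time the record exists, their changes have already been backed out.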

914
00:46:22,550 --> 00:46:27,880
So one thing that I haven't
really specified very clearly

915
00:46:27,880 --> 00:46:30,310
is when the actual
writes happen to the disk

916
00:46:30,310 --> 00:46:31,960
or to any cell store.

917
00:46:31,960 --> 00:46:36,540
And it turns out that it
really doesn't matter.

918
00:46:36,540 --> 00:46:38,791
If there's no failure,
as long as you ensure,

919
00:46:38,791 --> 00:46:40,290
you could have
caches in the middle.

920
00:46:40,290 --> 00:46:41,498
You could have anything else.

921
00:46:41,498 --> 00:46:44,940
So, as long as you ensure that
if there's no concurrency,

922
00:46:44,940 --> 00:46:46,290
we'll deal with that next time.

923
00:46:46,290 --> 00:46:47,995
But as long as you
ensure that when

924
00:46:47,995 --> 00:46:50,120
you have actions that come
one after the other that

925
00:46:50,120 --> 00:46:53,390
are recoverable that
the values that are read

926
00:46:53,390 --> 00:46:57,810
are only the values that
were written by previously

927
00:46:57,810 --> 00:47:00,350
committed actions,
then it really

928
00:47:00,350 --> 00:47:03,750
doesn't matter when those
were actually written to disk.

929
00:47:03,750 --> 00:47:06,610
The main
thing that matters is

930
00:47:06,610 --> 00:47:11,040
make sure the log keeps track
exactly of all the things

931
00:47:11,040 --> 00:47:13,180
to undo for uncommitted actions.

932
00:47:13,180 --> 00:47:14,680
And for things
that got committed,

933
00:47:14,680 --> 00:47:19,610
to make sure that the log keeps
track of the commit record

934
00:47:19,610 --> 00:47:20,780
before the commit returns.

935
00:47:24,730 --> 00:47:27,430
So given this story, the
way the recovery procedure

936
00:47:27,430 --> 00:47:28,830
works is the following.

937
00:47:28,830 --> 00:47:31,676
The first step is the system
fails, and then it recovers.

938
00:47:31,676 --> 00:47:32,800
You scan the log backwards.

939
00:47:39,690 --> 00:47:41,820
And as you are scanning
the log backwards,

940
00:47:41,820 --> 00:47:45,430
you keep track of
two kinds of actions.

941
00:47:45,430 --> 00:47:50,490
You keep track of actions that
were either committed or were

942
00:47:50,490 --> 00:47:52,350
aborted, OK?

943
00:47:52,350 --> 00:47:55,620
And what that means
is that for actions

944
00:47:55,620 --> 00:47:58,600
that were committed or
aborted, the cell store

945
00:47:58,600 --> 00:48:01,400
for those actions is
in a certain state

946
00:48:01,400 --> 00:48:03,400
or needs to be in
a certain state.

947
00:48:03,400 --> 00:48:05,080
For committed actions,
it needs to be

948
00:48:05,080 --> 00:48:08,470
in a state that's the result of
finishing the committed action.

949
00:48:08,470 --> 00:48:10,410
And for the aborted
actions, what it means

950
00:48:10,410 --> 00:48:12,780
is that when the abort
returned and there

951
00:48:12,780 --> 00:48:15,310
was an aborted
action, abort already

952
00:48:15,310 --> 00:48:17,249
undid the state
of the cell store

953
00:48:17,249 --> 00:48:19,540
by definition, by the definition
of the abort procedure.

954
00:48:19,540 --> 00:48:22,190
So what that means is
for log records that

955
00:48:22,190 --> 00:48:24,834
contain a type outcome
and the status abort

956
00:48:24,834 --> 00:48:26,250
that you don't
have to do anything

957
00:48:26,250 --> 00:48:29,050
because the changes are
already undone before

958
00:48:29,050 --> 00:48:31,400
that abort record was written.

959
00:48:31,400 --> 00:48:33,360
So what you do in
scanning the log backwards

960
00:48:33,360 --> 00:48:35,550
is you build up two
kinds of actions.

961
00:48:35,550 --> 00:48:38,140
You build up winners,
which are actions

962
00:48:38,140 --> 00:48:42,355
that were committed or aborted.

963
00:48:45,300 --> 00:48:51,182
And you build up a list of
losers that were none of these.

964
00:48:51,182 --> 00:48:52,890
In other words, they
were pending actions

965
00:48:52,890 --> 00:48:56,320
that kind of just during a
failure they were pending,

966
00:48:56,320 --> 00:48:57,340
so they didn't commit.

967
00:48:57,340 --> 00:48:58,506
And they were never aborted.

968
00:49:06,150 --> 00:49:08,160
And so the plan
now is to make sure

969
00:49:08,160 --> 00:49:10,820
that the cell store is correctly
restored to the state that

970
00:49:10,820 --> 00:49:16,010
was before the crash where
all of the committed actions'

971
00:49:16,010 --> 00:49:20,080
results are visible, and none
of the uncommitted actions,

972
00:49:20,080 --> 00:49:22,430
you know, all of
those are blown away.

973
00:49:22,430 --> 00:49:26,835
All you have to do is
to [UNINTELLIGIBLE]

974
00:49:26,835 --> 00:49:27,460
were committed.

975
00:49:27,460 --> 00:49:29,220
You don't have to do anything
for the aborted winners

976
00:49:29,220 --> 00:49:30,910
because they were
already undone.

977
00:49:30,910 --> 00:49:37,370
So you have to redo
committed winners,

978
00:49:37,370 --> 00:49:46,020
and you have to undo any
changes made by losers, right,

979
00:49:46,020 --> 00:49:47,680
because these
losers by definition

980
00:49:47,680 --> 00:49:50,320
were things that didn't
commit or didn't abort.

981
00:49:50,320 --> 00:49:53,350
And the reason you only redo the
committed winners rather than

982
00:49:53,350 --> 00:49:55,735
all winners is it makes no
sense to redo aborted winners.

983
00:49:55,735 --> 00:49:58,110
And you don't need to undo
them because they were already

984
00:49:58,110 --> 00:50:06,410
undone when the abort record
was written to the log.
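
The recovery procedure can be sketched in Python roughly as follows. This is an illustration under stated assumptions rather than the lecture's own code: the `recover` function, the record tuples `("update", action_id, name, old_value, new_value)` and `("outcome", action_id, status)`, and the in-memory `log` and `cells` are hypothetical, and actions are assumed to run one after another with no concurrency.

```python
def recover(log, cells):
    """Rebuild the cell store from the log after a crash."""
    committed, aborted, losers = set(), set(), set()

    # Backward scan: an action's outcome record follows its updates in the
    # log, so scanning backwards we see the outcome (if any) first.
    for record in reversed(log):
        kind, action_id = record[0], record[1]
        if kind == "outcome":
            if record[2] == "committed":
                committed.add(action_id)
            else:
                aborted.add(action_id)
        elif kind == "update" and action_id not in committed | aborted:
            # No outcome record: a loser. Undo by restoring the old value.
            losers.add(action_id)
            cells[record[2]] = record[3]

    # Forward scan: redo the updates of committed winners. Aborted winners
    # are skipped because abort already undid them.
    for record in log:
        if record[0] == "update" and record[1] in committed:
            cells[record[2]] = record[4]

    return committed, aborted, losers
```

The result does not depend on which cell-store writes had reached the disk before the crash, which is why the timing of those writes did not need to be specified.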

985
00:50:06,410 --> 00:50:08,460
So this is the basic
idea for dealing

986
00:50:08,460 --> 00:50:11,350
with one of these databases.

987
00:50:11,350 --> 00:50:13,050
But there's five or
six optimizations

988
00:50:13,050 --> 00:50:16,875
that end up making this
kind of system go faster.

989
00:50:16,875 --> 00:50:18,750
You'll see some of these
optimizations buried

990
00:50:18,750 --> 00:50:22,190
inside the System R paper, which
is the discussion for tomorrow.

991
00:50:22,190 --> 00:50:25,750
But what I'll do on Monday,
I'll spend five minutes talking

992
00:50:25,750 --> 00:50:28,230
about the most
important optimizations,

993
00:50:28,230 --> 00:50:30,440
and I think the whole
story will become clear.

994
00:50:30,440 --> 00:50:32,660
So the plan for the subsequent
lectures on this topic

995
00:50:32,660 --> 00:50:34,850
is: on Monday we'll
deal with isolation,

996
00:50:34,850 --> 00:50:37,580
and on Wednesday we'll continue
to talk about isolation,

997
00:50:37,580 --> 00:50:41,140
and then talk about a
different issue of consistency.