We are continuing our discussion of fault tolerance and atomicity. Teaching these lectures makes me feel, at the beginning, like those TV shows that always recap everything that happened so far in the season before a new episode. So we will do the same thing.

The story so far is that, in order to deal with failures, we came up with the idea of giving modules the property of atomicity, which actually has two aspects to it. One is an all-or-nothing aspect, which we call recoverability, and the other is a way to coordinate multiple concurrent activities so you get the illusion that they are all separate from each other. We call that isolation.

The basic rule for achieving recoverability was one rule we applied repeatedly, the "Golden Rule of Recoverability": never modify the only copy. We used that rule to build up the idea of a recoverable sector, and we used the recoverable sector to come up with two schemes for achieving recoverability. One used version histories, where you had a special case of the "never modify the only copy" rule, namely never modify anything: for any given variable you create lots and lots of versions, never updating anything in place. Then we decided that was inefficient, so we came up with a different way of achieving recoverability using logging.

Then, for isolation, we talked last time about serializability, where the goal is to allow the steps of the different actions to run in such a way that the result is as if they had run in some serial order. And we talked about a way of achieving that with cell storage. In particular, we talked about using locks as an abstraction, as a programming primitive, to achieve isolation. The key idea we saw was the connection between serializability and the absence of cycles in a data structure called the action graph.
As long as you could argue that, for a given method of locking, the resulting action graph had no cycles, you were guaranteed serializability, and therefore the scheme provided isolation. In particular, a scheme we looked at near the end was two-phase locking, where the idea is that you never acquire a lock for any item if you have already released any other lock in the atomic action so far.

That is why it is called two-phase locking: there is a lock acquisition phase, in which locks are only being acquired and nothing is being released, so the number of locks held only grows with time; and then there is a certain point in the action after which you can only release locks. The moment you have released any lock, you cannot acquire another one.

The way we argued that this protocol achieves isolation was to consider the action graph resulting from some execution of two-phase locking and argue that if there were a cycle in that action graph, then two-phase locking would have been violated. Therefore two-phase locking produces an action graph with no cycles and achieves serializability.

Two-phase locking is fine, and it is a really good idea if you are into using locks. It has the property that the action does not need to know at the beginning which data items it is going to access. All you need to do is make sure that you do not release anything until everything has been acquired; you do not have to know which locks to acquire before the start of the action. You just keep acquiring them on demand until, typically, you get to the commit point, and once you commit you can release the locks.

Now, in theory you can release a lock at any time once you are sure you are not going to acquire any more locks, but that theoretical approach only works if you are guaranteed that there will be no aborts.
In general, you cannot know beforehand when an action might abort. The system might decide to abort an action for a variety of reasons, and we will see some of those reasons today. In practice, what happens when you abort is that you go back through the log and run the undo steps associated with the steps that happened in the action. Because the undo goes into cell storage and uninstalls whatever changes were made, it had better be the case that, when abort starts undoing things, the cell storage items being undone still have their locks owned by the action that is doing the undo.

What this means is that if you have a set of statements doing, say, a read of x and a write of y, followed by the commit point, then the last read or write happens before the commit; after that you might be doing some computation that does not involve any reads or writes. The action might abort anywhere along the way, because the process, or this thread, might be terminated, and then the action will have to abort. So the lock on any data item that abort needs in order to undo the state changes had better not be released before that point. If the lock were released earlier, some other action could have acquired it and started working with the changes made by this action, and then it is too late to abort: someone else has already seen the changes. In fact, you cannot even be guaranteed that you can later regain the lock, and the results would be wrong.

So the two-phase locking rule really is that you cannot release any lock until all of the locks have been acquired.
And, moreover, any locks that are needed for abort to run successfully had better not be released until you are sure the action won't abort anymore. The only time you can be sure the action won't abort anymore is once the commit has been done. What that really means is that the release of the locks on all of the items required for undoing the action had better happen after the commit point; and, in addition, no locks at all should be released until all of the acquires have been done.

The reason I have said this in two parts is that if you are just reading a data item, you do not actually need to hold onto that item's lock in order to do the undo, because all you did was read x. No change happened to x. You do need to acquire the lock on x in order to read it, because you do not want other people making changes to it while you are reading it, but you do not need to keep holding it for the undo, because you are not writing x during the undo step.

So that is the amendment to the two-phase locking rule: locks on things you need in order to do undos should only be released after you are sure that no abort will happen, which means after the commit point.

This way of doing two-phase locking is actually a pretty good scheme, and it turns out that, in many ways, it is the most efficient and most general method. There may be special cases where other locking protocols perform better than two-phase locking, but if you have bought into using locks for concurrency control and you do not know very much about the nature of the actions involved, then two-phase locking is quite efficient. There are variants of two-phase locking, but by and large it is very hard to do much better than this in a general sense if you are using locking for concurrency control.
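To make the discipline concrete, here is a minimal sketch in Python. The Transaction class, the lock table, and the undo-log representation are hypothetical names chosen for illustration, not part of any particular system: locks are acquired on demand as items are touched, and nothing is released until commit, or until abort has finished its undo steps.

    import threading

    class Transaction:
        """Sketch of strict two-phase locking: locks are acquired on demand as
        data items are touched, and nothing is released until after commit, or
        until abort has finished running its undo steps."""

        def __init__(self, lock_table):
            self.lock_table = lock_table   # item name -> threading.Lock
            self.held = []                 # locks acquired so far (growing phase)
            self.undo_log = []             # (item, old value) pairs for abort

        def read(self, store, item):
            self._acquire(item)
            return store[item]

        def write(self, store, item, value):
            self._acquire(item)
            self.undo_log.append((item, store[item]))  # keep the old copy around
            store[item] = value

        def _acquire(self, item):
            lock = self.lock_table[item]
            if lock not in self.held:      # growing phase: acquire, never release
                lock.acquire()
                self.held.append(lock)

        def commit(self):
            self._release_all()            # the shrinking phase starts only here...

        def abort(self, store):
            for item, old in reversed(self.undo_log):
                store[item] = old          # undo still owns every lock it needs
            self._release_all()            # ...or here, once undo is finished

        def _release_all(self):
            for lock in self.held:
                lock.release()
            self.held.clear()

The point the sketch captures is the amended rule above: release happens in exactly one place, after the commit point or after the undo steps have run.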
But there is a set of problems. It is not the case that two-phase locking, as we have described it so far, completely solves the problem of ensuring that actions perform well. A particular problem that arises any time you use locks like this is deadlock. We have actually seen deadlocks before, in an earlier chapter, when we talked about synchronization of threads; it is exactly the same problem, and the way you deal with it is pretty much the same.

What is the problem here? Well, one action might do a read of x and a write of y, and another action might do a read of y and a write of x. Now the acquires get interleaved: the first action acquires lx and then goes to acquire ly, while the second action acquires ly and then goes to acquire lx. Once both actions have gotten that far, you are stuck: each one has to wait until the other releases its lock, and neither can make progress.

There are a few different ways of dealing with this. The simplest way, and the one that is often used in practice, both because it is simple and because once you implement it you do not have to do much else, is to just set timers on actions, in other words to time out. If you notice that an action has not made any progress for some period of time, via a timeout associated with the action, or perhaps because another thread notices it, then just go ahead and abort that action. It is perfectly OK to abort, and in this particular case aborting either of the actions is enough: the other will make progress, and the aborted action can then retry. So the first solution is to just use a timer.
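As a rough sketch of what that looks like in code, assuming hypothetical locks lx and ly for the two items in the example, one might wrap every acquire in a timeout and treat a timeout as a signal to abort and retry; this is only an illustration, not code from the lecture.

    import threading

    # Hypothetical locks for the two data items: action A acquires lx and then
    # wants ly, while action B acquires ly and then wants lx.
    lx, ly = threading.Lock(), threading.Lock()

    class DeadlockSuspected(Exception):
        """Raised when a lock cannot be obtained within the timeout."""

    def acquire_with_timeout(lock, seconds=3.0):
        # Rather than waiting forever, give up after a while; the caller then
        # aborts (undoing its changes, as described earlier) and retries.
        if not lock.acquire(timeout=seconds):
            raise DeadlockSuspected("no progress for %.1fs, aborting this action" % seconds)

    def action_a():
        acquire_with_timeout(lx)
        acquire_with_timeout(ly)      # times out if action B already holds ly
        # ... read x, write y, commit, then release both locks ...

    def action_b():
        acquire_with_timeout(ly)
        acquire_with_timeout(lx)      # times out if action A already holds lx
        # ... read y, write x, commit, then release both locks ...

If the two actions end up waiting on each other, one of them eventually gives up and aborts, which is exactly the timer-based resolution described above.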
There is a school of thought that believes that, in practice, deadlocks should not be very common. The reason is that a deadlock requires contention: multiple threads contending for the same resources, and more than one resource, because with just one resource you cannot really get a deadlock. So a deadlock means you are running multiple actions that are contending for a number of different shared objects.

What that suggests is that if there is that high a degree of concurrency and shared contention, it may be hard to get high performance anyway. A lot of people think that the right way to design applications is to try hard to ensure that the degree of sharing of objects is quite small. For example, rather than setting up a lock on an entire big database table, you might set up locks at finer granularities. If you lock at finer granularities, the chances of multiple actions wanting access to the exact same fine-grained entry in a table might be small. In that situation, given that deadlocks are rare, timing out every once in a while and aborting an action is not catastrophic. It is OK; it is a rare event. So rather than spending a whole lot of complexity dealing with that rare event, just go ahead and let an action that has not made any progress abort.

Moreover, these timers are necessary anyway, because an action might get stuck in an infinite loop, or get stuck in a situation where it is not really waiting for a lock, there is just a bug in it: it is not making any progress, maybe it is consuming resources, and no one else can make progress. So the system needs a way to abort those actions, and it needs a timeout mechanism anyway.
So why not just use that same mechanism to deal with deadlocks as well?

Probably a minority, but some other people believe that deadlocks will happen and, when they do happen, perhaps because the granularity of locking in your system is not fine, you do not want to get stuck. You want to do reasonably well rather than waiting for some long timeout period before aborting an action. People who believe that build a data structure called the "Waits-For Graph".

The best way to understand this is to imagine you have a database system that supports isolation, and any time you want to acquire a lock you send a message to an entity in the database system called the lock manager asking to acquire it, and any time you release a lock you do the same thing. For each lock, the lock manager can keep track of which of the concurrently running actions has acquired that lock and which actions are waiting for it. What you can do now is build up a graph of actions and locks and look to see whether there is a cycle: action A is waiting for lock B, lock B is held by action C, action C is waiting for lock D, and lock D is held by action A. When you have a cycle in this graph you know you have a deadlock, none of those actions can make progress, so go ahead and kill one. You can be sophisticated about deciding which one to kill. You might kill, for example, the one that has been waiting the shortest amount of time, because the others have been waiting longer and might make progress sooner, or you might have other policies for deciding which ones to kill.

In practice, both of these approaches are used, sometimes combined in the same system. For example, if you look at an Oracle database system, it uses primarily timers. At least from what I could tell, it does not seem to have any mechanism for really doing this check of a Waits-For graph.
It just uses timers. One of the oldest transaction processing systems was a system called CICS from IBM, which also basically used timers. But there are other systems: IBM has a system called DB2, and Microsoft has SQL Server, and both use this Waits-For data structure. In fact, Microsoft's system seems to have a hundred thousand different knobs for deciding how to handle deadlocks, including the ability to set various priorities on the different actions that might be running. It is not actually apparent that those knobs are useful for anything, or how you should set them, but they give you a lot of things you could set. Sounds familiar.

Now, you can combine these two, and I think certain products do combine these two ideas. One decision you have to make is when to check the Waits-For graph. An aggressive way of doing it is that the moment anybody does an acquire or a release, in particular an acquire, you update your lock manager's data structure and immediately look to see if you have a cycle. Of course that takes time and effort. You might decide not to bother with that and instead periodically look for cycles in the Waits-For graph when a timer fires, say every three seconds. So you might combine these ideas in a bunch of different ways.

Now, if you recall from several lectures ago, another way of dealing with deadlock is to put all of the locks that an action might acquire in a particular order and ensure that all of the actions acquire their locks in exactly that same order. That ensures there are no cycles, because everyone has to go in the same order, but the idea requires you to know beforehand which data items you wish to access. That is often not possible in the systems in which you care about isolation, so it is usually not adopted, at least not in database systems.

OK, so we talked about deadlocks.
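To make the Waits-For idea concrete, here is a minimal sketch of a lock manager that records wait edges and checks for a cycle on every acquire. The class, the record layout, and the abort-the-newest-waiter policy are illustrative choices, not how DB2 or SQL Server actually implement it.

    from collections import defaultdict

    class LockManager:
        """Sketch of a lock manager that maintains a waits-for graph."""

        def __init__(self):
            self.holder = {}                      # lock -> action currently holding it
            self.waiting_for = defaultdict(set)   # action -> set of actions it waits on

        def request(self, action, lock):
            owner = self.holder.get(lock)
            if owner is None or owner == action:
                self.holder[lock] = action        # granted immediately
                return True
            self.waiting_for[action].add(owner)   # record the wait edge
            if self._reaches(owner, action):      # did this edge close a cycle?
                self.waiting_for[action].discard(owner)
                raise RuntimeError("deadlock: abort %r (the newest waiter)" % action)
            return False                          # caller blocks until released

        def release(self, action, lock):
            if self.holder.get(lock) == action:
                del self.holder[lock]
            # a real manager would now grant the lock to some waiter and
            # remove the corresponding wait edges

        def _reaches(self, start, target):
            """Depth-first search over wait edges: can `start` reach `target`?"""
            stack, seen = [start], set()
            while stack:
                a = stack.pop()
                if a == target:
                    return True
                if a in seen:
                    continue
                seen.add(a)
                stack.extend(self.waiting_for.get(a, ()))
            return False

A lazier variant would skip the check inside request() and instead run the same reachability search over all recorded edges when a periodic timer fires.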
We talked about when you can release a lock: any lock that abort would need cannot, in practice, be released until the commit point. The last issue we need to talk about is an interaction between logs and locks.

We already saw what happens when you abort: you need to undo, so you had better make sure you hold the locks for those cell items in order to do the undo. But you do not have to abort; suppose instead the system crashes and recovers. When it recovers, it is going to run a recovery procedure, which is some combination of redoing the winners and undoing the losers. When it is undoing and redoing things, it needs access to items in the cell store. And we have already seen that when the system is running normally, in order to change items in cell storage you need to hold the locks. The question now is, during crash recovery, while the system is running this redo and undo procedure, where do you get these locks from, and do you need to acquire them at all?

In general, the answer might be that you need to be very careful and perhaps do need the locks when running recovery. But there is one simplification that systems typically make that eliminates the requirement, and that simplification is that during crash recovery you do not allow new actions to run. When a system crashes and is recovering, do not allow new actions to run until recovery is complete, and only then start new actions. What this means is that we just have to worry about ensuring isolation during recovery without new actions coming in and muddling things up.
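Here is a minimal sketch of such a recovery pass, assuming a hypothetical in-memory log of update records and commit markers; it is only meant to show the shape of redo-the-winners, undo-the-losers, run while no other actions are in the system.

    def recover(log, cell_store):
        """Sketch of a recovery pass over an undo/redo log. Records are assumed
        to be dicts like {"type": "UPDATE", "action": id, "item": name,
        "old": value, "new": value} plus {"type": "COMMIT", "action": id}
        markers. No locks are taken: no new actions run until this completes."""
        committed = {rec["action"] for rec in log if rec.get("type") == "COMMIT"}

        # Redo the winners: reapply, in log order, every change made by a
        # committed action, so cell storage reflects its effects.
        for rec in log:
            if rec.get("type") == "UPDATE" and rec["action"] in committed:
                cell_store[rec["item"]] = rec["new"]

        # Undo the losers: walk backwards and restore the old values written
        # by actions that never committed.
        for rec in reversed(log):
            if rec.get("type") == "UPDATE" and rec["action"] not in committed:
                cell_store[rec["item"]] = rec["old"]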
The question really to think about is whether, before the crash, you need the log to keep track of which locks were being held while the system was running normally, because the log is the only thing you have in order to do recovery. If the log had to encode the locks that were being held, that could be quite complicated and a little bit messy. But if you think about it, the nice thing is that we do not have to store the locks at all; the locks can be left out of the log entirely. The reason is that if you have a log with various redo and undo items, and some log entry records an update to an item x, then you know for sure that at the time this log entry was written, the action making the update did hold the lock on x, and that the change written to the log was, assuming the locking protocol was correct, isolated from everything else that was going on concurrently. So, although the locks are not explicit, the log encodes some serial order of execution that did provide isolation before the crash. Therefore, if you just go back through the log and make those changes in sequential order, you are assured that the changes you make are isolated from one another. So you do not have to worry about storing the locks in the log before the crash, and that makes life quite simple.

That wraps up the discussion of atomicity and, in particular, isolation. For the rest of today and next time we are going to be talking about some uses of atomicity. And the plan is the following. The first application of atomicity, which is actually the umbrella for a number of things we are going to look at, is a transaction.
A transaction is defined as an atomic action that has a few other properties. The first property is consistency and the second property is durability. And the second thing we are going to look at, next lecture actually, is atomicity when you have a distributed system: using atomicity on one computer to build a system that provides atomicity across a distributed system.

So we will talk about consistency for the rest of today, and the recitation tomorrow looks at a paper on reconciling replicas, which is a particular aspect of consistency. Then next week's lecture will talk about multi-site atomicity, and next week's recitation will talk about durability. Once we do all of that, that pretty much wraps up the fault-tolerance part of 6.033.

Let me first talk a little bit about transactions. A transaction is an atomic action that has two other properties associated with it. People in the literature often, in colloquial terms, refer to transactions as having the ACID property, where ACID stands for atomicity, consistency, isolation and durability. You will see this term a great deal in the literature, and people use it all the time. For various reasons, the way we have done things in this class, some of these terms are used in slightly different ways from the ACID usage. When most people, at least in distributed systems and database systems, use the word atomicity, what they mean is what we meant by recoverability: it is all or nothing. When they use the letter I for isolation, they mean exactly the same thing we did. And consistency and durability are going to mean exactly the same thing in both usages.

But really the point to notice is that the first two properties, atomicity and isolation, are independent of the application. They are simply properties of atomic actions: an atomic action can be recoverable and it can be isolated.
You do not have to worry about what the application is. It could be an application in a database system; it could be something in a processor where you are trying to provide recoverability or isolation for instructions. These are, in some sense, more fundamental, lower-layer properties than the other two.

What consistency means is that some application-specific invariant holds. Consistency of a transaction says that if a transaction commits, then some set of consistency invariants must hold; I will describe some examples of what this means. Consistency just says that there are some application-specific invariants that must hold.

And durability says that if a transaction commits, then the state changes it has made, the data items it has changed, have to last for some period of time, and the period of time they have to last for is defined by the application. There are many examples. A simple example of durability might be that the changes made by an atomic action just have to last until the entire thread finishes. At the other extreme, you could get into semantics of durability which say that the changes made by an atomic action have to last for three years, or five years, or forever, which is a really hard thing to solve. But you might define semantics that relate to the permanence of data: for how long do you want the changes that you made to last and be visible to other atomic actions?

There are two cases for consistency that we need to talk about. The first one is consistency in a centralized system. The most common example of this is in database systems that support transactions, where you might have rules, also called integrity rules, for deciding whether to allow a transaction to commit or not. Let me give you a couple of examples.
Let's say that you have a relational database system, where all of the data is stored in tables. For example, you might have a table storing a student ID, a student name and, let's say, the ID of the department that the student belongs to. And you might have another table in your system that stores a department ID and a department name.

Now, you might have a transaction that makes updates to entries in the student table; it might update one or more rows, or just specific cells, or it could add a new student ID, a name and some department ID.

The kind of constraint we are worried about, the kind of invariant, is something where the person who designed this database might say that you are not allowed to add a department ID that is nonexistent. What that means, with these two tables, is that you should not allow any transaction to write a department ID into the student table that is not already in the department table. So 43 and 25 might be in the department table, but a number that is not in that table should not be added to the student table. The transaction processing system will, in fact, not allow a transaction to commit if it is writing a value that is not in the other table. For those familiar with relational databases, call these two tables T1 and T2: department ID might be the primary key of T2 and be defined as what is called a foreign key in T1, which means you are not allowed to add a value to the foreign key column if it is not already in the table where that same column is a primary key. There are rules like this in most relational database systems, and a variety of rules like this all have to do with maintaining the integrity of the data that you add.
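As a rough sketch of the check just described, here is what a commit-time foreign-key test might look like; the in-memory tables and function names are made up for illustration, and a real relational system would express the same thing declaratively with a FOREIGN KEY constraint rather than hand-written code.

    # Hypothetical in-memory stand-ins for the two tables: T1 holds
    # (student_id, name, dept_id) rows and T2 maps dept_id -> department name.
    students = [(1, "Alice", 43), (2, "Bob", 25)]
    departments = {43: "EECS", 25: "Mathematics"}

    def check_integrity(new_rows, departments):
        """Commit-time check of the foreign-key invariant: every dept_id
        written into the student table must already be a key of T2."""
        for student_id, name, dept_id in new_rows:
            if dept_id not in departments:
                return False          # invariant violated: force an abort
        return True

    def commit(new_rows):
        if not check_integrity(new_rows, departments):
            raise RuntimeError("abort: unknown department ID")
        students.extend(new_rows)     # only now do the changes take effect

    commit([(3, "Carol", 43)])        # fine: department 43 exists
    # commit([(4, "Dave", 99)])       # would abort: 99 is not in T2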
Now, these rules have nothing to do with isolation. They have to do with atomicity, because the rules are typically checked at the commit point; until then anything could happen. So, right before you commit, there are these application-specific invariants on the data that you need to check. It has nothing to do with locks in particular; it sort of presumes atomicity, and on top of that it checks these application-specific rules.

And you can get quite sophisticated. Things like primary keys and foreign keys are checked by most transaction processing systems, but you could get much more sophisticated about these rules. For example, say you have a database storing employees and their salaries. You could have a rule that says any time an employee gets a raise, everybody else in the same peer group also gets some kind of raise, and you would not allow any transaction to commit that did not ensure that invariant holds. Checking such things can be quite difficult, and most systems do not do a really good job of it. The sets of rules they allow you to write are quite limited, because checking them is hard: when you are trying to commit a transaction you might have to check a large number of rules, and some of them could be both time-consuming and complicated. But the main point is that these rules are application-specific, and that is what defines consistency of the data that you have.

The more interesting case for consistency, and the thing that is going to occupy us for the rest of today and tomorrow, is consistency in distributed systems. In particular, when the same data gets distributed, typically for fault tolerance and for availability, to ensure that the data is available at different locations, you end up with consistency problems. And we have already seen a few examples of this.
One example is the Domain Name System, which maintains mappings between domain names and IP addresses. If you remember, in order to achieve availability and good performance, these mappings between DNS names and IP addresses were cached essentially on demand: whenever a name server on the Internet looked up a name, it cached the resulting mapping. So now you have to worry about whether the data cached somewhere out on the Internet is, in fact, the correct data, where correct is defined as the data being maintained by the primary name server.

If you think about what DNS did, it actually used a mechanism of expiration times to keep these caches consistent. What that means is that the only time you are guaranteed that the data in a cache is, in fact, the data stored at the primary name server for that name is when the expiration time runs out: the first access after the expiration time requires the name server to go back to the original primary name server and do a lookup of the name. The rest of the time you cannot actually be guaranteed that the data is consistent. In other words, you are not getting what is considered strong consistency.

What is strong consistency? One natural way to define what it means for data to be consistent in a distributed system is to say that any time any node does a read of some data, the read returns the result of the last write. That is one notion of consistency, and a system provides strong consistency if you can ensure that every read returns the result of the last write that was done on the data.

This is really hard to provide, because it typically means that the data is widely replicated or cached, and any time anybody changes the data you have to make sure that all of the copies get that change.
Even if you work really hard to invalidate all the entries when changes are made, there are small windows of vulnerability. In DNS, for example, even the first access you make to the server after the expiration time may not guarantee that the response you get back is the newest one, because the primary name server could send a response and, while it is traveling back to the person who made the query, the data could get changed at the primary name server. So it is really hard to guarantee this at all points in time in a distributed system. And it gets much harder when there are failures making certain copies unavailable, or, in the DNS case, making access to the primary unavailable.

In practice, in most systems, the kind of consistency people try to get is eventual consistency, or they try to approximate strong consistency in some other way. Eventual consistency is a somewhat looser notion. What it says is that there might be periods of time where the copies are not consistent, but the system is doing work in the background to make sure that all of the copies of a given data item end up the same and reflect the last write to that data. Again, the notion of eventual consistency depends a lot on the application, so to specify it precisely you have to look at it in the context of the application; different applications have different notions of consistency and eventual consistency.

So we looked at DNS as one example. Another example, which you might be familiar with, is Web caches. Your browser has a cache in it, for example, and there might be Web caches located elsewhere in the network that capture your requests. People use Web caches to save latency or to avoid slamming a Web server that might otherwise get overloaded. The semantics here are usually that you do not just return stale data.
If the data has changed on the Web server, you actually want to return the current data to the client. The way this is normally done is for the client, or for any cache, to first check with the Web server to see if the data has changed since the version that was cached. Let's say the cache went to the Web server at 9:00 in the morning, because it did not have the data, and got some data back; the data has a timestamp on it. The next time somebody makes a request to the cache, the cache does not just return the data immediately. What it usually does is go to the Web server and ask whether the data has changed since 9:00 in the morning. If the data has changed since then, it retrieves the new data from the server; if not, it goes ahead and returns the cached data to the client. This is also called "If-Modified-Since", because the cache is telling the server: send me the data if it has been modified since the last version I know I have. A convenient way to represent that is as a timestamp; it is just a version of the data.
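The HTTP header really is called If-Modified-Since. Here is a minimal sketch of a cache revalidating with it; the URL and the cached entry are made-up placeholders, and a 304 Not Modified response means the cached copy is still good.

    import urllib.request
    import urllib.error
    from email.utils import formatdate

    # Hypothetical cached entry: the body we fetched earlier and when we got it.
    cached = {
        "url": "http://example.com/page.html",
        "body": b"...cached copy...",
        "fetched_at": 1700000000.0,        # seconds since the epoch
    }

    def get(entry):
        req = urllib.request.Request(entry["url"])
        # "Send me the data only if it changed since the version I have."
        req.add_header("If-Modified-Since",
                       formatdate(entry["fetched_at"], usegmt=True))
        try:
            with urllib.request.urlopen(req) as resp:
                entry["body"] = resp.read()    # server sent a newer version
                return entry["body"]
        except urllib.error.HTTPError as err:
            if err.code == 304:                # 304 Not Modified
                return entry["body"]           # cached copy is still good
            raise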
752 00:38:52,120 --> 00:38:55,720 So in the presence of failures, say network partitions 753 00:38:55,720 --> 00:38:58,220 or failures of nodes, it turns out 754 00:38:58,220 --> 00:39:01,660 to be really hard to guarantee both high availability 755 00:39:01,660 --> 00:39:06,190 and strong consistency. 756 00:39:06,190 --> 00:39:09,410 As a sort of trivial example of this, 757 00:39:09,410 --> 00:39:11,080 suppose you have three copies of the data 758 00:39:11,080 --> 00:39:16,400 and you were not very careful about figuring out your 759 00:39:16,400 --> 00:39:17,160 write protocol. 760 00:39:17,160 --> 00:39:19,550 Let's say that your write protocol was to write 761 00:39:19,550 --> 00:39:22,090 to one copy, your read protocol 762 00:39:22,090 --> 00:39:24,360 was to just read from some other copy, 763 00:39:24,360 --> 00:39:26,780 and some process in the background 764 00:39:26,780 --> 00:39:29,370 propagated the data from the copy 765 00:39:29,370 --> 00:39:31,880 that the client wrote to, to all of the other copies. 766 00:39:31,880 --> 00:39:34,040 Then, during periods of time when the network 767 00:39:34,040 --> 00:39:36,350 was partitioned, you could end up in a situation 768 00:39:36,350 --> 00:39:40,390 where the version that a given client is reading 769 00:39:40,390 --> 00:39:44,290 is not actually the last version of the data that was written. 770 00:39:44,290 --> 00:39:47,970 In fact, if you have started thinking about DP2, Design Project 2, 771 00:39:47,970 --> 00:39:54,150 really, one part of it gets at how you manage replicated data. 772 00:39:54,150 --> 00:40:00,220 For example, when the utility that does the archiving 773 00:40:00,220 --> 00:40:02,040 publishes data, one approach it might take 774 00:40:02,040 --> 00:40:05,460 is to publish the data that it wants 775 00:40:05,460 --> 00:40:10,140 to archive to all of the copies, to all of the replica machines. 776 00:40:10,140 --> 00:40:15,490 And the read protocol might be to read from one of them. 777 00:40:15,490 --> 00:40:18,910 Now, if you ensure that the write protocol finishes 778 00:40:18,910 --> 00:40:22,610 and succeeds only when all of the replica machines 779 00:40:22,610 --> 00:40:26,800 are updated, then you can try to get at a decent version 780 00:40:26,800 --> 00:40:28,089 of consistency. 781 00:40:28,089 --> 00:40:30,380 But you need to be able to do that when failures occur. 782 00:40:30,380 --> 00:40:32,440 The network might fail or nodes might fail, 783 00:40:32,440 --> 00:40:37,010 and you need to figure out how to do that. 784 00:40:37,010 --> 00:40:38,990 But you might decide that writing to all N copies 785 00:40:38,990 --> 00:40:43,260 and reading from one copy is difficult or has high overhead, 786 00:40:43,260 --> 00:40:47,002 so you might think about ways of writing to certain subsets, 787 00:40:47,002 --> 00:40:48,460 writing to a subset of the machines 788 00:40:48,460 --> 00:40:50,060 and reading from a subset of the machines, 789 00:40:50,060 --> 00:40:51,685 to try to see whether you could come up 790 00:40:51,685 --> 00:40:58,052 with ways to get a consistent version of the data. 791 00:40:58,052 --> 00:41:00,510 Or you might decide that the right way to solve the problem 792 00:41:00,510 --> 00:41:02,960 is not to try to achieve really strong consistency 793 00:41:02,960 --> 00:41:06,750 in all situations but to relax the kind of consistency 794 00:41:06,750 --> 00:41:11,324 you want and maybe provide a different version of the semantics.
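Here is a toy sketch, in Python, of the kind of read and write protocols being described. The replica objects, version numbers, and quorum sizes are illustrative assumptions, not the DP2 specification; the sketch just shows "write to all N, read from one" as a special case of the more general rule that a write quorum of size W and a read quorum of size R overlap whenever W + R > N.

```python
# Toy model of replicated writes and reads, to make the tradeoff concrete.
# The replica set, versioning scheme, and quorum sizes are illustrative.

class Replica:
    def __init__(self, name):
        self.name = name
        self.version = 0        # version number of the value it holds
        self.value = None
        self.up = True          # a failed or partitioned replica is "down"

def write(replicas, value, new_version, w):
    """Send the write to every reachable replica; succeed only with >= w acks."""
    acks = 0
    for r in replicas:
        if r.up:
            r.version, r.value = new_version, value
            acks += 1
    return acks >= w            # otherwise the write fails (or must be retried)

def read(replicas, r_quorum):
    """Read from r_quorum reachable replicas and return the highest-versioned value."""
    answers = [(r.version, r.value) for r in replicas if r.up][:r_quorum]
    if len(answers) < r_quorum:
        raise RuntimeError("not enough replicas reachable")
    return max(answers, key=lambda a: a[0])[1]

# With N replicas, choosing W and R so that W + R > N guarantees that every
# read quorum overlaps every write quorum, so a read sees the latest write.
# "Write to all, read from one" is the special case W = N, R = 1.
replicas = [Replica(n) for n in ("a", "b", "c")]
assert write(replicas, "v1", new_version=1, w=2)
print(read(replicas, r_quorum=2))   # -> "v1"
```

The tradeoff described above shows up directly in this sketch: if some replicas are down or partitioned away, a large write quorum makes writes unavailable, while a small one risks reads that miss the latest write.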
795 00:41:11,324 --> 00:41:13,240 As long as you are precise about the semantics 796 00:41:13,240 --> 00:41:14,910 that your system provides, that might 797 00:41:14,910 --> 00:41:16,610 be a different but reasonable solution 798 00:41:16,610 --> 00:41:17,234 to the problem. 799 00:41:29,520 --> 00:41:32,670 So one interesting place in which people 800 00:41:32,670 --> 00:41:34,700 achieve reasonably strong consistency 801 00:41:34,700 --> 00:41:39,070 is in tightly coupled distributed systems -- not distributed 802 00:41:39,070 --> 00:41:43,030 systems spread across the Internet where the network could 803 00:41:43,030 --> 00:41:46,540 arbitrarily fail, but more tightly coupled systems 804 00:41:46,540 --> 00:41:48,750 like a multiprocessor. 805 00:41:51,490 --> 00:41:54,337 If you have a computer that has many processors -- 806 00:42:04,080 --> 00:42:07,460 And the abstraction here for this multiprocessor 807 00:42:07,460 --> 00:42:09,010 is that of shared memory. 808 00:42:09,010 --> 00:42:12,890 You actually have memory sitting outside here, 809 00:42:12,890 --> 00:42:15,300 and these processors are reading and writing 810 00:42:15,300 --> 00:42:19,070 data to this memory. 811 00:42:19,070 --> 00:42:22,530 The latency to get to memory and back is high. 812 00:42:22,530 --> 00:42:24,740 So, as you know, processors have caches on them. 813 00:42:31,810 --> 00:42:38,640 As long as the memory locations that are being written and read 814 00:42:38,640 --> 00:42:40,580 are not shared between them, these caches 815 00:42:40,580 --> 00:42:42,774 function just fine. 816 00:42:42,774 --> 00:42:44,440 And when there is an instruction running 817 00:42:44,440 --> 00:42:48,180 on one of these processors that wants to access some memory 818 00:42:48,180 --> 00:42:50,730 location, you can just read and write from the cache, 819 00:42:50,730 --> 00:42:53,440 so things just work out. 820 00:42:53,440 --> 00:42:57,120 The problem arises when there is a memory location being 821 00:42:57,120 --> 00:43:01,170 read here that actually was previously 822 00:43:01,170 --> 00:43:03,270 written by this other processor. 823 00:43:03,270 --> 00:43:06,090 And, if you read it here, then you 824 00:43:06,090 --> 00:43:07,990 might get an old version of the data. 825 00:43:07,990 --> 00:43:12,520 And if you think of memory, virtual memory, 826 00:43:12,520 --> 00:43:15,090 as the basic abstraction, then these are bad semantics 827 00:43:15,090 --> 00:43:18,210 because your programs wouldn't function the same way as they 828 00:43:18,210 --> 00:43:21,220 did when you just had one processor 829 00:43:21,220 --> 00:43:23,000 or when you didn't have the caches at all 830 00:43:23,000 --> 00:43:24,500 and you just went directly to memory 831 00:43:24,500 --> 00:43:26,650 from multiple processors. 832 00:43:26,650 --> 00:43:31,740 The question is how do you know whether the data in a cache 833 00:43:31,740 --> 00:43:34,060 is good or bad? 834 00:43:34,060 --> 00:43:37,940 Now, checking on every access, like in the Web caches case, 835 00:43:37,940 --> 00:43:39,610 whether the data has changed is not 836 00:43:39,610 --> 00:43:42,050 going to be useful here because the amount of work 837 00:43:42,050 --> 00:43:45,210 it takes to check something is about the same as the amount 838 00:43:45,210 --> 00:43:47,030 of work it takes to read or write something, 839 00:43:47,030 --> 00:43:51,350 because you have already taken the latency hit of going to memory. 840 00:43:51,350 --> 00:43:54,570 So that approach is not going to work.
841 00:43:54,570 --> 00:43:58,980 The solution that is followed in many systems 842 00:43:58,980 --> 00:44:00,300 is to use two ideas. 843 00:44:00,300 --> 00:44:03,850 The first idea is that of a "Write-Thru Cache". 844 00:44:03,850 --> 00:44:05,710 What a write-thru cache says is that if there 845 00:44:05,710 --> 00:44:12,580 is a write that happens here, a store instruction, 846 00:44:12,580 --> 00:44:14,400 the cache gets updated. 847 00:44:14,400 --> 00:44:16,730 But, in addition to the cache getting updated, 848 00:44:16,730 --> 00:44:18,960 the data also gets written through 849 00:44:18,960 --> 00:44:22,300 on the bus to the memory location here. 850 00:44:22,300 --> 00:44:28,150 So that is the first idea, to use a write-thru cache. 851 00:44:38,390 --> 00:44:43,570 The second idea is that, because this is a shared bus, all of these nodes 852 00:44:43,570 --> 00:44:45,339 can actually snoop on the bus and see 853 00:44:45,339 --> 00:44:47,880 what activity there is on it. 854 00:44:47,880 --> 00:44:50,010 It is a very special kind of network, as I said. 855 00:44:50,010 --> 00:44:51,640 You cannot apply this idea in general. 856 00:44:51,640 --> 00:44:53,400 It works here because it is a bus 857 00:44:53,400 --> 00:44:55,252 and nothing fails, 858 00:44:55,252 --> 00:44:56,710 or the assumption is that nothing fails, 859 00:44:56,710 --> 00:45:00,620 so everybody can check to see what is going on on the bus. 860 00:45:00,620 --> 00:45:02,720 And any time there is any activity on the bus that 861 00:45:02,720 --> 00:45:06,720 corresponds to something that is stored in any node's cache, 862 00:45:06,720 --> 00:45:08,210 you can do two things. 863 00:45:08,210 --> 00:45:10,830 You can invalidate that cache entry, 864 00:45:10,830 --> 00:45:14,590 or you can also see what the update is, go ahead 865 00:45:14,590 --> 00:45:17,200 and look at the change that was being made, 866 00:45:17,200 --> 00:45:19,610 and update your own cache. 867 00:45:19,610 --> 00:45:22,970 And this idea is sometimes called a "Snoopy Cache" 868 00:45:22,970 --> 00:45:24,610 because you have these caches that 869 00:45:24,610 --> 00:45:29,380 are snooping on activity that is occurring in your system. 870 00:45:32,100 --> 00:45:34,880 And this is one way in which you can achieve something that 871 00:45:34,880 --> 00:45:37,000 resembles strong consistency. 872 00:45:37,000 --> 00:45:39,490 But it actually turns out, if you think hard about it, 873 00:45:39,490 --> 00:45:41,437 that a precise version of strong consistency 874 00:45:41,437 --> 00:45:42,520 is really hard to achieve. 875 00:45:42,520 --> 00:45:44,780 In fact, it is very, very hard to even define 876 00:45:44,780 --> 00:45:49,100 what it means for any read to see 877 00:45:49,100 --> 00:45:51,000 the result of the last write because, when 878 00:45:51,000 --> 00:45:54,190 you have multiple people reading and writing things 879 00:45:54,190 --> 00:45:56,406 and you get down to the instruction level, 880 00:45:56,406 --> 00:45:58,280 it turns out to be really hard to even define 881 00:45:58,280 --> 00:45:59,370 the right semantics. 882 00:45:59,370 --> 00:46:04,010 A lot of people are working on this kind of thing.
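As a rough illustration of how these two ideas fit together, here is a toy simulation in Python of write-thru caches snooping on a shared bus. The classes, the invalidate-versus-update flag, and the lossless broadcast bus are assumptions made for the sketch, not a description of real coherence hardware.

```python
# Toy simulation of write-thru caches with bus snooping.
# Real cache-coherence hardware is far more involved; this only
# illustrates the two ideas from the lecture.

class Memory:
    def __init__(self):
        self.cells = {}

class Bus:
    def __init__(self):
        self.caches = []
    def attach(self, cache):
        self.caches.append(cache)
    def broadcast(self, sender, addr, value):
        for c in self.caches:
            if c is not sender:          # every other cache snoops the write
                c.snoop(addr, value)

class SnoopyCache:
    def __init__(self, memory, bus, invalidate=True):
        self.memory = memory
        self.bus = bus
        self.lines = {}                  # address -> cached value
        self.invalidate = invalidate     # invalidate on snoop, or update in place
        bus.attach(self)

    def load(self, addr):
        if addr not in self.lines:       # cache miss: go to memory
            self.lines[addr] = self.memory.cells.get(addr, 0)
        return self.lines[addr]

    def store(self, addr, value):
        self.lines[addr] = value                 # update own cache
        self.memory.cells[addr] = value          # write through to memory
        self.bus.broadcast(self, addr, value)    # and let everyone else see it

    def snoop(self, addr, value):
        if addr in self.lines:
            if self.invalidate:
                del self.lines[addr]     # drop the stale copy
            else:
                self.lines[addr] = value # or apply the update directly

mem, bus = Memory(), Bus()
p0, p1 = SnoopyCache(mem, bus), SnoopyCache(mem, bus)
p0.load(0x10)          # p0 caches location 0x10
p1.store(0x10, 42)     # p1 writes it: write-thru plus snoop invalidates p0's copy
print(p0.load(0x10))   # p0 misses, re-reads from memory -> 42
```

In the invalidating variant, the snooping cache simply drops its stale copy and re-reads from memory on the next load; in the updating variant, it applies the value it saw go by on the bus.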
883 00:46:04,010 --> 00:46:06,100 But this is a little bit of a special case 884 00:46:06,100 --> 00:46:07,850 because this kind of solution applies only 885 00:46:07,850 --> 00:46:11,240 in a very tightly coupled system where you do not really 886 00:46:11,240 --> 00:46:15,234 have failures and everybody can listen to everything else. 887 00:46:15,234 --> 00:46:16,900 But it is interesting to note that there 888 00:46:16,900 --> 00:46:18,400 are cases when you can achieve it, 889 00:46:18,400 --> 00:46:19,550 and that is why this is interesting: 890 00:46:19,550 --> 00:46:20,660 it is practically useful. 891 00:46:26,550 --> 00:46:31,780 So the main thing about Design Project 2 892 00:46:31,780 --> 00:46:33,910 that relates to the consistency discussion, 893 00:46:33,910 --> 00:46:36,620 or at least one part of it, 894 00:46:36,620 --> 00:46:40,340 in case it was not clear from the description of the project, 895 00:46:40,340 --> 00:46:43,300 is for you to think about what kind of consistency you want 896 00:46:43,300 --> 00:46:46,950 and to come up with ways to manage these different replicas. 897 00:46:46,950 --> 00:46:49,940 We are going to stop here. 898 00:46:49,940 --> 00:46:52,600 Next week we will talk about multi-site atomicity. 899 00:46:52,600 --> 00:46:54,220 Tomorrow's recitation is on a system 900 00:46:54,220 --> 00:46:58,130 called Unison which also looks at consistency when you have 901 00:46:58,130 --> 00:47:01,000 mobile computers that are trying to synchronize data 902 00:47:01,000 --> 00:47:02,850 with servers.