1 00:00:00,040 --> 00:00:02,480 The following content is provided under a Creative 2 00:00:02,480 --> 00:00:04,010 Commons license. 3 00:00:04,010 --> 00:00:06,340 Your support will help MIT OpenCourseWare 4 00:00:06,340 --> 00:00:10,690 continue to offer high quality educational resources for free. 5 00:00:10,690 --> 00:00:13,320 To make a donation or view additional materials 6 00:00:13,320 --> 00:00:17,035 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,035 --> 00:00:17,660 at ocw.mit.edu. 8 00:00:21,264 --> 00:00:22,430 SRINIVAS DEVADAS: All right. 9 00:00:22,430 --> 00:00:23,672 Good morning, everyone. 10 00:00:23,672 --> 00:00:26,060 And let's get started. 11 00:00:26,060 --> 00:00:30,130 Today's lecture is about a randomized data structure 12 00:00:30,130 --> 00:00:32,200 called the skip list. 13 00:00:32,200 --> 00:00:37,390 And it's a data structure that, obviously because it's 14 00:00:37,390 --> 00:00:41,360 randomized, we'd have to do a probabilistic analysis for. 15 00:00:41,360 --> 00:00:44,740 And we're going to sort of raise the stakes here a little bit 16 00:00:44,740 --> 00:00:48,010 with respect to our expectation-- 17 00:00:48,010 --> 00:00:52,640 the pun intended of this data structure-- in the sense 18 00:00:52,640 --> 00:00:57,410 that we're not going to be happy with just doing an expected 19 00:00:57,410 --> 00:01:01,290 value analysis or to get what the expectation is 20 00:01:01,290 --> 00:01:05,129 of the search complexity in a skip list. 21 00:01:05,129 --> 00:01:10,030 We're going to introduce this notion with high probability, 22 00:01:10,030 --> 00:01:13,630 which is a stronger notion than just giving you the expected 23 00:01:13,630 --> 00:01:19,020 value or the expectation for the complexity of a search 24 00:01:19,020 --> 00:01:20,090 algorithm. 
25 00:01:20,090 --> 00:01:24,690 And we're going to prove that under this notion, 26 00:01:24,690 --> 00:01:28,500 that search has a particular complexity 27 00:01:28,500 --> 00:01:30,520 with high probability. 28 00:01:30,520 --> 00:01:34,480 So we'll get to the with high probability part a little bit 29 00:01:34,480 --> 00:01:36,330 later in the lecture, but we're just 30 00:01:36,330 --> 00:01:41,470 going to start off doing some cool data structure design, 31 00:01:41,470 --> 00:01:45,210 I guess, [INAUDIBLE] pointing to the skip list. 32 00:01:45,210 --> 00:01:48,810 The skip list is a relatively young data structure 33 00:01:48,810 --> 00:01:54,290 invented by a guy called Bill Pugh in 1989, 34 00:01:54,290 --> 00:01:56,597 so not much older than you guys. 35 00:01:59,470 --> 00:02:04,240 It's relatively easy to implement as you'll see. 36 00:02:04,240 --> 00:02:07,480 I won't really claim that, but hopefully you'll 37 00:02:07,480 --> 00:02:11,700 be convinced by the time you're done describing the structure. 38 00:02:11,700 --> 00:02:18,004 Especially in comparison to balanced trees. 39 00:02:20,670 --> 00:02:24,860 And we can do a comparison after we do our analysis of the data 40 00:02:24,860 --> 00:02:30,290 structure as to what the complexity comparisons are 41 00:02:30,290 --> 00:02:33,920 for search and insert when you take a skip list 42 00:02:33,920 --> 00:02:37,310 and compare it to an AVL tree, for example, 43 00:02:37,310 --> 00:02:41,820 or a red black tree, et cetera. 44 00:02:41,820 --> 00:02:44,240 In general, when we have a data structure, 45 00:02:44,240 --> 00:02:46,840 we want it to be dynamic. 46 00:02:46,840 --> 00:02:53,530 The skip list maintains a dynamic set. 47 00:02:53,530 --> 00:02:55,970 What that means is not only do you 48 00:02:55,970 --> 00:02:58,810 want to search on it-- obviously it's 49 00:02:58,810 --> 00:03:03,170 uninteresting to have a static data structure and do a search. 
50 00:03:03,170 --> 00:03:05,750 You want to be able to change it, want to be 51 00:03:05,750 --> 00:03:08,390 able to insert values into it. 52 00:03:08,390 --> 00:03:10,600 There's a complexity of insert to worry about. 53 00:03:10,600 --> 00:03:13,170 You want to be able to delete values. 54 00:03:13,170 --> 00:03:15,110 And the richness of the data structure 55 00:03:15,110 --> 00:03:17,800 comes from the operations and the augmentations 56 00:03:17,800 --> 00:03:22,440 you can do on it, and skip lists are no exception to that. 57 00:03:22,440 --> 00:03:25,870 So if you want to maintain a dynamic set of n elements, 58 00:03:25,870 --> 00:03:29,320 you obviously know a ton of data structures to do this, 59 00:03:29,320 --> 00:03:32,630 each of which has different characteristics. 60 00:03:32,630 --> 00:03:37,050 And this is, if you ignore hash tables, 61 00:03:37,050 --> 00:03:44,280 this is your first real randomized data structure, 62 00:03:44,280 --> 00:03:48,740 if you're just taking 6.006 and this class; you 63 00:03:48,740 --> 00:03:51,875 might have seen randomized structures in other classes. 64 00:03:55,050 --> 00:04:00,320 So we're going to try and do this in order log n time. 65 00:04:00,320 --> 00:04:04,190 As you know with balanced binary trees, 66 00:04:04,190 --> 00:04:07,470 you can do things in order log n time, a ton of things, 67 00:04:07,470 --> 00:04:12,070 pretty much everything that is interesting. 68 00:04:12,070 --> 00:04:16,010 And this, given that it's randomized, 69 00:04:16,010 --> 00:04:19,360 it's a relatively easy analysis to show 70 00:04:19,360 --> 00:04:25,210 that the expectation or the expected value of a search 71 00:04:25,210 --> 00:04:28,750 would be order log n expected time.
72 00:04:28,750 --> 00:04:31,550 But we're going to, as I said, raise the stakes, 73 00:04:31,550 --> 00:04:36,020 and we're going to spend a ton of time 74 00:04:36,020 --> 00:04:40,610 in the second half of this showing the with-high-probability 75 00:04:40,610 --> 00:04:43,850 result. And that's a stronger result 76 00:04:43,850 --> 00:04:47,790 than just saying that search takes expected order log n 77 00:04:47,790 --> 00:04:49,120 time. 78 00:04:49,120 --> 00:04:51,720 All right, so that's the context. 79 00:04:51,720 --> 00:04:55,720 You can think of a skip list as beginning 80 00:04:55,720 --> 00:05:00,800 with a simple linked list. 81 00:05:00,800 --> 00:05:08,640 So if we have one linked list and that linked list-- let's 82 00:05:08,640 --> 00:05:16,610 first think of this as being unsorted. 83 00:05:16,610 --> 00:05:19,660 So suppose I have a linked list which is unsorted 84 00:05:19,660 --> 00:05:23,650 and I want to search for a particular value in this linked 85 00:05:23,650 --> 00:05:24,380 list. 86 00:05:24,380 --> 00:05:29,320 And we can assume that this is a doubly-linked list, 87 00:05:29,320 --> 00:05:32,640 so the arrows go both ways. 88 00:05:32,640 --> 00:05:38,660 You have a pointer, let's say, just to the first element. 89 00:05:38,660 --> 00:05:41,920 So if you have a list that's unsorted 90 00:05:41,920 --> 00:05:45,030 and you want to search for an element, 91 00:05:45,030 --> 00:05:47,270 you would want to do a membership query. 92 00:05:47,270 --> 00:05:50,906 If there's n elements, the complexity is? 93 00:05:50,906 --> 00:05:51,860 AUDIENCE: Order n. 94 00:05:51,860 --> 00:05:53,140 SRINIVAS DEVADAS: Order n. 95 00:05:53,140 --> 00:06:00,370 So in a linked list, search takes order n time. 96 00:06:00,370 --> 00:06:05,630 Now let's go ahead and say that we are sorting this list, 97 00:06:05,630 --> 00:06:07,930 so it's a sorted linked list. 98 00:06:07,930 --> 00:06:17,190 So your values here, 14, 23, 34, 42, 50, 59.
99 00:06:17,190 --> 00:06:20,850 They're sorted in ascending order. 100 00:06:20,850 --> 00:06:25,040 You still only have a pointer to the front of the list 101 00:06:25,040 --> 00:06:27,710 and it's a doubly-linked list, what 102 00:06:27,710 --> 00:06:34,090 is the complexity of search in the sorted linked list? 103 00:06:34,090 --> 00:06:35,570 AUDIENCE: Log n. 104 00:06:35,570 --> 00:06:36,810 SRINIVAS DEVADAS: Log n. 105 00:06:36,810 --> 00:06:39,200 Oh, I wanted to hear that. 106 00:06:39,200 --> 00:06:40,724 Because it is? 107 00:06:40,724 --> 00:06:41,474 AUDIENCE: Order n. 108 00:06:41,474 --> 00:06:42,520 SRINIVAS DEVADAS: It's order n. 109 00:06:42,520 --> 00:06:44,400 log n is-- yeah, that was a trick question. 110 00:06:49,310 --> 00:06:51,910 Because I liked that answer, the person who said log n 111 00:06:51,910 --> 00:06:52,550 gets a Frisbee. 112 00:06:55,790 --> 00:06:56,800 This person won't admit. 113 00:06:56,800 --> 00:06:58,270 [LAUGHTER] 114 00:06:58,270 --> 00:06:59,210 Oh, it was you. 115 00:06:59,210 --> 00:07:00,190 OK, all right. 116 00:07:00,190 --> 00:07:02,200 There you go. 117 00:07:02,200 --> 00:07:03,940 All right. 118 00:07:03,940 --> 00:07:12,960 So log n would imply that you have random access. 119 00:07:12,960 --> 00:07:16,090 If you have an array that's sorted and you can go 120 00:07:16,090 --> 00:07:19,660 A of i, and you can go A of i divided by 2, 121 00:07:19,660 --> 00:07:23,150 or A of 2i and you can go directly to that element, 122 00:07:23,150 --> 00:07:27,650 then you can do binary search and you can get a log n, 123 00:07:27,650 --> 00:07:29,010 order log n. 124 00:07:29,010 --> 00:07:33,210 But here, the sorting actually doesn't help you 125 00:07:33,210 --> 00:07:38,200 with respect to the search simply because you 126 00:07:38,200 --> 00:07:40,890 have to start from the beginning, 127 00:07:40,890 --> 00:07:43,430 from the front of the list, and you've got to keep walking.
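The difference between random access and walking from the front can be sketched as follows. This is a minimal illustration and not code from the lecture; the names `Node`, `from_sorted`, `binary_search`, and `walk_search` are assumptions introduced here.

```python
class Node:
    """One cell of a singly linked list (hypothetical helper)."""
    def __init__(self, key, next=None):
        self.key = key
        self.next = next

def from_sorted(keys):
    """Build a linked list from sorted keys and return its head."""
    head = None
    for key in reversed(keys):
        head = Node(key, head)
    return head

def binary_search(array, target):
    """Return (found, probes). Relies on random access: array[mid] is one step."""
    lo, hi, probes = 0, len(array) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if array[mid] == target:
            return True, probes
        if array[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, probes

def walk_search(head, target):
    """Return (found, nodes_visited). The only way forward is node.next."""
    node, visited = head, 0
    while node is not None:
        visited += 1
        if node.key == target:
            return True, visited
        node = node.next
    return False, visited

keys = [14, 23, 34, 42, 50, 59]
print(binary_search(keys, 59))             # few probes, thanks to indexing
print(walk_search(from_sorted(keys), 59))  # visits every node up to 59
```

On the six values from the board, the array version reaches 59 in three probes, while the list version has to visit all six nodes.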
128 00:07:43,430 --> 00:07:47,390 The only place that it helps you is that if you know it's sorted 129 00:07:47,390 --> 00:07:50,850 and you're looking for 37, you can 130 00:07:50,850 --> 00:07:54,840 stop after you see 42, right? 131 00:07:54,840 --> 00:07:57,030 That's pretty much the only place that it helps you. 132 00:07:57,030 --> 00:07:58,740 But it's still order n because that 133 00:07:58,740 --> 00:08:00,650 could happen-- on average is going 134 00:08:00,650 --> 00:08:04,410 to happen halfway through the list for a given membership 135 00:08:04,410 --> 00:08:05,230 query. 136 00:08:05,230 --> 00:08:10,350 So it's still order n for a sorted link list. 137 00:08:10,350 --> 00:08:15,110 But now let's say that we had two sorted link lists. 138 00:08:20,480 --> 00:08:24,710 And how are these two link lists structured? 139 00:08:24,710 --> 00:08:27,080 Well, they're structured in a certain way, 140 00:08:27,080 --> 00:08:32,270 and let me draw our canonical example for skip list 141 00:08:32,270 --> 00:08:34,090 that I'm going to keep coming back to. 142 00:08:34,090 --> 00:08:39,974 So I won't erase this, but I'll draw one out-- 1, 2, 3, 4, 5, 143 00:08:39,974 --> 00:08:42,605 6, 7, 8, 9-- 9 elements. 144 00:08:56,060 --> 00:09:01,450 So that's my first list which is sorted, 145 00:09:01,450 --> 00:09:18,110 and so I have 14, 23, 34, 42, 50, 59, 66, 72, and 79. 146 00:09:18,110 --> 00:09:20,590 What I'm going to have now is another list sort 147 00:09:20,590 --> 00:09:22,830 of on top of this. 148 00:09:22,830 --> 00:09:31,660 I can move from top to bottom, et cetera. 149 00:09:31,660 --> 00:09:36,230 But I'm not going to have elements 150 00:09:36,230 --> 00:09:38,733 on top of each bottom element. 151 00:09:42,090 --> 00:09:44,650 By convention, I'm going to have elements 152 00:09:44,650 --> 00:09:47,490 on top of the first element, regardless 153 00:09:47,490 --> 00:09:48,870 of how many lists I have. 
154 00:09:48,870 --> 00:09:50,840 We only have two at this point. 155 00:09:50,840 --> 00:09:54,100 And so I see a 14, which is exactly the same element 156 00:09:54,100 --> 00:09:57,280 duplicated up on the top list. 157 00:09:57,280 --> 00:09:59,970 And that list is also sorted, but I 158 00:09:59,970 --> 00:10:04,340 won't have all of the elements in the top list. 159 00:10:04,340 --> 00:10:06,860 I'm just picking a couple here. 160 00:10:06,860 --> 00:10:11,770 So I've got 34, 42-- they're adjacent here-- 161 00:10:11,770 --> 00:10:16,070 and then I go all the way up to 72, 162 00:10:16,070 --> 00:10:18,740 and I duplicate it, et cetera. 163 00:10:22,100 --> 00:10:24,227 Now, this looks kind of random. 164 00:10:24,227 --> 00:10:25,560 Anybody recognize these numbers? 165 00:10:31,460 --> 00:10:36,180 No one from the great City of New York? 166 00:10:36,180 --> 00:10:36,760 No? 167 00:10:36,760 --> 00:10:37,290 Yup, yup. 168 00:10:37,290 --> 00:10:38,540 AUDIENCE: On the subway stops? 169 00:10:38,540 --> 00:10:41,520 SRINIVAS DEVADAS: Yeah, subway stops on the Seventh Avenue 170 00:10:41,520 --> 00:10:42,465 Express Line. 171 00:10:49,640 --> 00:10:53,910 So this is exactly the notion of a skip list, the fact 172 00:10:53,910 --> 00:10:57,152 that you have-- could you stand up? 173 00:10:59,992 --> 00:11:02,970 Great. 174 00:11:02,970 --> 00:11:04,520 All right. 175 00:11:04,520 --> 00:11:09,210 So the notion here is that you don't 176 00:11:09,210 --> 00:11:14,370 have to make a lot of stops if you know you have to go far. 177 00:11:14,370 --> 00:11:20,340 So if you want to go from 14th Street to 72nd Street, 178 00:11:20,340 --> 00:11:22,960 you just take the express line. 179 00:11:22,960 --> 00:11:27,426 But if you want to go to 66th Street, what would you do? 180 00:11:27,426 --> 00:11:30,830 AUDIENCE: Go to 72nd and then go back. 181 00:11:30,830 --> 00:11:33,710 SRINIVAS DEVADAS: Well, that's one way. 182 00:11:33,710 --> 00:11:34,790 That's one way. 
183 00:11:34,790 --> 00:11:37,156 That's not the way I wanted. 184 00:11:37,156 --> 00:11:38,780 The way we're going to do this is we're 185 00:11:38,780 --> 00:11:40,340 not going to overshoot. 186 00:11:40,340 --> 00:11:43,360 So we want to minimize distance, let's say. 187 00:11:43,360 --> 00:11:46,240 So our secondary thing is going to be 188 00:11:46,240 --> 00:11:49,730 minimizing distance travel. 189 00:11:49,730 --> 00:11:53,610 And so you're going to pop up the express line, go 190 00:11:53,610 --> 00:11:58,400 all the way to 42nd Street, and you're 191 00:11:58,400 --> 00:12:01,100 going to say if I go to the next stop on the Express Line, 192 00:12:01,100 --> 00:12:02,610 I'm going too far. 193 00:12:02,610 --> 00:12:05,630 And so you're going to pop down to the local line. 194 00:12:05,630 --> 00:12:09,240 So you can think of this as being link list L0 and link 195 00:12:09,240 --> 00:12:10,482 list L1. 196 00:12:10,482 --> 00:12:12,190 You're going to pop down, and then you're 197 00:12:12,190 --> 00:12:15,150 going to go to 66th Street. 198 00:12:15,150 --> 00:12:25,520 So search 66 will be going from 14 to 42 199 00:12:25,520 --> 00:12:34,730 on L1, and then from 42, let's just say that's walking. 200 00:12:34,730 --> 00:12:37,690 42 to 42, L1 to L0. 201 00:12:43,510 --> 00:12:47,310 And then 42 to 66 on L0. 202 00:12:50,360 --> 00:12:53,200 So that's the basic notion of a skip list. 203 00:12:53,200 --> 00:12:57,660 So you can see that it's really pretty simple. 204 00:12:57,660 --> 00:13:02,900 What we're going to do now is do two things. 205 00:13:02,900 --> 00:13:08,090 I want to think about this double-sorted list as a data 206 00:13:08,090 --> 00:13:12,590 structure in its own right before I dive into skip lists 207 00:13:12,590 --> 00:13:13,920 in general. 208 00:13:13,920 --> 00:13:21,990 And I want to analyze at some level, the best case situation 209 00:13:21,990 --> 00:13:24,870 for worst case complexity. 
210 00:13:24,870 --> 00:13:30,970 And by that I mean I want to structure the express stops 211 00:13:30,970 --> 00:13:33,490 in the best manner possible. 212 00:13:33,490 --> 00:13:35,660 These stops are very structured for passengers 213 00:13:35,660 --> 00:13:38,980 because they figured fancy stops on 42nd Avenue, whatever-- 214 00:13:38,980 --> 00:13:40,190 fancy stores. 215 00:13:40,190 --> 00:13:42,980 Everybody wants to go there and so on and so forth. 216 00:13:42,980 --> 00:13:46,890 So you have 34 pretty close to 42 because they're both 217 00:13:46,890 --> 00:13:49,720 popular destinations. 218 00:13:49,720 --> 00:13:53,250 But let's say that things were, I 219 00:13:53,250 --> 00:13:57,920 guess, more egalitarian and randomized, if you will. 220 00:13:57,920 --> 00:14:00,310 And what I want to do is I want to structure 221 00:14:00,310 --> 00:14:07,030 this double-sorted list so I get the best worst-case 222 00:14:07,030 --> 00:14:10,330 complexity for search. 223 00:14:10,330 --> 00:14:12,750 And so let's do that. 224 00:14:12,750 --> 00:14:20,019 And before I do that, let me write out the search algorithm, 225 00:14:20,019 --> 00:14:21,310 which is going to be important. 226 00:14:21,310 --> 00:14:24,960 I want you to assimilate this, keep it in your head 227 00:14:24,960 --> 00:14:27,000 because we're going to analyze search 228 00:14:27,000 --> 00:14:29,250 pretty much for the rest of the morning here. 229 00:14:29,250 --> 00:14:31,930 And so I'll write this down. 230 00:14:31,930 --> 00:14:37,280 You've got a sense of what it is based on what I just did here 231 00:14:37,280 --> 00:14:43,930 with this example of 66, but it's worth writing down. 232 00:14:43,930 --> 00:14:47,060 We're going to walk right in the top linked list, 233 00:14:47,060 --> 00:14:50,278 so this is simply for two linked lists, 234 00:14:50,278 --> 00:14:53,390 and we'll generalize at some point.
235 00:14:53,390 --> 00:15:00,260 So we want to walk right in the top linked list, L1, 236 00:15:00,260 --> 00:15:06,956 until going right would go too far. 237 00:15:11,060 --> 00:15:15,210 Now, there was this answer with 72, which I kind of dismissed. 238 00:15:15,210 --> 00:15:19,260 But there's no reason why you can't overshoot 239 00:15:19,260 --> 00:15:21,184 one stop and go backwards. 240 00:15:21,184 --> 00:15:23,100 It would just be a different search algorithm. 241 00:15:23,100 --> 00:15:25,380 It's not something we're going to analyze here. 242 00:15:25,380 --> 00:15:29,770 It turns out in analyzing that with high probability would 243 00:15:29,770 --> 00:15:32,840 be even more painful than the painful analysis 244 00:15:32,840 --> 00:15:34,290 we're going to do. 245 00:15:34,290 --> 00:15:35,355 So we won't go there. 246 00:15:39,690 --> 00:15:45,880 And then we walk down to the bottom list. 247 00:15:52,350 --> 00:15:54,940 And the bottom list we'll call L0. 248 00:15:57,930 --> 00:16:11,960 And walk right in L0 until the element is found or not. 249 00:16:11,960 --> 00:16:13,630 And you know that if you've overshot. 250 00:16:16,220 --> 00:16:21,130 So if you're looking here for route 67, when you get to 72 251 00:16:21,130 --> 00:16:24,540 here-- you've seen 66 and you get to 72 252 00:16:24,540 --> 00:16:27,450 and you're looking for 67, search fails. 253 00:16:27,450 --> 00:16:28,730 It stops and fails. 254 00:16:28,730 --> 00:16:30,880 Doesn't succeed in this case. 255 00:16:30,880 --> 00:16:34,870 So that's what we got for search. 256 00:16:34,870 --> 00:16:38,300 And that's our two linked list argument. 257 00:16:38,300 --> 00:16:45,040 Now, our analysis essentially says what I have is 258 00:16:45,040 --> 00:16:53,090 I'm walking right at the bottom list here, 259 00:16:53,090 --> 00:16:59,530 and my top list is L1, so I start with L1. 
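The two-list search just written down (walk right in L1 until going right would go too far, walk down to L0, then walk right in L0 until the element is found or passed) can be sketched as below. This is an illustrative sketch, not the lecture's code: `Node`, `link`, and `search_two_lists` are hypothetical names, and the sketch assumes the target is at least as large as the first key, which sits on both lists.

```python
class Node:
    """A list cell: `next` is the right neighbor, `down` is the same key on L0."""
    def __init__(self, key):
        self.key = key
        self.next = None
        self.down = None

def link(keys, below=None):
    """Link sorted keys into a list. `below` maps key -> node on the lower list.
    Returns (head, key -> node map)."""
    nodes, head, prev = {}, None, None
    for key in keys:
        node = Node(key)
        if below is not None:
            node.down = below[key]
        if prev is None:
            head = node
        else:
            prev.next = node
        prev = node
        nodes[key] = node
    return head, nodes

def search_two_lists(top_head, target):
    """Return (found, path), where path lists the (list, key) stops visited."""
    node = top_head
    path = [("L1", node.key)]
    # Walk right in L1 until going right would go too far.
    while node.next is not None and node.next.key <= target:
        node = node.next
        path.append(("L1", node.key))
    # Walk down to L0.
    node = node.down
    path.append(("L0", node.key))
    # Walk right in L0 until the target is found or the next key overshoots.
    while node.next is not None and node.next.key <= target:
        node = node.next
        path.append(("L0", node.key))
    return node.key == target, path

l0_head, l0_nodes = link([14, 23, 34, 42, 50, 59, 66, 72, 79])
l1_head, _ = link([14, 34, 42, 72], below=l0_nodes)
print(search_two_lists(l1_head, 66))
print(search_two_lists(l1_head, 67))
```

Searching for 66 rides L1 from 14 to 42, drops to L0 at 42, and walks to 66; searching for 67 follows the same path and fails because the next stop, 72, overshoots.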
260 00:17:02,845 --> 00:17:04,780 And my search cost is going to be 261 00:17:04,780 --> 00:17:12,420 approximately the length of L1. 262 00:17:12,420 --> 00:17:17,470 The worst case analysis, I could go all the way 263 00:17:17,470 --> 00:17:22,030 on the top list-- it's possible. 264 00:17:22,030 --> 00:17:30,680 But for a given value, I'm going to be looking at only 265 00:17:30,680 --> 00:17:34,730 a portion of the bottom list. 266 00:17:34,730 --> 00:17:37,700 I'm not going to go all the way on the bottom list ever. 267 00:17:37,700 --> 00:17:40,250 I'm only going to be looking at a portion of it. 268 00:17:40,250 --> 00:17:44,220 So it's going to be L0 divided by L1, 269 00:17:44,220 --> 00:17:51,180 if I have interspersed my express stops in a uniform way. 270 00:17:51,180 --> 00:17:54,600 So there's no reason-- if I have 100 elements 271 00:17:54,600 --> 00:17:59,750 in the bottom list, and if I had five, 272 00:17:59,750 --> 00:18:02,160 just for argument sake, five in the top list, 273 00:18:02,160 --> 00:18:08,060 then I'd put them at, let's say, the 0 position, 20, 40, 60, 274 00:18:08,060 --> 00:18:09,260 et cetera. 275 00:18:09,260 --> 00:18:11,540 So I want to have roughly equal spacings. 276 00:18:11,540 --> 00:18:15,505 But we need to make that a little more concrete, 277 00:18:15,505 --> 00:18:17,600 and a little more precise. 278 00:18:17,600 --> 00:18:19,210 And what I'm saying here simply is 279 00:18:19,210 --> 00:18:23,300 that this is the cost of traversal in the top list, 280 00:18:23,300 --> 00:18:26,022 and this is the cost of traversal in the bottom list, 281 00:18:26,022 --> 00:18:28,480 because I'm not going to go all the way in the bottom list. 282 00:18:28,480 --> 00:18:31,650 I'm only going to go a portion on the bottom list. 283 00:18:31,650 --> 00:18:33,152 Everybody gets that? 284 00:18:33,152 --> 00:18:34,400 Yup? 285 00:18:34,400 --> 00:18:35,830 All right, good. 
286 00:18:35,830 --> 00:18:38,860 So if I want to minimize this cost, which 287 00:18:38,860 --> 00:18:44,250 is going to tell me how to scatter 288 00:18:44,250 --> 00:18:50,230 these elements in the top list, how to choose my express stops, 289 00:18:50,230 --> 00:18:54,560 if you will-- I want to scatter these in a uniform way, 290 00:18:54,560 --> 00:19:01,908 then this is minimized when terms are equal. 291 00:19:04,920 --> 00:19:07,510 You could go off and differentiate and do that. 292 00:19:07,510 --> 00:19:09,670 It's fairly standard. 293 00:19:09,670 --> 00:19:16,950 And what you end up getting is you want to get L1 square 294 00:19:16,950 --> 00:19:20,510 equals L0 equals n. 295 00:19:20,510 --> 00:19:23,460 So all of the elements are down at the bottom list, 296 00:19:23,460 --> 00:19:27,730 and so the cardinality of the bottom list is n. 297 00:19:27,730 --> 00:19:34,980 And roughly speaking, you're going to end up optimizing, 298 00:19:34,980 --> 00:19:39,950 if you have this satisfied, which means that L1 is 299 00:19:39,950 --> 00:19:42,010 going to be square root of n. 300 00:19:42,010 --> 00:19:43,310 OK? 301 00:19:43,310 --> 00:19:46,870 So what you've done here is you've said a bunch of things, 302 00:19:46,870 --> 00:19:48,030 actually. 303 00:19:48,030 --> 00:19:50,760 You've decided how many elements are 304 00:19:50,760 --> 00:19:53,580 going to be in your top list. 305 00:19:53,580 --> 00:19:55,450 If there's n elements in the bottom list, 306 00:19:55,450 --> 00:19:59,580 you want to have the square root of n elements in the top list. 
307 00:19:59,580 --> 00:20:02,100 And not only that, in order to make sure 308 00:20:02,100 --> 00:20:07,910 that this works properly, and that you don't get a worst-case 309 00:20:07,910 --> 00:20:10,930 cost that is not optimal, you do have 310 00:20:10,930 --> 00:20:15,340 to intersperse the square root of n elements 311 00:20:15,340 --> 00:20:19,630 at regular intervals in relation to the bottom list 312 00:20:19,630 --> 00:20:21,020 on the top list. 313 00:20:21,020 --> 00:20:24,820 OK, so pictorially what this means is it's 314 00:20:24,820 --> 00:20:26,730 not what you have here. 315 00:20:26,730 --> 00:20:29,780 What you really want is something 316 00:20:29,780 --> 00:20:40,920 that, let's say, looks like this where this part here 317 00:20:40,920 --> 00:20:45,750 is square root of n elements up until that point, 318 00:20:45,750 --> 00:20:50,190 and then let's say we go from here to here 319 00:20:50,190 --> 00:20:52,930 for square root of n elements, and maybe I'll 320 00:20:52,930 --> 00:20:57,610 have a 66 here because that's exactly where I 321 00:20:57,610 --> 00:20:59,180 want my square root of n. 322 00:20:59,180 --> 00:21:02,130 Basically, three elements in between. 323 00:21:02,130 --> 00:21:06,870 So I got 66 here, et cetera. 324 00:21:06,870 --> 00:21:09,310 I mean I chose n to be a particular value here, 325 00:21:09,310 --> 00:21:12,230 but you get the picture. 326 00:21:12,230 --> 00:21:15,210 So the search now, as you can see if you just add those up 327 00:21:15,210 --> 00:21:17,240 you get square root of n here, and you 328 00:21:17,240 --> 00:21:19,270 got n divided by square root of n here. 329 00:21:19,270 --> 00:21:21,060 So that's square root of n as well. 330 00:21:21,060 --> 00:21:29,530 So the search cost is order square root of n. 331 00:21:29,530 --> 00:21:30,766 And so that's it.
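The step "minimized when the terms are equal" can be written out as follows. This is a sketch of the standard calculus argument the lecture alludes to; treating the size of L1 as a continuous variable x is an assumption made only for the derivation.

```latex
\begin{align*}
\mathrm{cost}(x) &= x + \frac{n}{x}, \qquad x = |L_1|,\quad n = |L_0| \\
\frac{d}{dx}\,\mathrm{cost}(x) &= 1 - \frac{n}{x^{2}} = 0
  \;\Longrightarrow\; x^{2} = n
  \;\Longrightarrow\; x = \sqrt{n} \\
\mathrm{cost}(\sqrt{n}) &= \sqrt{n} + \frac{n}{\sqrt{n}} = 2\sqrt{n} = O(\sqrt{n})
\end{align*}
```

At the minimum both terms equal the square root of n, which is the "terms are equal" condition, and the second derivative 2n/x^3 is positive for x > 0, so this is indeed a minimum.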
332 00:21:30,766 --> 00:21:36,660 That's the first generalization, and really 333 00:21:36,660 --> 00:21:39,900 the most important one, that comes 334 00:21:39,900 --> 00:21:42,580 from going from a single sorted list 335 00:21:42,580 --> 00:21:47,370 to an approximation of a skip list. 336 00:21:47,370 --> 00:21:51,420 So what do you do if you want to make things better? 337 00:21:51,420 --> 00:21:52,970 So we want to make things better? 338 00:21:52,970 --> 00:21:54,609 Are we happy with square root of n? 339 00:21:54,609 --> 00:21:55,150 AUDIENCE: No. 340 00:21:55,150 --> 00:21:55,500 SRINIVAS DEVADAS: No. 341 00:21:55,500 --> 00:21:56,966 Well, what's our target? 342 00:21:56,966 --> 00:21:57,760 AUDIENCE: Log n. 343 00:21:57,760 --> 00:21:59,370 SRINIVAS DEVADAS: Log n, obviously. 344 00:21:59,370 --> 00:22:01,328 Well, I guess you can argue that our target may 345 00:22:01,328 --> 00:22:03,880 be order 1 at some point, but for today's lecture 346 00:22:03,880 --> 00:22:06,960 it is order log n with high probability. 347 00:22:06,960 --> 00:22:08,150 We'll leave it at that. 348 00:22:08,150 --> 00:22:14,940 And so what do you do if you want to go this way 349 00:22:14,940 --> 00:22:16,370 and generalize? 350 00:22:16,370 --> 00:22:18,525 You simply add more lists. 351 00:22:18,525 --> 00:22:20,650 I mean it seems to be pretty much the only thing we 352 00:22:20,650 --> 00:22:22,050 could do here. 353 00:22:22,050 --> 00:22:25,900 So let's go ahead and add a third list. 354 00:22:25,900 --> 00:22:32,710 So if you have two sorted lists, that 355 00:22:32,710 --> 00:22:34,560 implies I have 2 square root of n. 356 00:22:34,560 --> 00:22:37,240 If I want to be explicit about the constant in terms 357 00:22:37,240 --> 00:22:40,050 of the search cost, assuming things 358 00:22:40,050 --> 00:22:42,900 are interspersed exactly right. 
359 00:22:42,900 --> 00:22:45,560 Keep that in mind because that is going to go away 360 00:22:45,560 --> 00:22:47,000 when we go and randomize. 361 00:22:47,000 --> 00:22:50,220 We're going to be flipping coins and things like that. 362 00:22:50,220 --> 00:22:53,880 But so far, things are very structured. 363 00:22:53,880 --> 00:22:59,690 What do you think-- we won't do this analysis-- the cost is 364 00:22:59,690 --> 00:23:04,410 going to be if I intersperse optimally, what 365 00:23:04,410 --> 00:23:08,830 is the cost going to be for a search 366 00:23:08,830 --> 00:23:12,575 when I have three sorted lists? 367 00:23:12,575 --> 00:23:13,624 AUDIENCE: Cube root. 368 00:23:13,624 --> 00:23:14,790 SRINIVAS DEVADAS: Cube root. 369 00:23:14,790 --> 00:23:15,540 Great guess. 370 00:23:15,540 --> 00:23:16,930 Who said cube root? 371 00:23:16,930 --> 00:23:17,960 AUDIENCE: [INAUDIBLE]. 372 00:23:17,960 --> 00:23:19,834 SRINIVAS DEVADAS: You already have a Frisbee. 373 00:23:19,834 --> 00:23:20,780 Give it to a friend. 374 00:23:20,780 --> 00:23:21,905 I need to get rid of these. 375 00:23:24,810 --> 00:23:29,730 So it's going to be cube root, and the constant 376 00:23:29,730 --> 00:23:31,116 in front of that is going to be? 377 00:23:31,116 --> 00:23:31,647 AUDIENCE: 3. 378 00:23:31,647 --> 00:23:32,480 SRINIVAS DEVADAS: 3. 379 00:23:32,480 --> 00:23:36,380 So you have-- right? 380 00:23:36,380 --> 00:23:37,910 So let's just keep going. 381 00:23:37,910 --> 00:23:39,860 You have k sorted lists. 382 00:23:39,860 --> 00:23:43,550 You're going to have k times the k-th root of n. 383 00:23:47,700 --> 00:23:49,310 That's what you got. 
384 00:23:49,310 --> 00:23:51,070 And I'm not going to bother drawing this, 385 00:23:51,070 --> 00:23:53,160 but essentially what happens is you 386 00:23:53,160 --> 00:23:57,110 are making the same number of moves which corresponds 387 00:23:57,110 --> 00:24:01,180 to the root of n, the corresponding root of n, 388 00:24:01,180 --> 00:24:04,420 at every level. 389 00:24:04,420 --> 00:24:12,600 And the last thing we have to do to get a sense for what happens 390 00:24:12,600 --> 00:24:17,630 here is we have log n sorted lists, so the number of levels 391 00:24:17,630 --> 00:24:20,770 here is log n. 392 00:24:20,770 --> 00:24:24,840 So this is starting to look kind of familiar because it borrows 393 00:24:24,840 --> 00:24:26,530 from other data structures. 394 00:24:26,530 --> 00:24:31,390 And what this is I'm just going to substitute log n for k, 395 00:24:31,390 --> 00:24:36,240 and I got this kind of scary looking-- 396 00:24:36,240 --> 00:24:38,037 I was scared the first time I saw this. 397 00:24:41,306 --> 00:24:42,150 Oh, this is n. 398 00:24:45,050 --> 00:24:49,494 It's the log n-th root of n, OK? 399 00:24:49,494 --> 00:24:50,910 And so it's kind of scary looking. 400 00:24:50,910 --> 00:24:53,840 But what is the log n-th root of n-- and we can assume 401 00:24:53,840 --> 00:24:56,102 that n is a power of two? 402 00:24:56,102 --> 00:24:56,602 AUDIENCE: 2. 403 00:24:56,602 --> 00:24:58,396 SRINIVAS DEVADAS: 2, exactly. 404 00:24:58,396 --> 00:25:00,020 It's not that scary looking, and that's 405 00:25:00,020 --> 00:25:01,630 because I'm not a mathematician. 406 00:25:01,630 --> 00:25:03,710 That's why I was scared. 407 00:25:03,710 --> 00:25:06,930 So 2 log n. 408 00:25:06,930 --> 00:25:07,850 All right. 409 00:25:07,850 --> 00:25:10,140 So that's it. 410 00:25:10,140 --> 00:25:14,170 So you get a sense of how this works now, right? 
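The progression from one list to log n lists can be checked numerically. The sketch below simply evaluates the idealized cost k times the k-th root of n from the board; the function name `search_cost` and the choice n = 2^16 are assumptions for illustration.

```python
import math

def search_cost(n, k):
    """Idealized search cost with k evenly interspersed sorted lists: k * n**(1/k)."""
    return k * n ** (1.0 / k)

n = 2 ** 16
log_n = int(math.log2(n))  # 16 levels when n is a power of two

for k in (1, 2, 3, log_n):
    print(k, round(search_cost(n, k), 2))

# With k = log2(n), the k-th root of n is 2, so the cost is 2 * log2(n).
print(math.isclose(search_cost(n, log_n), 2 * log_n))
```

With k = 1 this is the single sorted list at cost n, k = 2 gives 2 times the square root of n, and k = log n gives 2 log n because the log n-th root of n is n^(1/log2 n) = 2.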
411 00:25:14,170 --> 00:25:16,730 We haven't talked about randomized structures yet, 412 00:25:16,730 --> 00:25:18,860 but I've given you the template that's 413 00:25:18,860 --> 00:25:22,990 associated with the skip list, which essentially says what I'm 414 00:25:22,990 --> 00:25:29,990 going to have are-- if it was static data items and n was 415 00:25:29,990 --> 00:25:32,500 a power of two, then essentially what 416 00:25:32,500 --> 00:25:37,230 I'm saying is I'm going to have a bunch of items, n items, 417 00:25:37,230 --> 00:25:38,620 at the bottom. 418 00:25:38,620 --> 00:25:41,690 I'm going to have n over 2 items in the list that's 419 00:25:41,690 --> 00:25:44,280 just immediately above. 420 00:25:44,280 --> 00:25:46,620 And each of them are going to be alternating. 421 00:25:46,620 --> 00:25:48,570 You're going to have an item in between. 422 00:25:48,570 --> 00:25:52,210 And then on the top I'm going to see n over 4 items, 423 00:25:52,210 --> 00:25:53,940 and so on and so forth. 424 00:25:53,940 --> 00:25:56,270 What does that look like? 425 00:25:56,270 --> 00:25:57,710 Kind of looks like a tree, right? 426 00:25:57,710 --> 00:25:59,210 I mean it doesn't have the structure 427 00:25:59,210 --> 00:26:02,072 of a tree in the sense of the edges of a tree. 428 00:26:02,072 --> 00:26:03,530 It's quite different because you're 429 00:26:03,530 --> 00:26:06,300 connecting things differently. 430 00:26:06,300 --> 00:26:07,970 You have all the leaves connected down 431 00:26:07,970 --> 00:26:10,170 at the bottom of this so-called tree 432 00:26:10,170 --> 00:26:13,890 with this doubly linked list, but it has the triangle 433 00:26:13,890 --> 00:26:15,140 structure of a tree. 434 00:26:15,140 --> 00:26:17,490 And that's where the log n comes from. 435 00:26:17,490 --> 00:26:20,290 So this would all be wonderful 436 00:26:20,290 --> 00:26:22,340 if this were a static set.
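The static template described here (n items at the bottom, n over 2 above, n over 4 above that, with alternating items promoted) can be sketched as below. It is only an illustration of the ideal, non-randomized layout; `ideal_levels` is a hypothetical name, plain Python lists stand in for the linked lists, and stopping once a level holds a single key is a choice made for this sketch.

```python
def ideal_levels(keys):
    """Return the levels of an ideal static skip list over sorted `keys`.

    levels[0] holds every key; each higher level keeps every other key
    of the level below, always including the first key.
    """
    levels = [list(keys)]
    while len(levels[-1]) > 1:
        levels.append(levels[-1][::2])  # promote alternating items
    return levels

for level in reversed(ideal_levels([14, 23, 34, 42, 50, 59, 66, 72])):
    print(level)
```

For n = 8 keys this prints levels of size 1, 2, 4, and 8 from top to bottom, which is the triangle shape the lecture compares to a tree; the number of levels is log2(n) + 1.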
437 00:26:22,340 --> 00:26:25,910 And n doesn't have to be a power of 2-- you could pad it, 438 00:26:25,910 --> 00:26:27,180 and so on and so forth. 439 00:26:27,180 --> 00:26:29,400 But the big thing here is that we haven't quite 440 00:26:29,400 --> 00:26:32,900 accomplished what we set out to do, 441 00:26:32,900 --> 00:26:37,220 even though we seem to have this log n cost for search. 442 00:26:37,220 --> 00:26:41,840 But it's all based on a static set which doesn't change. 443 00:26:41,840 --> 00:26:45,010 And the problem, of course, is that you could have deletions. 444 00:26:45,010 --> 00:26:47,480 You want to take away 42. 445 00:26:47,480 --> 00:26:50,130 For some reason you can't go to 42nd Avenue, 446 00:26:50,130 --> 00:26:53,090 or I guess art-- you can't go to [INAUDIBLE] 447 00:26:53,090 --> 00:26:55,300 would be a better example. 448 00:26:55,300 --> 00:26:58,300 So stuff breaks, right? 449 00:26:58,300 --> 00:27:01,530 And so you take stuff out and you insert things in. 450 00:27:01,530 --> 00:27:06,650 Suppose I wanted to insert 60, 61, 62, 63, and 64 451 00:27:06,650 --> 00:27:08,180 into that list that I have? 452 00:27:08,180 --> 00:27:10,239 What would happen? 453 00:27:10,239 --> 00:27:11,530 Yeah, you're shaking your head. 454 00:27:11,530 --> 00:27:17,490 I mean that log n would go away, so it would be a problem. 455 00:27:17,490 --> 00:27:20,050 But what we have to do now is move 456 00:27:20,050 --> 00:27:22,320 to the probabilistic domain. 457 00:27:22,320 --> 00:27:23,990 We have to think about what happens 458 00:27:23,990 --> 00:27:25,190 when we insert elements. 459 00:27:25,190 --> 00:27:27,000 We need an algorithm for insert. 460 00:27:27,000 --> 00:27:30,510 So then we can start with the null list and build it up. 
461 00:27:30,510 --> 00:27:32,990 And when you start with a null list 462 00:27:32,990 --> 00:27:35,410 and you have a randomized algorithm for insert, 463 00:27:35,410 --> 00:27:37,490 it ain't going to look that pretty. 464 00:27:37,490 --> 00:27:39,990 It's going to look random. 465 00:27:39,990 --> 00:27:42,460 But you have to have a certain amount of structure 466 00:27:42,460 --> 00:27:45,190 so you can still get your order log n. 467 00:27:45,190 --> 00:27:48,630 So you have to do the insertion appropriately. 468 00:27:48,630 --> 00:27:50,325 So that's what we have to do next. 469 00:27:50,325 --> 00:27:51,950 But any questions about that complexity 470 00:27:51,950 --> 00:27:52,832 that I have up there? 471 00:27:55,460 --> 00:27:56,780 All right, good. 472 00:28:03,280 --> 00:28:08,460 I want a canonical example of a list here, 473 00:28:08,460 --> 00:28:10,400 and I kind of ran out of room over there, 474 00:28:10,400 --> 00:28:23,220 so bear with me as I draw you a more sophisticated skip list 475 00:28:23,220 --> 00:28:25,900 that has a few more levels. 476 00:28:25,900 --> 00:28:30,090 And the reason for this is it's only interesting when 477 00:28:30,090 --> 00:28:33,100 you have three or more levels. 478 00:28:33,100 --> 00:28:35,330 The search algorithm is kind of the same. 479 00:28:35,330 --> 00:28:39,070 You go up top and when you overshoot you pop down 480 00:28:39,070 --> 00:28:43,670 one level, and then you do the same thing over and over. 481 00:28:43,670 --> 00:28:48,060 But we are going to have to bound 482 00:28:48,060 --> 00:28:52,740 the number of levels in the skip list in a probabilistic way. 483 00:28:52,740 --> 00:28:57,410 We have to actually discover the expected number of levels 484 00:28:57,410 --> 00:29:01,540 because we're going to be doing inserts in a randomized way.
485 00:29:01,540 --> 00:29:04,380 And so it's worthwhile having a picture that's 486 00:29:04,380 --> 00:29:07,750 a little more interesting than the picture of the two linked 487 00:29:07,750 --> 00:29:09,800 lists that I had up there. 488 00:29:09,800 --> 00:29:14,685 So I'm going to leave this on for the rest of the lecture. 489 00:29:23,000 --> 00:29:25,000 So that's our bottom, and that hasn't changed 490 00:29:25,000 --> 00:29:27,420 from our previous examples. 491 00:29:27,420 --> 00:29:34,450 I'm not going to bother drawing the horizontal connections. 492 00:29:34,450 --> 00:29:39,080 When you see things adjacent horizontally at the same level, 493 00:29:39,080 --> 00:29:43,896 assume that they're all connected-- all of them. 494 00:29:46,950 --> 00:29:52,440 And so I have four levels here. 495 00:29:52,440 --> 00:29:56,310 And you can think of this as being the entire list 496 00:29:56,310 --> 00:29:58,310 or part of it. 497 00:29:58,310 --> 00:30:02,685 Just to delineate things nicely, we'll 498 00:30:02,685 --> 00:30:07,940 assume that 79, which is the last element, 499 00:30:07,940 --> 00:30:10,150 is all the way up at the top as well. 500 00:30:10,150 --> 00:30:15,300 Sort of the terminus, termini, corresponding 501 00:30:15,300 --> 00:30:21,980 to our analogy of subways. 502 00:30:21,980 --> 00:30:24,400 And so that's our top-most level. 503 00:30:24,400 --> 00:30:35,470 And then I might have 50 here at this level, 504 00:30:35,470 --> 00:30:36,720 or so that looks like. 505 00:30:36,720 --> 00:30:39,320 I will have 50, so the invariant here, 506 00:30:39,320 --> 00:30:41,950 and that's another reason I want to draw this out, 507 00:30:41,950 --> 00:30:48,680 is that if you have a station at highest level, then 508 00:30:48,680 --> 00:30:54,000 you will have-- it's got to be sitting on something. 
509 00:30:54,000 --> 00:30:57,180 So if you've got a 79 at level four, or level three 510 00:30:57,180 --> 00:31:06,200 here if this is L0, then you will see 79 at L2, L1, and L0. 511 00:31:06,200 --> 00:31:09,900 And if you see 50 here, it's not in L3 so that's OK, 512 00:31:09,900 --> 00:31:13,420 but it's in L2, so it's got to be at L1 as well. 513 00:31:13,420 --> 00:31:15,660 Of course you know that everything is down at L0, 514 00:31:15,660 --> 00:31:18,800 so this is interesting from a standpoint of the relationship 515 00:31:18,800 --> 00:31:25,220 between Li and Li plus 1 where i is greater than or equal to 1. 516 00:31:25,220 --> 00:31:33,440 So the implication is that if you see it at Li plus 1, 517 00:31:33,440 --> 00:31:37,340 it's going to be at Li and Li minus 1 518 00:31:37,340 --> 00:31:39,940 if that happens to exist, et cetera. 519 00:31:39,940 --> 00:31:42,980 And so one last thing here just to finish it up. 520 00:31:42,980 --> 00:31:48,490 I got 34 here, which is an additional thing which 521 00:31:48,490 --> 00:31:49,160 ends there. 522 00:31:49,160 --> 00:31:54,060 So its highest level is this second level, or L1. 523 00:31:54,060 --> 00:31:56,760 This is 66. 524 00:31:56,760 --> 00:31:58,850 And then that's it. 525 00:31:58,850 --> 00:32:01,910 So that's our skip list. 526 00:32:01,910 --> 00:32:05,910 So if you wanted to search for 72, you would start here, 527 00:32:05,910 --> 00:32:09,460 and then you'd go to 79, or you'd look and say, 528 00:32:09,460 --> 00:32:13,560 oh, 79 is too far, so I'm going to pop down a level. 529 00:32:13,560 --> 00:32:15,320 And then you'd say 50, oh, good. 530 00:32:15,320 --> 00:32:16,690 I can get to 50. 531 00:32:16,690 --> 00:32:19,520 79 is too far, so I'm going to pop down a level. 532 00:32:19,520 --> 00:32:23,820 And then you go to 66-- 79 is too far-- and at 66, 533 00:32:23,820 --> 00:32:29,220 you pop down a level and then you go 66 to 72.
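The search walk just traced (move right until the next key would overshoot, then pop down a level) can be sketched as follows. The board drawing is not part of the transcript, so the level contents in `LEVELS` are an assumption built only from values named in the lecture, and `Node`, `build`, and `search` are illustrative names, not the lecturer's code.

```python
class Node:
    """One copy of a key on one level of the skip list."""

    def __init__(self, key):
        self.key = key
        self.right = None  # next node on the same level
        self.down = None   # copy of the same key one level below


def build(levels):
    """Link up sorted key lists (levels[0] is the bottom) and return the top-left node."""
    below = {}   # key -> node on the level built just before
    head = None
    for keys in levels:
        current = {}
        prev = None
        for key in keys:
            node = Node(key)
            node.down = below.get(key)  # None on the bottom level
            if prev is not None:
                prev.right = node
            prev = node
            current[key] = node
        head = current[keys[0]]
        below = current
    return head


def search(head, target):
    """Return (found, keys reached by moving right) for a search from the top-left node."""
    node = head
    right_moves = []
    while True:
        # Walk right as long as the next key does not overshoot the target.
        while node.right is not None and node.right.key <= target:
            node = node.right
            right_moves.append(node.key)
        if node.down is None:  # bottom level reached
            return node.key == target, right_moves
        node = node.down       # the next key is too far: pop down one level


# Assumed level contents, bottom (L0) first; only values named in the lecture are used.
LEVELS = [
    [14, 23, 34, 42, 50, 66, 72, 79],  # L0
    [14, 34, 50, 66, 79],              # L1
    [14, 50, 79],                      # L2
    [14, 79],                          # L3
]
found, right_moves = search(build(LEVELS), 72)
```

Under these assumed levels the search for 72 moves right to 50, then 66, then 72, matching the walk described above.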
534 00:32:29,220 --> 00:32:31,120 So same as what we had before. 535 00:32:31,120 --> 00:32:34,310 Hopefully it's not too complicated. 536 00:32:34,310 --> 00:32:36,990 So that's our skip list. 537 00:32:36,990 --> 00:32:40,710 It's still looking pretty structured, 538 00:32:40,710 --> 00:32:43,000 looking pretty regular. 539 00:32:43,000 --> 00:32:44,640 But if I start taking that and start 540 00:32:44,640 --> 00:32:46,590 inserting things and deleting things, 541 00:32:46,590 --> 00:32:48,680 it could become quite irregular. 542 00:32:48,680 --> 00:32:50,540 I could take away 23, for example. 543 00:32:50,540 --> 00:32:52,123 And there's nothing that's stopping me 544 00:32:52,123 --> 00:32:54,446 from taking away 34 or 79. 545 00:32:54,446 --> 00:32:56,070 You've got to delete an element, you've 546 00:32:56,070 --> 00:32:57,574 got to delete an element. 547 00:32:57,574 --> 00:32:59,240 I mean the fact that it's in four levels 548 00:32:59,240 --> 00:33:01,530 shouldn't make a difference. 549 00:33:01,530 --> 00:33:03,680 And so that's something to keep in mind. 550 00:33:03,680 --> 00:33:07,110 So this could get pretty messy. 551 00:33:07,110 --> 00:33:08,830 So let's talk about insert, and I've 552 00:33:08,830 --> 00:33:12,190 spent a bunch of time skirting around the issue of what 553 00:33:12,190 --> 00:33:15,910 exactly happens when you insert an element. 554 00:33:15,910 --> 00:33:18,260 Turns out delete is pretty easy. 555 00:33:18,260 --> 00:33:20,330 Insert is more interesting. 556 00:33:20,330 --> 00:33:21,090 Let's do insert. 557 00:33:43,540 --> 00:33:45,855 To insert an element x into a skip list, 558 00:33:45,855 --> 00:33:47,230 the first thing we're going to do 559 00:33:47,230 --> 00:33:59,530 is search to figure out where x fits into the bottom list. 560 00:33:59,530 --> 00:34:04,270 So you do a search just like you would if you were just 561 00:34:04,270 --> 00:34:06,990 doing a search. 
562 00:34:06,990 --> 00:34:09,750 You always insert into the appropriate position. 563 00:34:09,750 --> 00:34:12,040 So if there's a single sorted list, 564 00:34:12,040 --> 00:34:13,248 that would pretty much be it. 565 00:34:16,400 --> 00:34:18,810 And so that part is easy. 566 00:34:18,810 --> 00:34:24,530 If you want to insert 67, you do all of the search operations 567 00:34:24,530 --> 00:34:26,270 that I just went over, and then you 568 00:34:26,270 --> 00:34:30,639 insert 67 between 66 and 72. 569 00:34:30,639 --> 00:34:33,949 So do your pointer manipulations, what have you, 570 00:34:33,949 --> 00:34:35,250 and you're good. 571 00:34:35,250 --> 00:34:38,250 But you're not done yet, because you want this to be a skip list 572 00:34:38,250 --> 00:34:41,730 and you want this to have expected search 573 00:34:41,730 --> 00:34:47,199 over any random query as the list grows and shrinks 574 00:34:47,199 --> 00:34:51,060 of order log n, expectation, and also with high probability. 575 00:34:51,060 --> 00:34:54,560 So what you're going to have to do is when you start inserting, 576 00:34:54,560 --> 00:34:56,820 you're going to have to decide if you're 577 00:34:56,820 --> 00:35:01,740 going to what is called promote these elements or not. 578 00:35:01,740 --> 00:35:05,240 And the notion of a promotion is that you 579 00:35:05,240 --> 00:35:09,500 are going up and duplicating this inserted element 580 00:35:09,500 --> 00:35:11,820 some number of levels up. 581 00:35:11,820 --> 00:35:16,520 So if you just look at how this works, 582 00:35:16,520 --> 00:35:18,490 it's really pretty straightforward. 583 00:35:18,490 --> 00:35:22,070 What is going to happen is simply that let's say I have 67 584 00:35:22,070 --> 00:35:25,000 and I'm going to insert it between 66 and 72. 585 00:35:25,000 --> 00:35:26,230 That much is a given. 586 00:35:26,230 --> 00:35:28,020 That is deterministic. 587 00:35:28,020 --> 00:35:33,550 Then I'm going to flip a coin or spin a Frisbee. 
588 00:35:33,550 --> 00:35:36,180 I like this better. 589 00:35:36,180 --> 00:35:38,420 I'm not sure if this is biased or not. 590 00:35:38,420 --> 00:35:40,230 It's probably seriously biased. 591 00:35:40,230 --> 00:35:42,880 [LAUGHTER] 592 00:35:42,880 --> 00:35:47,650 Would it ever go the other way is the question. 593 00:35:47,650 --> 00:35:48,400 Would it ever? 594 00:35:48,400 --> 00:35:49,560 No. 595 00:35:49,560 --> 00:35:50,120 All right. 596 00:35:50,120 --> 00:35:51,860 So we've got a problem here. 597 00:35:51,860 --> 00:35:53,920 I think we might have to do something like that. 598 00:35:53,920 --> 00:35:57,030 [LAUGHTER] 599 00:35:57,030 --> 00:35:58,280 I'm procrastinating. 600 00:35:58,280 --> 00:36:00,280 I don't want to teach the rest of this material. 601 00:36:00,280 --> 00:36:05,630 [LAUGHTER] 602 00:36:05,630 --> 00:36:06,630 All right. 603 00:36:06,630 --> 00:36:08,840 Let's go, let's go. 604 00:36:08,840 --> 00:36:23,640 So I'd like to insert into some of the lists, 605 00:36:23,640 --> 00:36:25,420 and the big question is which ones? 606 00:36:30,060 --> 00:36:32,370 It's going to be really cool. 607 00:36:32,370 --> 00:36:36,560 I'm just going to flip coins, fair coins, 608 00:36:36,560 --> 00:36:42,855 and decide how much to promote these elements. 609 00:36:51,940 --> 00:36:57,790 So flip fair coin. 610 00:36:57,790 --> 00:37:13,000 If heads, promote x to the next level up, and repeat. 611 00:37:21,100 --> 00:37:26,400 Else, if you ever get a tails, you stop. 612 00:37:26,400 --> 00:37:29,030 And this next level up may be newly created. 613 00:37:35,370 --> 00:37:41,530 So what might happen with the 67 is that you stick it in here, 614 00:37:41,530 --> 00:37:44,520 and it might happen that the first time you flip you 615 00:37:44,520 --> 00:37:47,520 get a tails, in which case, 67 is going 616 00:37:47,520 --> 00:37:49,270 to just be at the bottom list. 
617 00:37:49,270 --> 00:37:51,760 But if you get one heads, then you're not only 618 00:37:51,760 --> 00:37:54,020 going to put 67 in here, you're going 619 00:37:54,020 --> 00:37:55,880 to put 67 up here as well. 620 00:37:55,880 --> 00:37:59,160 And you're going to flip again. 621 00:37:59,160 --> 00:38:04,890 And if you get a heads again, you're going to put 67 up here. 622 00:38:04,890 --> 00:38:10,040 And if you get a heads again, you're going to put 67 up here. 623 00:38:10,040 --> 00:38:11,650 And if you get a heads again, you're 624 00:38:11,650 --> 00:38:13,790 going to create a new list up there, 625 00:38:13,790 --> 00:38:17,265 and at this point when you create the new list, 626 00:38:17,265 --> 00:38:20,420 it's only going to be 67 up there. 627 00:38:20,420 --> 00:38:24,330 And that's going to be the front of your list, 628 00:38:24,330 --> 00:38:27,210 because that's the one element that you're duplicating. 629 00:38:27,210 --> 00:38:30,660 So you're going to keep going until you get a tails. 630 00:38:30,660 --> 00:38:34,430 Now, that's why this coin had better be fair. 631 00:38:34,430 --> 00:38:36,310 So you're going to keep going and you're 632 00:38:36,310 --> 00:38:37,910 going to keep adding. 633 00:38:37,910 --> 00:38:40,950 Every time you insert there's a potential 634 00:38:40,950 --> 00:38:44,700 for increasing the number of levels in this list. 635 00:38:44,700 --> 00:38:47,810 Now, the number of levels is going 636 00:38:47,810 --> 00:38:52,270 to be bounded, whether with high probability 637 00:38:52,270 --> 00:38:55,220 or in regular expectation, but I want 638 00:38:55,220 --> 00:38:57,820 to make it clear that every time you insert, 639 00:38:57,820 --> 00:38:59,690 if you get a chain of heads, you're 640 00:38:59,690 --> 00:39:02,130 going to be adding levels. 641 00:39:02,130 --> 00:39:06,550 And so the first time you get a tails, you just stop. 642 00:39:06,550 --> 00:39:08,090 You just stop.
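The insert rule described above (put x in the bottom list, then promote once per heads until the first tails) can be sketched like this. It is a minimal illustration, not the lecturer's code: sorted Python lists stand in for the linked lists, and the `flip` parameter is a hypothetical hook so the coin can be scripted for a reproducible example.

```python
import bisect
import random


def insert(levels, x, flip=lambda: random.random() < 0.5):
    """Insert x into the bottom list, then promote it once per heads until the first tails.

    levels[0] is the bottom list; sorted Python lists stand in for linked lists.
    flip() returns True for heads. Returns the highest level that now holds x.
    """
    bisect.insort(levels[0], x)  # deterministic step: x always goes in the bottom list
    level = 0
    while flip():                # heads: promote and flip again
        level += 1
        if level == len(levels):
            levels.append([])    # the next level up may be newly created
        bisect.insort(levels[level], x)
    return level                 # first tails: stop


# Scripted flips (heads, heads, tails) keep the example reproducible.
levels = [[14, 23, 34, 42, 50, 66, 72, 79], [14, 34, 50, 66, 79], [14, 50, 79], [14, 79]]
flips = iter([True, True, False])
top = insert(levels, 67, flip=lambda: next(flips))
```

With two heads followed by a tails, 67 lands in the bottom list and in the two lists above it. A longer run of heads would push it higher and eventually create a new top list holding only 67.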
643 00:39:08,090 --> 00:39:11,730 So you can see that this can get pretty messy pretty quick. 644 00:39:11,730 --> 00:39:14,180 And especially if you were starting from ground zero 645 00:39:14,180 --> 00:39:16,610 and adding 14, 23-- all of those things, 646 00:39:16,610 --> 00:39:18,120 the bottom is going to look exactly 647 00:39:18,120 --> 00:39:20,600 like it looks now because you're going to put it in there. 648 00:39:20,600 --> 00:39:21,910 It's deterministic. 649 00:39:21,910 --> 00:39:24,740 But the very next level after that could look pretty messy. 650 00:39:24,740 --> 00:39:26,560 You could have all of them chunked up here, 651 00:39:26,560 --> 00:39:29,000 and a big gap, et cetera, et cetera. 652 00:39:29,000 --> 00:39:33,600 So it's all about randomized search cost. 653 00:39:33,600 --> 00:39:37,920 The worst case cost here is going to be order n. 654 00:39:37,920 --> 00:39:40,030 Worst case cost is going to be order n, 655 00:39:40,030 --> 00:39:42,310 because you have no idea where these things are 656 00:39:42,310 --> 00:39:43,330 going to end up. 657 00:39:43,330 --> 00:39:47,020 But the randomized cost is what's cool about this. 658 00:39:47,020 --> 00:39:50,940 Any questions about insert or anything I said? 659 00:39:50,940 --> 00:39:51,887 Yeah, go ahead. 660 00:39:51,887 --> 00:39:53,512 AUDIENCE: Is worst case really order n? 661 00:39:53,512 --> 00:39:55,640 What if you had a really long, like a lot of lists 662 00:39:55,640 --> 00:39:58,200 on top of each other, and you start at the top of that 663 00:39:58,200 --> 00:40:01,530 and you had to walk all the way [INAUDIBLE]? 664 00:40:01,530 --> 00:40:05,820 SRINIVAS DEVADAS: Well, you go n down and n this way, right? 665 00:40:05,820 --> 00:40:08,589 You would be checking so it would be order n. 666 00:40:08,589 --> 00:40:10,130 AUDIENCE: So it's [? bounded ?] by n? 667 00:40:10,130 --> 00:40:11,980 SRINIVAS DEVADAS: Yeah, the worst case.
668 00:40:11,980 --> 00:40:13,790 AUDIENCE: Worst case is infinity. 669 00:40:13,790 --> 00:40:14,840 SRINIVAS DEVADAS: Worst case is infinity. 670 00:40:14,840 --> 00:40:15,890 Oh, in that sense, yeah. 671 00:40:15,890 --> 00:40:17,020 OK. 672 00:40:17,020 --> 00:40:19,769 Well, n elements, Eric is right. 673 00:40:19,769 --> 00:40:21,310 So what is happening here is that you 674 00:40:21,310 --> 00:40:23,700 have a small probability that you will 675 00:40:23,700 --> 00:40:27,840 keep flipping heads forever. 676 00:40:27,840 --> 00:40:32,140 So at some level, if you somehow take that away and use 677 00:40:32,140 --> 00:40:34,750 Frisbees instead or you truncate it. 678 00:40:34,750 --> 00:40:37,780 Let's say at some point you ended up saying that you only 679 00:40:37,780 --> 00:40:39,185 have n levels total. 680 00:40:42,310 --> 00:40:47,240 So it's not a-- I should have gone there. 681 00:40:47,240 --> 00:40:49,370 The question has to be posed a little more 682 00:40:49,370 --> 00:40:52,560 precisely for the answer to be order n. 683 00:40:52,560 --> 00:40:55,420 You have to have some more limitations to avoid the case 684 00:40:55,420 --> 00:40:59,960 that Eric just mentioned, which is in the randomized situation 685 00:40:59,960 --> 00:41:03,070 you will have the possibility of getting 686 00:41:03,070 --> 00:41:04,807 an infinite number of heads. 687 00:41:04,807 --> 00:41:05,890 Yeah, question back there. 688 00:41:05,890 --> 00:41:06,806 AUDIENCE: [INAUDIBLE]. 689 00:41:10,110 --> 00:41:13,060 SRINIVAS DEVADAS: Yes, you can certainly do capping 690 00:41:13,060 --> 00:41:17,100 and you can do a bunch of other things. 691 00:41:17,100 --> 00:41:20,290 It ends up becoming something which is not 692 00:41:20,290 --> 00:41:22,320 as clean as what you have here. 693 00:41:22,320 --> 00:41:25,090 The analysis is messy.
694 00:41:25,090 --> 00:41:28,792 And it's sort of in between a randomized data structure, 695 00:41:28,792 --> 00:41:30,250 a purely randomized data structure, 696 00:41:30,250 --> 00:41:31,840 and a deterministic one. 697 00:41:34,510 --> 00:41:37,130 I think the important thing to bring out here 698 00:41:37,130 --> 00:41:43,571 is the worst case is much worse than order log n, OK? 699 00:41:43,571 --> 00:41:44,070 Cool. 700 00:41:44,070 --> 00:41:44,569 Good. 701 00:41:44,569 --> 00:41:46,560 Thanks for those questions. 702 00:41:46,560 --> 00:41:52,720 And so what we have here now is an insert algorithm that could 703 00:41:52,720 --> 00:41:56,890 make things look pretty messy. 704 00:41:56,890 --> 00:42:00,310 I'm going to leave the insert up here, and that, of course, 705 00:42:00,310 --> 00:42:02,380 is part of that. 706 00:42:02,380 --> 00:42:04,890 Now, for the rest of the lecture we're 707 00:42:04,890 --> 00:42:08,810 going to talk about why skip lists are good. 708 00:42:08,810 --> 00:42:12,110 And we're going to justify this randomized data structure 709 00:42:12,110 --> 00:42:16,790 and show lots of nice results with respect 710 00:42:16,790 --> 00:42:20,680 to the expectation on the number of levels, expectation 711 00:42:20,680 --> 00:42:22,390 on the number of moves in a search, 712 00:42:22,390 --> 00:42:26,410 regardless of what items you're inserting and deleting. 713 00:42:26,410 --> 00:42:27,700 One last thing. 714 00:42:27,700 --> 00:42:31,900 To delete an item, you just delete it. 715 00:42:31,900 --> 00:42:40,258 You find it, search, and delete at all levels. 716 00:42:43,000 --> 00:42:45,150 So you can't leave it in any of the levels. 717 00:42:45,150 --> 00:42:47,800 So you find it, and you have to have the pointers set up 718 00:42:47,800 --> 00:42:51,980 properly-- move the previous pointer over 719 00:42:51,980 --> 00:42:54,550 to the next one, et cetera, et cetera. 
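The pointer manipulation for delete can be sketched as below. This is a simplified illustration under stated assumptions: each level is a separate singly linked list behind a dummy head node, the helper names are made up, and every level is scanned from its left end for brevity, whereas a real implementation would follow the search path down instead.

```python
class Node:
    def __init__(self, key, right=None):
        self.key = key
        self.right = right


def make_level(keys):
    """Build one singly linked level behind a dummy head node (hypothetical helper)."""
    head = Node(None)
    tail = head
    for key in keys:
        tail.right = Node(key)
        tail = tail.right
    return head


def to_list(head):
    """Collect the keys of one level, skipping the dummy head."""
    keys = []
    node = head.right
    while node is not None:
        keys.append(node.key)
        node = node.right
    return keys


def delete(heads, x):
    """Unlink x from every level that contains it; return how many levels held it."""
    removed = 0
    for head in heads:
        prev = head
        # Find the node just before where x would sit on this level.
        while prev.right is not None and prev.right.key < x:
            prev = prev.right
        if prev.right is not None and prev.right.key == x:
            prev.right = prev.right.right  # move the previous pointer over to the next one
            removed += 1
    return removed


heads = [make_level(keys) for keys in (
    [14, 23, 34, 42, 50, 66, 72, 79], [14, 34, 50, 66, 79], [14, 50, 79], [14, 79])]
removed = delete(heads, 50)
```

Deleting 50 from these assumed levels unlinks it from the three lists that contain it and leaves the top list untouched.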
720 00:42:54,550 --> 00:42:56,520 We won't get into that here, but you 721 00:42:56,520 --> 00:43:01,150 have to do the delete at every level. 722 00:43:01,150 --> 00:43:01,880 Yeah, question. 723 00:43:01,880 --> 00:43:04,380 AUDIENCE: So what happens if you inserted 10 724 00:43:04,380 --> 00:43:06,380 and you flip a tails? 725 00:43:06,380 --> 00:43:08,720 So then your first element is not 726 00:43:08,720 --> 00:43:12,772 going to go up all the way, and then how do you do search? 727 00:43:12,772 --> 00:43:14,230 SRINIVAS DEVADAS: So typically what 728 00:43:14,230 --> 00:43:18,520 happens is you need to have a minus infinity here. 729 00:43:18,520 --> 00:43:19,559 And that's a good point. 730 00:43:19,559 --> 00:43:20,350 It's a corner case. 731 00:43:20,350 --> 00:43:21,933 You have to have a minus infinity that 732 00:43:21,933 --> 00:43:23,990 goes up all the way. 733 00:43:23,990 --> 00:43:25,240 Good question. 734 00:43:25,240 --> 00:43:28,790 So the question was what happens if I had something less than 14 735 00:43:28,790 --> 00:43:29,780 and I inserted it? 736 00:43:29,780 --> 00:43:31,760 Well, that doesn't happen because nothing 737 00:43:31,760 --> 00:43:35,040 is less than minus infinity, and that goes up all the way. 738 00:43:35,040 --> 00:43:37,740 But thanks for bringing it up. 739 00:43:37,740 --> 00:43:43,790 And so we're going to do a little warm-up Lemma. 740 00:43:43,790 --> 00:43:45,220 I don't know if you've ever heard 741 00:43:45,220 --> 00:43:51,520 these two terms in juxtaposition like this-- warm up and Lemma. 742 00:43:51,520 --> 00:43:54,330 But here you go, your first warm-up Lemma. 743 00:43:54,330 --> 00:43:57,330 I guess you'd never have a warm-up theorem. 744 00:43:57,330 --> 00:44:00,300 It's a warm-up Lemma for this theorem, which is 745 00:44:00,300 --> 00:44:04,060 going to take a while to prove.
746 00:44:04,060 --> 00:44:09,470 This comes down to trying to get a sense of how many levels 747 00:44:09,470 --> 00:44:12,540 you're going to have from a probabilistic standpoint. 748 00:44:12,540 --> 00:44:22,290 The number of levels in an n element skip list 749 00:44:22,290 --> 00:44:24,980 is order log n. 750 00:44:24,980 --> 00:44:29,740 And I'm going to now define the term with high probability. 751 00:44:29,740 --> 00:44:32,500 So what does this mean exactly? 752 00:44:32,500 --> 00:44:35,330 Well, what this means is order log n 753 00:44:35,330 --> 00:44:39,040 is something like c log n plus a constant. 754 00:44:39,040 --> 00:44:43,460 Let's ignore the constant and let's stick with c log n. 755 00:44:43,460 --> 00:44:48,640 And with high probability is a probability 756 00:44:48,640 --> 00:44:56,790 that is really a function of n and alpha. 757 00:44:56,790 --> 00:45:02,710 And you have this inverse polynomial relationship 758 00:45:02,710 --> 00:45:06,530 in the sense that obviously as n grows here, 759 00:45:06,530 --> 00:45:13,300 and alpha-- we'll assume that alpha is greater than 1-- 760 00:45:13,300 --> 00:45:19,480 you are going to get a decrease in this quantity. 761 00:45:19,480 --> 00:45:23,341 So this is going to get closer and closer to 1 as n grows. 762 00:45:23,341 --> 00:45:25,590 So that's the difference between with high probability 763 00:45:25,590 --> 00:45:27,990 and just sort of giving you an expectation number where 764 00:45:27,990 --> 00:45:29,800 you have no such guarantees. 765 00:45:29,800 --> 00:45:33,360 What is interesting about this is that as n grows, 766 00:45:33,360 --> 00:45:36,940 you're going to get a higher and higher probability. 767 00:45:36,940 --> 00:45:41,742 And this constant c is going to be related to alpha. 768 00:45:41,742 --> 00:45:43,950 That's the other thing that's interesting about this.
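Written out, the statement being defined here is roughly the following. The symbol $L$ for the number of levels is introduced only for this note, and the form shown is the one the lecture describes in words (success probability 1 minus an inverse polynomial in n), not a quotation from the board.

```latex
\Pr\bigl[\, L \le c \log n \,\bigr] \;\ge\; 1 - \frac{1}{n^{\alpha}},
\qquad \text{where the constant } c \text{ depends on the chosen } \alpha .
```

The failure term $1/n^{\alpha}$ shrinks as n grows, which is why the guarantee gets closer and closer to 1.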
769 00:45:43,950 --> 00:45:46,680 So it's like saying-- and you can kind of say this 770 00:45:46,680 --> 00:45:51,810 using Chernoff bounds, which we'll get to in a few minutes, 771 00:45:51,810 --> 00:45:54,890 even for expectation as well. 772 00:45:54,890 --> 00:46:00,980 But what this says is that if, for example, c doubled, 773 00:46:00,980 --> 00:46:06,620 then you are saying that your number of levels 774 00:46:06,620 --> 00:46:08,770 is order 4 log n. 775 00:46:08,770 --> 00:46:11,250 I mean I understand that that doesn't make too much sense, 776 00:46:11,250 --> 00:46:14,620 but it's less than or equal to 4 log n plus a constant. 777 00:46:14,620 --> 00:46:18,850 And that 4 is going to get reflected in the alpha here. 778 00:46:21,720 --> 00:46:25,380 When the 4 goes from 4 to 8, the alpha increases. 779 00:46:25,380 --> 00:46:30,600 So the more room that you have with respect to this constant, 780 00:46:30,600 --> 00:46:32,350 the higher the probability. 781 00:46:32,350 --> 00:46:34,760 It becomes an overwhelming probability 782 00:46:34,760 --> 00:46:38,190 that you're going to be within that number of levels. 783 00:46:38,190 --> 00:46:41,240 So maybe there's an 80% probability 784 00:46:41,240 --> 00:46:44,370 that you're within 2 log n. 785 00:46:44,370 --> 00:46:47,460 But there's a 99.99999% probability 786 00:46:47,460 --> 00:46:50,430 that you're within 4 log n, and so on and so forth. 787 00:46:50,430 --> 00:46:53,000 So that's the kind of thing that the with high probability 788 00:46:53,000 --> 00:46:56,630 analysis tells you explicitly. 789 00:46:56,630 --> 00:47:00,210 And so you can do that, you can do this analysis 790 00:47:00,210 --> 00:47:03,980 fairly straightforwardly. 791 00:47:03,980 --> 00:47:08,390 And let me do that on a different board. 792 00:47:08,390 --> 00:47:10,410 Let me go ahead and do that over here. 793 00:47:10,410 --> 00:47:12,328 Actually, I don't really need this.
794 00:47:12,328 --> 00:47:14,110 So let's do that over here. 795 00:47:18,830 --> 00:47:22,510 And so this is our first with high probability analysis. 796 00:47:22,510 --> 00:47:26,210 And I want to prove that warm-up Lemma. 797 00:47:26,210 --> 00:47:28,550 So usually what you do here is you look 798 00:47:28,550 --> 00:47:30,500 at the failure probability. 799 00:47:30,500 --> 00:47:32,940 So with high probability is typically 800 00:47:32,940 --> 00:47:35,980 something that looks like 1 minus 1 801 00:47:35,980 --> 00:47:38,100 divided by n raised to alpha. 802 00:47:38,100 --> 00:47:42,060 And this part here is the failure probability. 803 00:47:42,060 --> 00:47:43,750 And that's typically what you analyze 804 00:47:43,750 --> 00:47:46,430 and what we're going to do today. 805 00:47:46,430 --> 00:47:49,040 So the failure probability is that it's not less 806 00:47:49,040 --> 00:47:52,560 than c log n levels, is the complement of what we just 807 00:47:52,560 --> 00:47:57,250 looked at, which is the probability that it's strictly 808 00:47:57,250 --> 00:47:58,836 greater than c log n levels. 809 00:48:01,610 --> 00:48:14,710 And that's the probability that some element gets promoted 810 00:48:14,710 --> 00:48:16,265 greater than c log n times. 811 00:48:19,080 --> 00:48:24,120 So why would you have more than c log n levels? 812 00:48:24,120 --> 00:48:27,030 It's essentially because you inserted something 813 00:48:27,030 --> 00:48:30,930 and that element got promoted strictly greater than c 814 00:48:30,930 --> 00:48:35,160 log n times, which obviously implies that you 815 00:48:35,160 --> 00:48:37,320 had the sequence of heads, and we'll 816 00:48:37,320 --> 00:48:39,110 get to that in just a second. 
817 00:48:39,110 --> 00:48:43,350 But before we go to that step of figuring out 818 00:48:43,350 --> 00:48:47,130 exactly what's going on here as to why this got promoted 819 00:48:47,130 --> 00:48:48,880 and what the probability of each promotion 820 00:48:48,880 --> 00:48:56,760 is, what I have here is I have a sequence of inserts 821 00:48:56,760 --> 00:48:58,790 potentially that I have to analyze. 822 00:48:58,790 --> 00:49:04,020 And in general, when I have an n element list, 823 00:49:04,020 --> 00:49:06,440 I'm going to assume that each of these elements 824 00:49:06,440 --> 00:49:09,540 got inserted into the list at some point. 825 00:49:09,540 --> 00:49:11,760 So I've had n inserts. 826 00:49:11,760 --> 00:49:16,050 And we just look at the case where you have n inserts, 827 00:49:16,050 --> 00:49:18,690 you could have deletes, and so you could have more inserts, 828 00:49:18,690 --> 00:49:20,980 but it won't really change anything. 829 00:49:20,980 --> 00:49:26,480 You have n inserts corresponding to each of these elements, 830 00:49:26,480 --> 00:49:31,310 and one of those n elements got promoted in this failure case 831 00:49:31,310 --> 00:49:34,380 greater than c log n times. 832 00:49:34,380 --> 00:49:36,320 That's essentially what's happened here. 833 00:49:36,320 --> 00:49:41,120 And so you don't know which one, but you can typically 834 00:49:41,120 --> 00:49:42,930 do this in with high probability analysis 835 00:49:42,930 --> 00:49:45,410 because the probabilities are so small 836 00:49:45,410 --> 00:49:50,545 and they're inverse polynomials, polynomials like n 837 00:49:50,545 --> 00:49:51,620 raised to alpha. 838 00:49:51,620 --> 00:49:53,330 You can use what's called the union bound 839 00:49:53,330 --> 00:49:58,180 that I'm sure you've used before in some context or the other. 
840 00:49:58,180 --> 00:50:00,500 And you essentially say that this 841 00:50:00,500 --> 00:50:03,000 is less than or equal to n times the probability 842 00:50:03,000 --> 00:50:06,810 that a particular element x-- 843 00:50:06,810 --> 00:50:10,740 so you just pick an element, arbitrary element x, 844 00:50:10,740 --> 00:50:12,720 but you pick one-- 845 00:50:12,720 --> 00:50:18,830 gets promoted greater than c log n times. 846 00:50:18,830 --> 00:50:21,440 So you have a small probability. 847 00:50:21,440 --> 00:50:24,760 You have no idea whether these events are independent or not. 848 00:50:24,760 --> 00:50:27,750 The union bound doesn't care about it. 849 00:50:27,750 --> 00:50:31,800 It's like saying you've got a 0.001 probability that any one 850 00:50:31,800 --> 00:50:35,380 of these elements could get promoted greater than c log n 851 00:50:35,380 --> 00:50:38,980 times, and there's 10 of those elements. 852 00:50:38,980 --> 00:50:41,560 You don't know whether they're independent events or not, 853 00:50:41,560 --> 00:50:43,060 but you can certainly use the union 854 00:50:43,060 --> 00:50:46,620 bound that says the overall failure probability is going 855 00:50:46,620 --> 00:50:50,990 to be less than or equal to n, which is 10 in my example, 856 00:50:50,990 --> 00:50:53,460 times that 0.001. 857 00:50:53,460 --> 00:50:55,680 That's basically it. 858 00:50:55,680 --> 00:50:58,460 Now you can go off and say, what does it 859 00:50:58,460 --> 00:51:00,700 mean for an element to get promoted? 860 00:51:00,700 --> 00:51:05,040 What actually has to happen for an element to get promoted? 861 00:51:05,040 --> 00:51:09,970 And you have n times 1 over 2, because you're 862 00:51:09,970 --> 00:51:13,250 flipping a fair coin, and you are 863 00:51:13,250 --> 00:51:18,950 getting c log n heads here. 864 00:51:18,950 --> 00:51:22,230 You flip and you get one promotion.
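The union-bound step can be written compactly as follows. The event names are notation introduced for this note; the only fact used is that the probability of a union of events is at most the sum of their probabilities, whether or not the events are independent.

```latex
\Pr\bigl[\text{some element is promoted more than } c \log n \text{ times}\bigr]
\;\le\; \sum_{x} \Pr\bigl[x \text{ is promoted more than } c \log n \text{ times}\bigr]
\;=\; n \cdot \Pr\bigl[\text{a fixed } x \text{ is promoted more than } c \log n \text{ times}\bigr]
```

In the numeric example from the lecture, 10 elements that each fail with probability 0.001 give an overall failure probability of at most $10 \times 0.001 = 0.01$.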
865 00:51:25,705 --> 00:51:27,800 There's two levels associated with a promotion, 866 00:51:27,800 --> 00:51:31,330 the level you came from and the level you went to. 867 00:51:31,330 --> 00:51:33,980 And so a promotion is a move, so you're 868 00:51:33,980 --> 00:51:37,530 going to have one more level. 869 00:51:37,530 --> 00:51:40,740 If you count levels, then you have the number of promotions, 870 00:51:40,740 --> 00:51:41,670 right? 871 00:51:41,670 --> 00:51:46,510 That just simply corresponds to taking this 1/2 872 00:51:46,510 --> 00:51:50,340 and raising it to c log n, because that's essentially 873 00:51:50,340 --> 00:51:55,330 the number of promotions you have. 874 00:51:55,330 --> 00:52:02,950 And you got n times 1/2 raised to c log n, and what does that turn into? 875 00:52:02,950 --> 00:52:07,990 What is n times 1/2 raised to c log n? 876 00:52:07,990 --> 00:52:12,490 1 over 2 raised to log n would give you? 877 00:52:12,490 --> 00:52:14,790 2 raised to log n 878 00:52:14,790 --> 00:52:15,590 is n, right? 879 00:52:15,590 --> 00:52:20,510 So you got n divided by n raised to c, which 880 00:52:20,510 --> 00:52:23,920 is 1 divided by n raised to c minus 1, 881 00:52:23,920 --> 00:52:27,500 which is 1 divided by n raised to alpha where alpha 882 00:52:27,500 --> 00:52:30,341 is c minus 1. 883 00:52:30,341 --> 00:52:30,882 So that's it. 884 00:52:30,882 --> 00:52:33,970 That's our first with high probability analysis. 885 00:52:33,970 --> 00:52:35,550 Not too hard. 886 00:52:35,550 --> 00:52:39,140 What I've done is done exactly what I just told you 887 00:52:39,140 --> 00:52:42,770 that the notion of with high probability is. 888 00:52:42,770 --> 00:52:48,050 You have a failure probability that is 889 00:52:48,050 --> 00:52:54,380 inverse polynomial, and the degree of the polynomial, alpha, 890 00:52:54,380 --> 00:52:55,621 is related to c.
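The arithmetic in this bound can be checked numerically. The sketch below assumes log means log base 2 (which is what makes 1/2 raised to log n equal 1/n); the function names are made up for illustration, and the simulation is a rough sanity check, not part of the proof.

```python
import math
import random


def failure_bound(n, c):
    """Union bound from the lecture: n * (1/2)**(c * log2(n)), which equals n**(1 - c)."""
    return n * 0.5 ** (c * math.log2(n))


def promotions(rng):
    """Count heads before the first tails with a fair coin (one element's promotions)."""
    count = 0
    while rng.random() < 0.5:
        count += 1
    return count


def empirical_failure_rate(n, c, trials, seed=0):
    """Fraction of trials in which some element is promoted more than c * log2(n) times."""
    rng = random.Random(seed)
    limit = c * math.log2(n)
    failures = 0
    for _ in range(trials):
        if max(promotions(rng) for _ in range(n)) > limit:
            failures += 1
    return failures / trials


bound = failure_bound(1024, 2)  # 1 / 1024, i.e. alpha = c - 1 = 1
```

For n = 1024 and c = 2 the bound is 1/1024, matching 1 over n raised to c minus 1. A call such as `empirical_failure_rate(64, 2, 2000)` should typically come out below `failure_bound(64, 2)`, though any single run is random.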
891 00:52:55,621 --> 00:52:57,120 And so that's what I have out there, 892 00:52:57,120 --> 00:53:00,080 but c equals-- what did it have? 893 00:53:00,080 --> 00:53:04,510 Alpha equals c minus 1 or c equals alpha plus 1. 894 00:53:04,510 --> 00:53:07,090 So what I've done here is done an analysis 895 00:53:07,090 --> 00:53:10,480 that tells you with high probability how many levels 896 00:53:10,480 --> 00:53:14,610 I'm going to have given my insert algorithm. 897 00:53:14,610 --> 00:53:19,290 So this is the first part of what we'd like to show. 898 00:53:19,290 --> 00:53:22,050 This just tells us how big this skip list 899 00:53:22,050 --> 00:53:24,420 is going to grow vertically. 900 00:53:24,420 --> 00:53:29,140 It doesn't tell us anything about the structure of the list 901 00:53:29,140 --> 00:53:35,110 internally as to whether the randomization is going 902 00:53:35,110 --> 00:53:37,970 to cause that pretty structure that you see up 903 00:53:37,970 --> 00:53:42,710 here to be completely messed up to the point where we don't get 904 00:53:42,710 --> 00:53:46,280 order log n search complexity, because we are spending way too 905 00:53:46,280 --> 00:53:49,200 much time let's say on the bottom list or the list 906 00:53:49,200 --> 00:53:51,590 just above the bottom list, et cetera. 907 00:53:51,590 --> 00:53:57,580 So we need to get a sense of how the structure corresponding 908 00:53:57,580 --> 00:54:00,130 to the skip list, whether it's going to look somewhat uniform 909 00:54:00,130 --> 00:54:00,630 or not. 910 00:54:00,630 --> 00:54:02,810 We have to categorize that, and the only way 911 00:54:02,810 --> 00:54:04,320 we're going to characterize that is 912 00:54:04,320 --> 00:54:08,400 by analyzing search and counting the number of moves 913 00:54:08,400 --> 00:54:09,970 that a search makes. 
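The claim about vertical growth can also be explored empirically. The sketch below assumes the insert rule is "flip a fair coin and promote while it comes up heads", and it only counts promotions without building a list; `promotions`, `max_promotions`, and the seed are hypothetical choices for illustration.

```python
import math
import random

def promotions(rng: random.Random) -> int:
    """Count heads before the first tails: promotions for one insert."""
    count = 0
    while rng.random() < 0.5:  # heads with probability 1/2 (assumed fair coin)
        count += 1
    return count

def max_promotions(n: int, rng: random.Random) -> int:
    """Largest promotion count among n independent inserts."""
    return max(promotions(rng) for _ in range(n))

rng = random.Random(6046)  # arbitrary seed, only for repeatability
n, c = 1024, 4
tallest = max_promotions(n, rng)
print("most promotions:", tallest, "| c log n =", c * math.log2(n))
```

A simulation like this says nothing about how evenly the elements are spread within each level, which is why the search analysis that follows is still needed.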
914 00:54:09,970 --> 00:54:11,510 And the reason it's more complicated 915 00:54:11,510 --> 00:54:15,660 than what you see up there is that in a search, as you 916 00:54:15,660 --> 00:54:19,110 can see, you're going to be moving at different levels. 917 00:54:19,110 --> 00:54:21,500 You're going to be moving at the top level. 918 00:54:21,500 --> 00:54:24,410 Maybe a relatively small number of moves, 919 00:54:24,410 --> 00:54:28,130 you're going to pop down one, move a few moves at that level, 920 00:54:28,130 --> 00:54:30,160 pop down, et cetera, et cetera. 921 00:54:30,160 --> 00:54:32,680 So there's a lot of things going on in search which 922 00:54:32,680 --> 00:54:35,860 happen at different levels, and the total cost 923 00:54:35,860 --> 00:54:38,920 is going to have to be all of the moves. 924 00:54:38,920 --> 00:54:42,200 So we're going to think about all of the moves-- 925 00:54:42,200 --> 00:54:45,760 right moves, down moves, and add them all up. 926 00:54:45,760 --> 00:54:49,090 They all have to be order log n with high probability. 927 00:54:49,090 --> 00:54:52,310 There's no getting around that because each of them costs you. 928 00:54:52,310 --> 00:54:59,400 So that's the thing that we'll spend the next 20 minutes on. 929 00:54:59,400 --> 00:55:04,640 And the theorem that we'd like to prove for search 930 00:55:04,640 --> 00:55:08,140 is that-- this is what I just said-- 931 00:55:08,140 --> 00:55:26,920 any search in an n element skip list costs order log n w.h.p. 932 00:55:26,920 --> 00:55:30,780 So it doesn't matter how this skip list looks. 933 00:55:30,780 --> 00:55:32,800 There's n elements, they got inserted 934 00:55:32,800 --> 00:55:34,510 using the insert algorithm-- that's 935 00:55:34,510 --> 00:55:37,450 important to know if you're going to have to use that. 936 00:55:37,450 --> 00:55:41,300 And when I do a search for an element, it may be in there, 937 00:55:41,300 --> 00:55:42,854 it may not be in there.
938 00:55:42,854 --> 00:55:43,770 Doesn't really matter. 939 00:55:43,770 --> 00:55:46,940 We'll assume a successful search. 940 00:55:46,940 --> 00:55:51,200 That is going to cost me order log n with high probability. 941 00:55:51,200 --> 00:55:55,130 And the cool idea here in terms of analyzing the search 942 00:55:55,130 --> 00:55:58,930 in order to figure out how we're going to add up 943 00:55:58,930 --> 00:56:01,280 all of these moves is we're going to analyze 944 00:56:01,280 --> 00:56:04,470 the search backwards. 945 00:56:04,470 --> 00:56:05,460 So that's a cool idea. 946 00:56:09,350 --> 00:56:12,780 So what does that mean exactly? 947 00:56:12,780 --> 00:56:15,890 Well, what that means is that we're 948 00:56:15,890 --> 00:56:18,280 going to think about this b search, 949 00:56:18,280 --> 00:56:24,470 which think of it as the backward search, starts-- 950 00:56:24,470 --> 00:56:28,540 it actually ends, so that's what I'm writing in brackets here, 951 00:56:28,540 --> 00:56:31,000 at the node in the bottom list. 952 00:56:31,000 --> 00:56:35,300 So we're assuming a successful search, as I mentioned before. 953 00:56:35,300 --> 00:56:40,520 Otherwise, the point would just be in between two members. 954 00:56:40,520 --> 00:56:44,180 You know that it's not in there because you're looking for 67 955 00:56:44,180 --> 00:56:48,820 and you see 66 to your left and 72 to your right. 956 00:56:48,820 --> 00:56:51,940 So either way it works, but keep in mind 957 00:56:51,940 --> 00:56:55,340 that it's a successful search because it just makes 958 00:56:55,340 --> 00:56:58,310 things a little bit easier. 
959 00:56:58,310 --> 00:57:06,760 Now, at each node that we visit, what we're going to do 960 00:57:06,760 --> 00:57:16,920 is we're going to say that if the node was not promoted 961 00:57:16,920 --> 00:57:20,480 higher, then what actually happened here 962 00:57:20,480 --> 00:57:24,030 was that when you inserted that particular element, 963 00:57:24,030 --> 00:57:26,330 you got a tails. 964 00:57:26,330 --> 00:57:29,020 Because otherwise you would have gotten a heads, 965 00:57:29,020 --> 00:57:31,820 that element would have been promoted higher. 966 00:57:31,820 --> 00:57:38,250 Then you go-- and that really means 967 00:57:38,250 --> 00:57:44,200 that you came from the left-hand side, so you make a left move. 968 00:57:44,200 --> 00:57:47,820 Now, search of course makes down moves and right moves, 969 00:57:47,820 --> 00:57:50,680 but this is a backward search so it's going to make left moves 970 00:57:50,680 --> 00:57:53,750 and up moves. 971 00:57:53,750 --> 00:57:55,390 What else do I have here? 972 00:57:55,390 --> 00:58:06,400 Running out of room, so let me-- let's continue with that. 973 00:58:18,100 --> 00:58:19,320 All right. 974 00:58:19,320 --> 00:58:29,050 And now the case is if the node was promoted 975 00:58:29,050 --> 00:58:34,510 higher, that means we got heads here 976 00:58:34,510 --> 00:58:36,790 in that particular insertion. 977 00:58:36,790 --> 00:58:43,990 Then we go, and that means that during the search 978 00:58:43,990 --> 00:58:49,280 we came from upstairs. 979 00:58:49,280 --> 00:58:52,830 And then lastly, we stop, which means 980 00:58:52,830 --> 00:59:06,070 we start when we reach the top level or minus infinity if we 981 00:59:06,070 --> 00:59:08,630 go all the way back. 982 00:59:08,630 --> 00:59:10,100 So that's it. 983 00:59:10,100 --> 00:59:13,360 A lot of writing here, but this should make things clear. 984 00:59:13,360 --> 00:59:18,020 So let's say that we're searching for 66. 
985 00:59:18,020 --> 00:59:20,430 I want to trace through what the backwards path would 986 00:59:20,430 --> 00:59:24,890 look like, and keep that code in mind as I do this. 987 00:59:24,890 --> 00:59:27,717 So I'm searching for 66, and obviously, we 988 00:59:27,717 --> 00:59:28,550 know how to find it. 989 00:59:28,550 --> 00:59:29,470 We've done that. 990 00:59:29,470 --> 00:59:32,650 But let's go backwards as to what exactly 991 00:59:32,650 --> 00:59:36,380 happened when we look for 66. 992 00:59:36,380 --> 00:59:42,230 When we look for 66, right at this point when you see 66, 993 00:59:42,230 --> 00:59:43,790 where would you have come from? 994 00:59:43,790 --> 00:59:44,665 AUDIENCE: [INAUDIBLE] 995 00:59:44,665 --> 00:59:46,880 SRINIVAS DEVADAS: You'd have come from the top. 996 00:59:46,880 --> 00:59:50,600 And so if you go look at what happens here, 997 00:59:50,600 --> 00:59:54,800 the node when it got inserted was promoted one level. 998 00:59:54,800 --> 00:59:59,591 So that means that you would go up top in the backward search 999 00:59:59,591 --> 01:00:00,090 first. 1000 01:00:00,090 --> 01:00:03,140 Your first move would be going up like that. 1001 01:00:03,140 --> 01:00:07,390 Now, if there's a 66 up there, you would go up one more. 1002 01:00:07,390 --> 01:00:09,340 But there's not, so you go left. 1003 01:00:12,270 --> 01:00:13,720 You go to 50. 1004 01:00:13,720 --> 01:00:17,310 And when you have a 50 up here, would you stay on this level? 1005 01:00:17,310 --> 01:00:18,291 AUDIENCE: No. 1006 01:00:18,291 --> 01:00:19,166 SRINIVAS DEVADAS: No. 1007 01:00:19,166 --> 01:00:22,690 You'd go up to 50 because the first chance 1008 01:00:22,690 --> 01:00:26,020 you get you want to get up to the higher levels. 1009 01:00:26,020 --> 01:00:28,215 And again, this 50 was promoted so you go up there, 1010 01:00:28,215 --> 01:00:33,230 and you go to 14, and pretty much that's the end of that. 
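The backward trace for 66 can be replayed in a few lines. The board is not part of this transcript, so the level layout in `LEVELS` is an assumption chosen only so that it reproduces the path described (up, left to 50, up, left to 14, up); `backward_moves` is a hypothetical helper, with 14 standing in for the leftmost "minus infinity" column.

```python
# Assumed layout: LEVELS[0] is the bottom list, LEVELS[-1] the top list.
LEVELS = [
    [14, 23, 34, 42, 50, 59, 66, 72, 79],
    [14, 34, 50, 66, 79],
    [14, 50, 79],
    [14, 79],
]

def backward_moves(levels, key):
    """Replay a successful search for `key` backwards, from the bottom list."""
    level, node, moves = 0, key, []
    top = len(levels) - 1
    while level < top:
        if node in levels[level + 1]:
            # The node was promoted (heads), so the search came from above.
            moves.append("up")
            level += 1
        else:
            # The node was not promoted (tails), so the search came from the left.
            i = levels[level].index(node)
            if i == 0:
                break  # reached the leftmost column, nothing further left
            node = levels[level][i - 1]
            moves.append("left")
    return moves

print(backward_moves(LEVELS, 66))
```

Reversing the returned list and swapping "up" for "down" and "left" for "right" gives the forward search path.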
1011 01:00:33,230 --> 01:00:38,860 So this would look like you go like that, you have an up move, 1012 01:00:38,860 --> 01:00:42,860 then you have a left move-- different colors here 1013 01:00:42,860 --> 01:00:47,570 would be good-- then you have an up move, 1014 01:00:47,570 --> 01:00:52,270 and a left, and then an up. 1015 01:00:52,270 --> 01:00:55,140 So that's our backward search. 1016 01:00:55,140 --> 01:00:58,940 And it's not that complicated, hopefully. 1017 01:00:58,940 --> 01:01:01,980 If you're looking for 66 or 59, you do that. 1018 01:01:01,980 --> 01:01:05,710 So it's much more natural, and you just need to flip it. 1019 01:01:05,710 --> 01:01:07,520 Why am I doing all this? 1020 01:01:07,520 --> 01:01:09,720 Well, the reason I'm doing all this 1021 01:01:09,720 --> 01:01:15,770 is that I have to do some bounding of the moves, 1022 01:01:15,770 --> 01:01:21,350 and I know that the moves that correspond to the up moves 1023 01:01:21,350 --> 01:01:25,280 are probabilistic in the sense that the reason I'm making them 1024 01:01:25,280 --> 01:01:29,170 is because I flipped heads at some point. 1025 01:01:29,170 --> 01:01:32,560 So all of this is going to turn into counting 1026 01:01:32,560 --> 01:01:36,350 how many coin flips come out heads 1027 01:01:36,350 --> 01:01:39,672 in a long stream of coin flips. 1028 01:01:39,672 --> 01:01:41,130 So that's what this backward search 1029 01:01:41,130 --> 01:01:42,630 is going to allow us to do. 1030 01:01:42,630 --> 01:01:47,270 And that crucial thing is what we'll look at next. 1031 01:01:47,270 --> 01:01:50,730 So the analysis itself is a bit painful, 1032 01:01:50,730 --> 01:01:52,139 but there's a bunch of algebra. 
1033 01:01:52,139 --> 01:01:53,680 But what I want to do is to make sure 1034 01:01:53,680 --> 01:01:57,950 that you get the high level picture, number one, 1035 01:01:57,950 --> 01:02:08,500 and the insights as to why the expected value or the with 1036 01:02:08,500 --> 01:02:10,810 high probability value is going to be order log n. 1037 01:02:10,810 --> 01:02:13,164 But the key is the strategy. 1038 01:02:13,164 --> 01:02:14,580 So we're going to go off and we're 1039 01:02:14,580 --> 01:02:15,746 going to prove this theorem. 1040 01:02:22,080 --> 01:02:38,330 Our backward search makes up moves and left moves. 1041 01:02:38,330 --> 01:02:38,900 We know that. 1042 01:02:42,480 --> 01:02:48,310 Each with probability 1/2. 1043 01:02:48,310 --> 01:02:52,910 And the reason for that is that when you go up 1044 01:02:52,910 --> 01:02:55,050 it's because you got a heads, and if you 1045 01:02:55,050 --> 01:02:58,880 didn't get a heads and you got a tails, that meant you go left. 1046 01:02:58,880 --> 01:03:01,960 Because of the previous element, every time you're 1047 01:03:01,960 --> 01:03:06,230 passing these elements that are inserted, 1048 01:03:06,230 --> 01:03:09,660 and they were inserted by flipping coins. 1049 01:03:09,660 --> 01:03:13,436 So that's key point number one. 1050 01:03:13,436 --> 01:03:15,310 All of that, if you look at what happens here 1051 01:03:15,310 --> 01:03:17,650 when I drew this out, you got heads here 1052 01:03:17,650 --> 01:03:19,630 and you got tails there. 1053 01:03:19,630 --> 01:03:21,460 So each of those things for a fair coin 1054 01:03:21,460 --> 01:03:23,370 is happening with probability 1/2. 1055 01:03:23,370 --> 01:03:26,120 And it's all about coin flips here. 1056 01:03:26,120 --> 01:03:38,700 Now, the number of moves going up 1057 01:03:38,700 --> 01:03:44,750 is less than the number of levels-- the number of levels 1058 01:03:44,750 --> 01:03:46,100 is one more than that.
1059 01:03:46,100 --> 01:03:52,230 And we've shown that that's c log n with high probability 1060 01:03:52,230 --> 01:03:53,480 by the warm-up Lemma. 1061 01:03:53,480 --> 01:03:55,370 That's what this just did. 1062 01:03:55,370 --> 01:03:59,540 The number of up moves-- I mean you can't go off the list here. 1063 01:03:59,540 --> 01:04:01,720 This list is now you're not inserting anymore, 1064 01:04:01,720 --> 01:04:02,840 you're doing a search. 1065 01:04:02,840 --> 01:04:04,750 So it's not like you're going to be adding 1066 01:04:04,750 --> 01:04:06,460 levels or anything like that. 1067 01:04:06,460 --> 01:04:09,070 So the number of up moves we've taken care of. 1068 01:04:09,070 --> 01:04:11,970 So this last thing here which I'm going to write out here 1069 01:04:11,970 --> 01:04:15,600 is the key observation, which is going to make 1070 01:04:15,600 --> 01:04:17,880 the whole analysis possible. 1071 01:04:17,880 --> 01:04:23,400 And so this last thing it says that the total number 1072 01:04:23,400 --> 01:04:27,260 of moves-- so now the total number of moves has to include, 1073 01:04:27,260 --> 01:04:28,820 obviously, the up moves and the left 1074 01:04:28,820 --> 01:04:30,470 moves, and there's no other kind. 1075 01:04:33,146 --> 01:04:38,770 The total number of moves is going 1076 01:04:38,770 --> 01:04:51,258 to correspond to the number of moves 1077 01:04:51,258 --> 01:05:04,317 till you get c log n up moves. 1078 01:05:07,570 --> 01:05:09,720 So what does that mean? 1079 01:05:09,720 --> 01:05:11,530 There's some sequence of heads and tails 1080 01:05:11,530 --> 01:05:15,270 that I'm getting, each of them with probability 1/2. 1081 01:05:15,270 --> 01:05:19,090 Every time that I got a heads, I moved up a level. 1082 01:05:19,090 --> 01:05:23,140 The fact of the matter is that I can't get more than c log n 1083 01:05:23,140 --> 01:05:27,000 heads because I'm going to run out of levels. 1084 01:05:27,000 --> 01:05:28,980 That's it. 
1085 01:05:28,980 --> 01:05:33,530 I'm going to run out of room vertically if I keep popping up 1086 01:05:33,530 --> 01:05:35,700 and keep doing up moves. 1087 01:05:35,700 --> 01:05:39,484 So at that point I'm forced to go left. 1088 01:05:39,484 --> 01:05:40,900 Maybe I'm going left in the middle 1089 01:05:40,900 --> 01:05:44,220 there when I still had a chance to go up. 1090 01:05:44,220 --> 01:05:47,390 That corresponds to getting a tails as opposed to a heads. 1091 01:05:47,390 --> 01:05:50,910 But I can limit the total number of moves 1092 01:05:50,910 --> 01:05:53,850 from a probabilistic standpoint by saying 1093 01:05:53,850 --> 01:05:57,370 during that sequence of coin flips I only 1094 01:05:57,370 --> 01:05:59,500 have a certain number of heads that I 1095 01:05:59,500 --> 01:06:01,080 could have possibly gotten. 1096 01:06:01,080 --> 01:06:04,910 Because if I got more heads than that, I would be up top. 1097 01:06:04,910 --> 01:06:10,120 I'd be out of the skip list, and that doesn't work. 1098 01:06:10,120 --> 01:06:13,210 So the total number of moves is the number of moves 1099 01:06:13,210 --> 01:06:18,720 till you get c log n up moves, which essentially corresponds 1100 01:06:18,720 --> 01:06:24,210 to-- now, forget about skip lists for a second. 1101 01:06:24,210 --> 01:06:28,590 Our claim is the total number of moves 1102 01:06:28,590 --> 01:06:33,950 is the number of coin flips, so these are the same, 1103 01:06:33,950 --> 01:06:37,090 because every move corresponds to a coin flip. 1104 01:06:37,090 --> 01:06:41,720 Until-- it's a fair coin, probability 1/2-- 1105 01:06:41,720 --> 01:06:49,620 until c log n heads have been obtained. 1106 01:06:49,620 --> 01:06:52,800 So the number of coin flips until c 1107 01:06:52,800 --> 01:06:56,880 log n heads is the total number of moves. 1108 01:06:56,880 --> 01:06:57,920 This equals that. 
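The quantity in the claim, the number of fair coin flips until a target number of heads has been obtained, is easy to simulate. This is a small sketch with nothing skip-list-specific in it; `flips_until_heads`, the seed, and the choice of 20 heads (c log n for n = 1024 and c = 2, assuming log base 2) are illustrative assumptions.

```python
import random

def flips_until_heads(target_heads: int, rng: random.Random) -> int:
    """Flip a fair coin until `target_heads` heads have appeared; return total flips."""
    flips = heads = 0
    while heads < target_heads:
        flips += 1
        if rng.random() < 0.5:  # heads, which corresponds to an up move
            heads += 1
    return flips

rng = random.Random(1989)  # arbitrary seed, only for repeatability
target = 20                # stands in for c log n
samples = [flips_until_heads(target, rng) for _ in range(1000)]
print("average flips:", sum(samples) / len(samples), "| worst case seen:", max(samples))
```

On average this takes about twice the target number of flips; the with-high-probability statement is the stronger claim that the count rarely strays far above a constant multiple of the target.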
1109 01:07:00,450 --> 01:07:06,600 And what we now want to show, if you believe that, and hopefully 1110 01:07:06,600 --> 01:07:08,480 you do because the argument is simply 1111 01:07:08,480 --> 01:07:15,740 that you run out of levels, that this is order log n w.h.p. 1112 01:07:15,740 --> 01:07:17,450 That's why it's a claim. 1113 01:07:17,450 --> 01:07:21,550 So the observation is that the number of coin 1114 01:07:21,550 --> 01:07:24,330 flips, as you flip a fair coin, until you 1115 01:07:24,330 --> 01:07:28,700 get c log n heads will give you the number of moves 1116 01:07:28,700 --> 01:07:33,270 in your search, total number of moves in your search. 1117 01:07:33,270 --> 01:07:35,880 It includes the up moves as well as the left moves. 1118 01:07:35,880 --> 01:07:41,220 And now what we have to show is that that 1119 01:07:41,220 --> 01:07:44,150 is going to be order log n with high probability. 1120 01:07:44,150 --> 01:07:45,240 OK? 1121 01:07:45,240 --> 01:07:48,650 And then once you do that you've done two things. 1122 01:07:48,650 --> 01:07:55,830 You've bounded the number of levels in the skip list 1123 01:07:55,830 --> 01:07:58,910 to be order log n with high probability. 1124 01:07:58,910 --> 01:08:01,470 And you've said the number of moves in the search 1125 01:08:01,470 --> 01:08:06,110 is order log n with high probability assuming 1126 01:08:06,110 --> 01:08:11,240 that the number of levels is c log n, obviously. 1127 01:08:11,240 --> 01:08:15,650 So it's not that the bottom one subsumes the top one. 1128 01:08:15,650 --> 01:08:18,560 It's the last thing to keep in mind as we get all 1129 01:08:18,560 --> 01:08:22,520 of these items out of the way. 1130 01:08:22,520 --> 01:08:26,439 This assumes that there are less than or equal to c log n 1131 01:08:26,439 --> 01:08:27,155 levels. 1132 01:08:27,155 --> 01:08:29,279 That's the only reason why I could make an argument 1133 01:08:29,279 --> 01:08:31,149 that I've run out of levels. 
1134 01:08:31,149 --> 01:08:35,036 So if I have this event A here-- if I call this event A, 1135 01:08:35,036 --> 01:08:39,510 and I have this event B, what I really want 1136 01:08:39,510 --> 01:08:43,390 is-- I've shown you that event A happens with high probability. 1137 01:08:43,390 --> 01:08:45,149 That's the warm-up Lemma. 1138 01:08:45,149 --> 01:08:48,649 I need to show you that event B happens with high probability. 1139 01:08:48,649 --> 01:08:51,680 And then I have to show you that event A and event B 1140 01:08:51,680 --> 01:08:56,490 happen with high probability, because I need both. 1141 01:08:56,490 --> 01:08:57,149 Any questions? 1142 01:08:57,149 --> 01:08:59,460 We're stopping a minute here. 1143 01:08:59,460 --> 01:09:01,870 The rest of the analysis, a bunch of algebra, 1144 01:09:01,870 --> 01:09:03,910 we'll get through it, you can look at the notes. 1145 01:09:03,910 --> 01:09:05,920 This is the key point. 1146 01:09:05,920 --> 01:09:08,762 If you got this, you got it. 1147 01:09:08,762 --> 01:09:09,262 Yeah. 1148 01:09:09,262 --> 01:09:11,553 AUDIENCE: Can you just say that because the probability 1149 01:09:11,553 --> 01:09:15,869 of drawing an up move instead of a left move 1150 01:09:15,869 --> 01:09:21,265 is 1/2, that the expected number of left moves 1151 01:09:21,265 --> 01:09:25,227 should be equal to the number of up moves, [INAUDIBLE] 1152 01:09:25,227 --> 01:09:26,649 bound the up moves? 1153 01:09:26,649 --> 01:09:28,229 SRINIVAS DEVADAS: So the argument 1154 01:09:28,229 --> 01:09:32,410 is that since you have 1/2, can you 1155 01:09:32,410 --> 01:09:37,470 simply say that the expected number of left moves 1156 01:09:37,470 --> 01:09:40,490 is going to be the same as the up moves? 1157 01:09:40,490 --> 01:09:42,790 You can make arguments about expectation.
1158 01:09:42,790 --> 01:09:46,200 You can say that at any level, the number of left moves 1159 01:09:46,200 --> 01:09:50,090 that you're going to have is going to be two in expectation. 1160 01:09:50,090 --> 01:09:54,290 It's not going to give you your with high probability proof. 1161 01:09:54,290 --> 01:09:57,410 It's not going to relate that to the 1 divided 1162 01:09:57,410 --> 01:09:58,630 by n raised to alpha. 1163 01:09:58,630 --> 01:10:02,430 But I will tell you that if you just wanted to show expectation 1164 01:10:02,430 --> 01:10:04,990 for search is order log n, you won't 1165 01:10:04,990 --> 01:10:08,400 have to jump through all of these hoops. 1166 01:10:08,400 --> 01:10:11,270 At some level you'll be making the assumptions 1167 01:10:11,270 --> 01:10:13,927 that I've made explicit here through my observations 1168 01:10:13,927 --> 01:10:15,135 when you do that expectation. 1169 01:10:15,135 --> 01:10:19,540 So if you really want to write a precise proof of expected 1170 01:10:19,540 --> 01:10:22,320 value for search complexity, you would 1171 01:10:22,320 --> 01:10:25,880 have to do a lot of the things that I'm doing here. 1172 01:10:25,880 --> 01:10:27,380 I'm not saying you waved your hands. 1173 01:10:27,380 --> 01:10:30,120 You did not. 1174 01:10:30,120 --> 01:10:34,220 But it needed more than what you just said. 1175 01:10:34,220 --> 01:10:35,820 OK? 1176 01:10:35,820 --> 01:10:40,580 So this is pretty much what the analysis is. 1177 01:10:40,580 --> 01:10:43,800 With high probability analysis we bounded the vertical, 1178 01:10:43,800 --> 01:10:45,920 we bounded the number of moves. 1179 01:10:45,920 --> 01:10:48,710 Assuming the vertical was bounded, 1180 01:10:48,710 --> 01:10:51,350 we got the result for the number of moves. 1181 01:10:51,350 --> 01:10:53,720 So both of those happen with high probability.
1182 01:10:53,720 --> 01:10:56,570 You got your result, which is the theorem 1183 01:10:56,570 --> 01:11:00,950 that we have somewhere. 1184 01:11:00,950 --> 01:11:02,770 Woah, did I erase the theorem? 1185 01:11:02,770 --> 01:11:04,150 AUDIENCE: [INAUDIBLE]. 1186 01:11:04,150 --> 01:11:05,525 SRINIVAS DEVADAS: It's somewhere. 1187 01:11:06,901 --> 01:11:07,400 All right. 1188 01:11:07,400 --> 01:11:08,270 Good. 1189 01:11:08,270 --> 01:11:10,780 So let's do what we can with respect 1190 01:11:10,780 --> 01:11:14,980 to showing this theorem. 1191 01:11:14,980 --> 01:11:17,810 There's a couple ways that you could prove this. 1192 01:11:17,810 --> 01:11:26,910 There's a way that you could use a Chernoff bound. 1193 01:11:26,910 --> 01:11:29,840 And this is kind of a cool result 1194 01:11:29,840 --> 01:11:32,700 that I think is worth knowing. 1195 01:11:32,700 --> 01:11:34,430 I don't know if you've seen this, 1196 01:11:34,430 --> 01:11:38,430 but this is a seminal theorem by Chernoff 1197 01:11:38,430 --> 01:11:55,220 that says if you have a random variable representing 1198 01:11:55,220 --> 01:12:00,700 the total number of tails, let's say-- 1199 01:12:00,700 --> 01:12:08,310 it could be heads as well-- in a series of m-- 1200 01:12:08,310 --> 01:12:22,110 not n, m-- independent coin flips where each flip has 1201 01:12:22,110 --> 01:12:30,200 a probability p of coming up heads, 1202 01:12:30,200 --> 01:12:38,750 then for all r greater than 0, we have 1203 01:12:38,750 --> 01:12:45,040 this beautiful result that says the probability that y, 1204 01:12:45,040 --> 01:12:53,320 which is a random variable-- a particular instance 1205 01:12:53,320 --> 01:12:58,980 when you evaluate it-- that it is larger 1206 01:12:58,980 --> 01:13:03,210 than the expectation by r is bounded. 1207 01:13:03,210 --> 01:13:07,560 So just a beautiful result that says here's 1208 01:13:07,560 --> 01:13:12,520 a random variable that corresponds to flipping a coin. 
1209 01:13:12,520 --> 01:13:15,700 I'm going to flip this a bunch of times, 1210 01:13:15,700 --> 01:13:17,510 and I know what the expectation is. 1211 01:13:17,510 --> 01:13:21,790 If it's a fair coin of 1/2, then I'm 1212 01:13:21,790 --> 01:13:24,400 going to get m over 2-- expected number of heads 1213 01:13:24,400 --> 01:13:25,760 is going to be m over 2. 1214 01:13:25,760 --> 01:13:28,040 Expected number of tails is going to be m over 2. 1215 01:13:28,040 --> 01:13:30,190 If it's p, then obviously it's a little bit 1216 01:13:30,190 --> 01:13:32,600 different-- p times m. 1217 01:13:32,600 --> 01:13:37,850 But what I have here is if you tell me what the probability is 1218 01:13:37,850 --> 01:13:40,500 that I'm 10 away from the expectation 1219 01:13:40,500 --> 01:13:44,670 and that would imply that r is 10, then that is bounded by e 1220 01:13:44,670 --> 01:13:48,240 raised to minus 2 times 10 square divided by m. 1221 01:13:48,240 --> 01:13:50,369 So that's Chernoff's bound. 1222 01:13:50,369 --> 01:13:52,910 And you can see how this relates to our with high probability 1223 01:13:52,910 --> 01:13:53,860 analysis. 1224 01:13:53,860 --> 01:13:55,290 Because our with high probability 1225 01:13:55,290 --> 01:13:57,110 analysis is exactly this. 1226 01:13:57,110 --> 01:14:00,830 This is the hammer that you can use to do with high probability 1227 01:14:00,830 --> 01:14:01,690 analysis. 1228 01:14:01,690 --> 01:14:04,730 Because this tells you as you get further and further away 1229 01:14:04,730 --> 01:14:07,460 from the average or you get further and further away 1230 01:14:07,460 --> 01:14:10,090 from the expectation, what the probability is that you're 1231 01:14:10,090 --> 01:14:11,960 going to be so far away. 1232 01:14:11,960 --> 01:14:19,260 What is the probability that in 100 coin flips that are fair, 1233 01:14:19,260 --> 01:14:22,390 you get 50 heads? 
1234 01:14:22,390 --> 01:14:25,440 It's a reasonably large number because the expected value 1235 01:14:25,440 --> 01:14:28,590 corresponds to 50. 1236 01:14:28,590 --> 01:14:30,470 So r is 0. 1237 01:14:30,470 --> 01:14:32,755 So that just says this is a-- well, 1238 01:14:32,755 --> 01:14:35,130 it doesn't tell you much because this says it's less than 1239 01:14:35,130 --> 01:14:36,750 or equal to 1. 1240 01:14:36,750 --> 01:14:38,400 That's all it says. 1241 01:14:38,400 --> 01:14:43,360 But if you had 75, what is the probability that you 1242 01:14:43,360 --> 01:14:48,370 get 75 heads when you flip a coin 100 times? 1243 01:14:48,370 --> 01:14:53,390 Then e of y for a fair coin would be 50, r would be 25, 1244 01:14:53,390 --> 01:14:56,220 and you'd go off and you could do the math for that. 1245 01:14:56,220 --> 01:14:59,670 So it's a beautiful relationship that tells you 1246 01:14:59,670 --> 01:15:05,050 how the probabilities change as your random variable value is 1247 01:15:05,050 --> 01:15:07,880 further and further away from the expectation. 1248 01:15:07,880 --> 01:15:09,900 And you can imagine that this is going 1249 01:15:09,900 --> 01:15:19,110 to be very useful in showing our with high probability result. 1250 01:15:19,110 --> 01:15:22,760 And I think what I have time for is just 1251 01:15:22,760 --> 01:15:27,810 to give you a sense of how this result works out-- I'm not 1252 01:15:27,810 --> 01:15:28,960 going to do the algebra. 1253 01:15:28,960 --> 01:15:32,610 I don't think it's worth it to write all of this on the board 1254 01:15:32,610 --> 01:15:35,340 when you can read it in the notes.
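The 100-flip example can be worked through directly. This sketch evaluates the bound in the form quoted in the lecture, e raised to minus 2 r squared over m, and compares it with the exact binomial tail for a fair coin; the function names are hypothetical.

```python
import math

def chernoff_bound(m: int, r: float) -> float:
    """Bound on Pr[Y >= E[Y] + r] for m independent flips: exp(-2 r^2 / m)."""
    return math.exp(-2 * r * r / m)

def exact_upper_tail(m: int, k: int) -> float:
    """Exact Pr[at least k heads in m fair flips], for comparison."""
    return sum(math.comb(m, j) for j in range(k, m + 1)) / 2 ** m

m = 100
print("r = 0  bound:", chernoff_bound(m, 0))    # uninformative: at most 1
print("r = 25 bound:", chernoff_bound(m, 25))   # 75 or more heads in 100 flips
print("r = 25 exact:", exact_upper_tail(m, 75))
```

The bound is looser than the exact tail, but it has a closed form, which is what makes it convenient for turning a deviation into an inverse polynomial in n.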
1255 01:15:35,340 --> 01:15:37,260 But the bottom line is we're going 1256 01:15:37,260 --> 01:15:47,730 to show this little Lemma that says for any c, 1257 01:15:47,730 --> 01:15:53,330 invoking this Chernoff bound, there's a constant d, 1258 01:15:53,330 --> 01:16:05,406 such that with high probability, the number of heads 1259 01:16:05,406 --> 01:16:09,510 in flipping d log n. 1260 01:16:09,510 --> 01:16:11,240 So I have a new constant here. 1261 01:16:11,240 --> 01:16:15,830 d log n fair coins, or a single fair coin, 1262 01:16:15,830 --> 01:16:20,040 d log n times, assuming independence, 1263 01:16:20,040 --> 01:16:23,380 is at least c log n. 1264 01:16:23,380 --> 01:16:24,780 So what does this say? 1265 01:16:24,780 --> 01:16:26,390 A lot of words. 1266 01:16:26,390 --> 01:16:32,320 It just says, hey, you want an order log n 1267 01:16:32,320 --> 01:16:34,270 bound here eventually. 1268 01:16:34,270 --> 01:16:36,590 The beauty of order log n is that there's a constant 1269 01:16:36,590 --> 01:16:38,760 in there that you control. 1270 01:16:38,760 --> 01:16:41,420 That constant is d. 1271 01:16:41,420 --> 01:16:46,530 So you tell me that c log n is 50. 1272 01:16:46,530 --> 01:16:49,590 So c log n is 50. 1273 01:16:49,590 --> 01:16:52,570 Then what I'm going to do is I'm going to say something like, 1274 01:16:52,570 --> 01:17:00,760 well, if I flip a coin 1,000 times, then 1275 01:17:00,760 --> 01:17:02,970 I'm going to have an overwhelming probability 1276 01:17:02,970 --> 01:17:06,070 that I'm going to get 50 heads. 1277 01:17:06,070 --> 01:17:06,860 And that's it. 1278 01:17:06,860 --> 01:17:10,040 That's what the Lemma says. 1279 01:17:10,040 --> 01:17:12,900 It says tell me what c log n is. 1280 01:17:12,900 --> 01:17:14,430 Give me that value. 
1281 01:17:14,430 --> 01:17:18,970 And I will find you a d, such that by invoking Chernoff, 1282 01:17:18,970 --> 01:17:22,250 I'm going to show you an overwhelming probability that 1283 01:17:22,250 --> 01:17:25,924 for that d you're going to get at least c log n heads. 1284 01:17:25,924 --> 01:17:26,840 So everybody buy that? 1285 01:17:26,840 --> 01:17:30,117 Make sense from what you see up there? 1286 01:17:30,117 --> 01:17:31,520 Yup? 1287 01:17:31,520 --> 01:17:33,920 So this essentially can be shown-- 1288 01:17:33,920 --> 01:17:35,640 it turns out that what you have to do 1289 01:17:35,640 --> 01:17:38,110 is-- and you don't have to choose 8, 1290 01:17:38,110 --> 01:17:41,620 but you can choose d equals 8c. 1291 01:17:41,620 --> 01:17:44,030 Just choose d equals 8c and you'll 1292 01:17:44,030 --> 01:17:48,100 see the algebra in the notes corresponding to what 1293 01:17:48,100 --> 01:17:49,560 each of these values are. 1294 01:17:49,560 --> 01:17:54,570 So e of y, just to tell you, would be m over 2. 1295 01:17:54,570 --> 01:17:58,940 You're flipping m coins, fair coin with probability 1/2. 1296 01:17:58,940 --> 01:18:01,075 So you got m over 2. 1297 01:18:01,075 --> 01:18:02,450 And then the last thing that I'll 1298 01:18:02,450 --> 01:18:08,980 tell you is that what you want in terms of invoking that, 1299 01:18:08,980 --> 01:18:12,660 you want r-- remember we were talking about tails here-- so r 1300 01:18:12,660 --> 01:18:18,990 is going to be d log n minus c log n. 1301 01:18:18,990 --> 01:18:23,780 So you just invoke Chernoff with e of y equals m over 2. 1302 01:18:23,780 --> 01:18:27,370 And what you're saying here is you want c log n heads. 1303 01:18:27,370 --> 01:18:34,195 You want to make sure you get c log n heads, which 1304 01:18:34,195 --> 01:18:35,820 means that the number of tails is going 1305 01:18:35,820 --> 01:18:38,390 to be d log n minus c log n. 
1306 01:18:38,390 --> 01:18:41,610 And typically we analyze failure probability, 1307 01:18:41,610 --> 01:18:45,180 so what this is is this is going to be a tiny number. 1308 01:18:45,180 --> 01:18:51,350 So the failure is when you get fewer than c log n heads. 1309 01:18:51,350 --> 01:18:54,140 So the failure is when you get fewer than c log n heads. 1310 01:18:54,140 --> 01:18:59,585 And so that means that you're getting more than d log 1311 01:18:59,585 --> 01:19:04,330 n minus c log n tails as you're flipping this coin. 1312 01:19:04,330 --> 01:19:07,910 Fewer than c log n heads means you're getting at least d log 1313 01:19:07,910 --> 01:19:10,030 n minus c log n tails. 1314 01:19:10,030 --> 01:19:12,130 So that's why this is your r here. 1315 01:19:12,130 --> 01:19:14,360 And then when your r gets that large, 1316 01:19:14,360 --> 01:19:16,400 and you can play around with the d and the c 1317 01:19:16,400 --> 01:19:19,580 and choose d equals 8c, you realize 1318 01:19:19,580 --> 01:19:22,640 that this is going to be a minuscule probability. 1319 01:19:22,640 --> 01:19:27,790 And you can turn that around to a polynomial-- 1320 01:19:27,790 --> 01:19:29,180 again, a little bit of algebra. 1321 01:19:29,180 --> 01:19:32,990 But you can show this result on here 1322 01:19:32,990 --> 01:19:34,820 that says that the number of coin 1323 01:19:34,820 --> 01:19:37,745 flips until c log n heads is order log 1324 01:19:37,745 --> 01:19:39,940 n with high probability by appropriately 1325 01:19:39,940 --> 01:19:45,700 choosing the constant d to be some number times c. 1326 01:19:45,700 --> 01:19:47,250 So I'll let you do that algebra. 1327 01:19:47,250 --> 01:19:51,100 But this one last thing that-- we're not quite done. 1328 01:19:51,100 --> 01:19:53,880 So you thought we were done, but we're not quite done. 1329 01:19:53,880 --> 01:19:58,220 And why is it that we're not quite done? 1330 01:19:58,220 --> 01:20:02,370 Real quick question worth five Frisbees.
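The algebra deferred to the notes can be previewed numerically. This sketch rests on two assumptions that are not spelled out in the transcript: log is base 2, and the threshold d log n minus c log n is a count of tails, so the deviation above the mean m/2 that goes into the Chernoff bound is (d/2 minus c) log n. The function name is hypothetical.

```python
import math

def too_few_heads_bound(n: int, c: float, d: float) -> float:
    """Chernoff bound on getting fewer than c log n heads in d log n fair flips.

    Assumes log base 2 and d > 2c, so the tails threshold lies above the mean.
    """
    log_n = math.log2(n)
    m = d * log_n                     # total number of flips
    deviation = (d / 2 - c) * log_n   # tails in excess of the mean m / 2
    return math.exp(-2 * deviation ** 2 / m)

for n in (64, 1024, 2 ** 20):
    for c in (1, 2):
        bound = too_few_heads_bound(n, c, d=8 * c)
        assert bound <= n ** -c       # failure probability at most 1 / n^c
        print(n, c, bound)
```

With d equal to 8c the exponent works out to 9/4 times c log n, which is why the failure probability ends up at most 1 over n raised to c under these assumptions.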
1331 01:20:02,370 --> 01:20:04,620 Why is it that we're not quite done? 1332 01:20:04,620 --> 01:20:05,840 What did I say? 1333 01:20:05,840 --> 01:20:08,733 I have done event A and event B, right? 1334 01:20:08,733 --> 01:20:10,100 AUDIENCE: [INAUDIBLE]. 1335 01:20:10,100 --> 01:20:13,800 SRINIVAS DEVADAS: I haven't done the last thing which 1336 01:20:13,800 --> 01:20:22,070 is to show that probability of event A-- this 1337 01:20:22,070 --> 01:20:25,250 is with high probability happens-- 1338 01:20:25,250 --> 01:20:26,910 and I need to show that probability 1339 01:20:26,910 --> 01:20:33,385 of event A and event B happens-- or this 1340 01:20:33,385 --> 01:20:34,967 is with high probability. 1341 01:20:34,967 --> 01:20:36,550 Or I should just say event A and event 1342 01:20:36,550 --> 01:20:42,510 B happen with high probability. 1343 01:20:42,510 --> 01:20:43,470 And you can see that. 1344 01:20:43,470 --> 01:20:45,480 It turns out it's pretty straightforward, 1345 01:20:45,480 --> 01:20:47,440 but you got the gist of it. 1346 01:20:47,440 --> 01:20:49,020 Thanks for being so patient. 1347 01:20:49,020 --> 01:20:51,870 And there you go guys. 1348 01:20:51,870 --> 01:20:53,420 Woah.
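The last step is left to the notes. One standard way to combine the two events, offered here as a sketch rather than as the lecture's own derivation, is a union bound on the two failures. The exponents alpha and beta are placeholder names for the degrees obtained for event A and event B:

```latex
\Pr[\,\overline{A \cap B}\,]
  = \Pr[\,\bar{A} \cup \bar{B}\,]
  \le \Pr[\bar{A}] + \Pr[\bar{B}]
  \le \frac{1}{n^{\alpha}} + \frac{1}{n^{\beta}}
  \le \frac{2}{n^{\min(\alpha,\beta)}}
```

As with the warm-up lemma, this needs no independence between A and B, and the result is still an inverse polynomial in n.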