1 00:00:00,050 --> 00:00:01,770 The following content is provided 2 00:00:01,770 --> 00:00:04,010 under a Creative Commons license. 3 00:00:04,010 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue 4 00:00:06,860 --> 00:00:10,720 to offer high quality educational resources for free. 5 00:00:10,720 --> 00:00:13,340 To make a donation or view additional materials 6 00:00:13,340 --> 00:00:17,226 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,226 --> 00:00:17,851 at ocw.mit.edu. 8 00:00:21,570 --> 00:00:23,800 PROFESSOR: So we're going to do rolling hashes, then 9 00:00:23,800 --> 00:00:26,880 we're going to go a little bit over amortized analysis 10 00:00:26,880 --> 00:00:28,320 and if we have a lot of time left, 11 00:00:28,320 --> 00:00:32,590 we're going to talk about good and bad hash functions. 12 00:00:32,590 --> 00:00:36,690 So can someone remind me what's the point of rolling hashes? 13 00:00:36,690 --> 00:00:38,260 What's the problem? 14 00:00:38,260 --> 00:00:40,132 What are we trying to solve in lectures? 15 00:00:43,980 --> 00:00:46,055 Be brave. 16 00:00:46,055 --> 00:00:49,380 AUDIENCE: Gets faster, I think, because like-- 17 00:00:49,380 --> 00:00:51,130 PROFESSOR: So what are we trying to solve? 18 00:00:51,130 --> 00:00:53,047 You don't need to go ahead, tell me 19 00:00:53,047 --> 00:00:55,130 what's the big problem that we're trying to solve. 20 00:01:01,505 --> 00:01:02,630 AUDIENCE: I don't remember. 21 00:01:02,630 --> 00:01:04,510 PROFESSOR: OK, so let's go over that. 22 00:01:04,510 --> 00:01:08,200 So we have a big document, AKA a long string, 23 00:01:08,200 --> 00:01:10,970 and we're trying to find a smaller string inside it. 24 00:01:10,970 --> 00:01:13,210 And we're trying to do that efficiently. 25 00:01:13,210 --> 00:01:23,220 So say the big document is-- you might have seen this before. 26 00:01:23,220 --> 00:01:29,130 And we're trying to look for "the" here.
27 00:01:29,130 --> 00:01:30,780 How do I do that with rolling hashes? 28 00:01:33,910 --> 00:01:38,140 So the slow, naive solution is I get this "the" 29 00:01:38,140 --> 00:01:41,439 and then I overlap it with the beginning of the document, 30 00:01:41,439 --> 00:01:42,480 I do a string comparison. 31 00:01:42,480 --> 00:01:44,410 If it matches, I say that it's a match. 32 00:01:44,410 --> 00:01:46,860 If it's not, I overlap it here. 33 00:01:46,860 --> 00:01:48,860 String match, I overlap it here. 34 00:01:48,860 --> 00:01:50,829 String match, so on and so forth. 35 00:01:50,829 --> 00:01:53,370 The problem is this does a lot of string matching operations, 36 00:01:53,370 --> 00:01:57,290 and the string matching operation is how expensive? 37 00:01:57,290 --> 00:01:58,846 What's the running time? 38 00:01:58,846 --> 00:02:00,161 AUDIENCE: Order n. 39 00:02:00,161 --> 00:02:02,410 PROFESSOR: Order n, where n is the size of the string. 40 00:02:02,410 --> 00:02:06,690 So if we have a string of size k, say this is the key 41 00:02:06,690 --> 00:02:11,590 that we're looking for, and n is the document size, then 42 00:02:11,590 --> 00:02:15,210 this is going to be order n times k. 43 00:02:15,210 --> 00:02:17,660 We want to get to something better. 44 00:02:17,660 --> 00:02:19,380 So how do I do this with rolling hashes? 45 00:02:27,060 --> 00:02:30,445 AUDIENCE: We take the strings up, 46 00:02:30,445 --> 00:02:35,040 and you come up with a hash code for it. 47 00:02:35,040 --> 00:02:37,720 PROFESSOR: OK so we're going to hash this. 48 00:02:37,720 --> 00:02:39,950 And let's say this is the key hash. 49 00:02:39,950 --> 00:02:41,787 OK, very good. 50 00:02:41,787 --> 00:02:44,589 AUDIENCE: And then once you know that, then you'll 51 00:02:44,589 --> 00:02:46,730 need to compute the next letter hash, 52 00:02:46,730 --> 00:02:49,490 or just add it on to that pairing. 53 00:02:49,490 --> 00:02:51,120 PROFESSOR: OK, so next letter for-- 54 00:02:51,120 --> 00:02:51,745 AUDIENCE: Yeah.
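As a rough illustration of the slow approach just described (not code from the recitation), the overlap-and-compare loop might look like the sketch below. The name `naive_find` and the sample document string are assumptions made up for this example; the document written on the board is not captured in the transcript.

```python
def naive_find(document: str, key: str) -> int:
    """Return the index of the first occurrence of key in document, or -1."""
    n, k = len(document), len(key)
    for start in range(n - k + 1):
        # Each comparison of a k-character slice costs up to O(k),
        # and there are about n starting positions: O(n * k) overall.
        if document[start:start + k] == key:
            return start
    return -1


# Hypothetical sample document, chosen only for illustration.
sample = "the fox is in the hat"
print(naive_find(sample, "the"))  # 0
print(naive_find(sample, "cat"))  # -1
```

The point of the sketch is the cost, not the result: every starting position pays for a full string comparison.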
55 00:02:54,190 --> 00:03:01,038 So you compute the hash of the entire string, n, capital n-- 56 00:03:01,038 --> 00:03:02,520 PROFESSOR: Let's not do that. 57 00:03:02,520 --> 00:03:07,852 Let's compute the hash of the first k characters 58 00:03:07,852 --> 00:03:08,435 in the string. 59 00:03:08,435 --> 00:03:12,240 AUDIENCE: Are we separating them by space [? inside? ?] 60 00:03:12,240 --> 00:03:14,862 PROFESSOR: Yeah, so this is going to be a character. 61 00:03:14,862 --> 00:03:15,903 AUDIENCE: The space will? 62 00:03:15,903 --> 00:03:16,612 PROFESSOR: Sorry? 63 00:03:16,612 --> 00:03:18,278 AUDIENCE: The space will be a character? 64 00:03:18,278 --> 00:03:19,070 PROFESSOR: Yeah. 65 00:03:19,070 --> 00:03:20,736 So let's take the first three characters 66 00:03:20,736 --> 00:03:22,870 and compute the hash of that. 67 00:03:22,870 --> 00:03:26,530 And let's call this our sliding window. 68 00:03:26,530 --> 00:03:30,170 So we're going to say that window has "the" 69 00:03:30,170 --> 00:03:34,620 and then we're going to compute the hash of the characters 70 00:03:34,620 --> 00:03:36,570 in the window, and we're going to see 71 00:03:36,570 --> 00:03:40,069 that this matches the hash of the key. 72 00:03:40,069 --> 00:03:41,610 And then we'll figure out what to do. 73 00:03:44,360 --> 00:03:46,200 That aside, we're going to slide the window 74 00:03:46,200 --> 00:03:48,460 to the right by one character so take out 75 00:03:48,460 --> 00:03:50,860 T and put in the space. 76 00:03:50,860 --> 00:03:56,260 And now the window has HE space, we're 77 00:03:56,260 --> 00:03:58,560 going to compute the hash of the window, 78 00:03:58,560 --> 00:04:03,400 see that it's not the same as the hash of the key. 79 00:04:03,400 --> 00:04:04,860 What do we know in this case? 80 00:04:11,190 --> 00:04:13,750 Different hashes means-- 81 00:04:13,750 --> 00:04:15,274 AUDIENCE: Not the same string. 82 00:04:15,274 --> 00:04:16,940 PROFESSOR: For sure not the same string.
83 00:04:16,940 --> 00:04:21,660 So this is not "the". 84 00:04:21,660 --> 00:04:23,870 OK, now suppose I'm sliding my window 85 00:04:23,870 --> 00:04:26,260 so after this I will slide my window again, 86 00:04:26,260 --> 00:04:28,810 and I would have e space f. 87 00:04:28,810 --> 00:04:30,290 Right, so on and so forth. 88 00:04:30,290 --> 00:04:33,540 Suppose I'm happy sliding my window and then I get here 89 00:04:33,540 --> 00:04:39,570 and I have my window be IN space, and the hash of the window 90 00:04:39,570 --> 00:04:43,270 happens to match the hash of the key. 91 00:04:43,270 --> 00:04:45,550 So we're in the same situation as here. 92 00:04:45,550 --> 00:04:47,334 Now what? 93 00:04:47,334 --> 00:04:48,250 AUDIENCE: [INAUDIBLE]. 94 00:04:51,690 --> 00:04:52,860 PROFESSOR: Very good. 95 00:04:52,860 --> 00:04:57,120 We have to check if the string inside the window 96 00:04:57,120 --> 00:04:59,360 is the same as the string inside the key. 97 00:04:59,360 --> 00:05:02,360 And if it is, we found a match. 98 00:05:02,360 --> 00:05:03,955 If it isn't, we keep working. 99 00:05:03,955 --> 00:05:04,527 All right? 100 00:05:04,527 --> 00:05:06,110 And we have to do the same thing here. 101 00:05:12,340 --> 00:05:16,985 So this is our string matching algorithm. 102 00:05:16,985 --> 00:05:21,815 AUDIENCE: Can we somehow make sure that we make a hash 103 00:05:21,815 --> 00:05:26,650 function such that it will never [INAUDIBLE]-- 104 00:05:26,650 --> 00:05:27,910 PROFESSOR: Excellent question. 105 00:05:27,910 --> 00:05:30,017 Thank you, I like that. 106 00:05:30,017 --> 00:05:31,600 Can we make a hash function so that we 107 00:05:31,600 --> 00:05:34,110 don't have any false positives, right? 108 00:05:34,110 --> 00:05:35,320 Let's see. 109 00:05:35,320 --> 00:05:36,640 How do hash functions work? 110 00:05:36,640 --> 00:05:38,299 What's the argument to a hash function 111 00:05:38,299 --> 00:05:39,465 and what's the return value?
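The window logic described above (skip when the hashes differ, compare the strings when they match) can be sketched as follows. This is an illustration under assumptions, not the recitation's code: `window_hash` is a made-up toy hash that rehashes each window from scratch, so this version is not yet faster than the naive search. The constant-time update comes later in the discussion.

```python
def window_hash(s: str, p: int = 23) -> int:
    """Toy hash (an assumption): read characters as base-256 digits, mod p."""
    h = 0
    for ch in s:
        h = (h * 256 + ord(ch)) % p
    return h


def find_with_hashes(document: str, key: str) -> int:
    """Return the index of the first match of key in document, or -1."""
    k = len(key)
    key_hash = window_hash(key)
    for start in range(len(document) - k + 1):
        window = document[start:start + k]
        if window_hash(window) != key_hash:
            continue  # different hashes: for sure not the same string
        if window == key:
            return start  # same hash and same string: a real match
        # same hash but different string: a false positive, keep sliding
    return -1
```

The string comparison only runs when the hashes agree, which is what makes the hash a cheap filter.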
112 00:05:43,586 --> 00:05:45,044 AUDIENCE: The argument is something 113 00:05:45,044 --> 00:05:45,992 that you want to hash. 114 00:05:49,859 --> 00:05:52,400 PROFESSOR: So in this case we're working with three character 115 00:05:52,400 --> 00:05:52,899 strings. 116 00:05:52,899 --> 00:05:56,270 But let's say we're looking for a one megabyte 117 00:05:56,270 --> 00:05:58,970 string inside a one gigabyte string. 118 00:05:58,970 --> 00:06:02,430 Say we're a music company and we're 119 00:06:02,430 --> 00:06:07,370 looking for our mp3 file inside the big files 120 00:06:07,370 --> 00:06:09,550 off a pirate server or something. 121 00:06:09,550 --> 00:06:14,219 So this is 1 million character strings, 122 00:06:14,219 --> 00:06:15,510 because that's the window size. 123 00:06:19,370 --> 00:06:20,910 And it's going to return what? 124 00:06:25,830 --> 00:06:30,500 What do hash functions return for them to be useful? 125 00:06:30,500 --> 00:06:31,070 Integers. 126 00:06:31,070 --> 00:06:32,580 Nice small integers, right? 127 00:06:32,580 --> 00:06:35,500 Ideally, the integer would fit in a word size, where 128 00:06:35,500 --> 00:06:39,650 the word is the register size on our computer. 129 00:06:39,650 --> 00:06:41,700 What are popular word sizes? 130 00:06:41,700 --> 00:06:42,796 Does anyone know? 131 00:06:42,796 --> 00:06:43,712 AUDIENCE: [INAUDIBLE]. 132 00:06:45,897 --> 00:06:46,730 PROFESSOR: Excellent. 133 00:06:46,730 --> 00:06:55,350 32-bits, 64-bits integers. 134 00:06:55,350 --> 00:06:57,590 OK, so what's the universe size for this function? 135 00:06:57,590 --> 00:06:59,972 How many one million character strings are there? 136 00:07:03,534 --> 00:07:05,200 AUDIENCE: How many characters are there? 137 00:07:05,200 --> 00:07:05,760 PROFESSOR: Excellent. 138 00:07:05,760 --> 00:07:07,500 Let's say we're old school and we're doing ASCII. 139 00:07:07,500 --> 00:07:09,455 We don't care about the rest of the world.
140 00:07:09,455 --> 00:07:12,640 AUDIENCE: 256 characters? 141 00:07:12,640 --> 00:07:13,452 PROFESSOR: OK. 142 00:07:13,452 --> 00:07:14,993 AUDIENCE: To the one millionth power. 143 00:07:20,474 --> 00:07:21,140 PROFESSOR: Cool. 144 00:07:21,140 --> 00:07:23,370 Let's say we're working on an old school computer, 145 00:07:23,370 --> 00:07:25,620 since we're old school and we have a 32-bit word size. 146 00:07:25,620 --> 00:07:28,695 How many possible values for the hash function? 147 00:07:28,695 --> 00:07:30,180 AUDIENCE: 2 to the 32. 148 00:07:38,100 --> 00:07:39,940 PROFESSOR: The [? other ?] is bigger. 149 00:07:39,940 --> 00:07:41,440 You're messing with me, right? 150 00:07:41,440 --> 00:07:45,540 OK so this is 2 to the 8th. 151 00:07:45,540 --> 00:07:47,115 AUDIENCE: 2 to the 8 million, right? 152 00:07:47,115 --> 00:07:47,740 PROFESSOR: Yup. 153 00:07:47,740 --> 00:07:50,700 So this is a lot bigger than this. 154 00:07:50,700 --> 00:07:54,060 So if we want to make a hash function that gives us 155 00:07:54,060 --> 00:07:56,210 no false positives, then we'd have 156 00:07:56,210 --> 00:07:59,540 to be able to-- if we have the universe of possible inputs 157 00:07:59,540 --> 00:08:01,550 and the universe of possible outputs, 158 00:08:01,550 --> 00:08:04,910 draw a line from every input to a different output. 159 00:08:04,910 --> 00:08:09,330 But if we have-- 2 to the 32 by the way is about four billion. 160 00:08:09,330 --> 00:08:14,897 If we have four billion outputs and a lot of inputs 161 00:08:14,897 --> 00:08:16,480 as we draw our lines, we're eventually 162 00:08:16,480 --> 00:08:18,021 going to run out of outputs and we're 163 00:08:18,021 --> 00:08:20,690 going to have to use the same output again and again. 164 00:08:20,690 --> 00:08:23,460 So for a hash function, pretty much always 165 00:08:23,460 --> 00:08:26,170 the universe size is bigger than the output size. 166 00:08:26,170 --> 00:08:29,490 So hash functions will always have collisions. 
167 00:08:29,490 --> 00:08:31,610 So a collision for a hash function 168 00:08:31,610 --> 00:08:40,220 is two inputs, x1, x2, such that x1 is not x2, but h of x1 169 00:08:40,220 --> 00:08:41,620 equals h of x2. 170 00:08:46,866 --> 00:08:47,990 So this will always happen. 171 00:08:47,990 --> 00:08:50,160 There's no way around it. 172 00:08:50,160 --> 00:08:53,260 What we're hoping for is that the collisions 173 00:08:53,260 --> 00:08:55,330 aren't something dumb. 174 00:08:55,330 --> 00:08:57,790 So we're hoping that the hash function acts 175 00:08:57,790 --> 00:09:02,370 reasonably randomly and we talked about ideal hash 176 00:09:02,370 --> 00:09:06,940 functions that would pretty much look like they would get 177 00:09:06,940 --> 00:09:10,209 a random output for every input. 178 00:09:10,209 --> 00:09:12,250 And we're not going to worry too much about that. 179 00:09:12,250 --> 00:09:15,230 What matters is that as long as the hash function is reasonably 180 00:09:15,230 --> 00:09:19,470 good, we're not going to have too many false positives. 181 00:09:19,470 --> 00:09:28,080 So say the output set is O, so O is 2 to the 32. 182 00:09:28,080 --> 00:09:31,830 Then we're hoping to have false positives about once 183 00:09:31,830 --> 00:09:37,480 every O times. 184 00:09:37,480 --> 00:09:40,390 So 1 out of 2 to the 32 false positives. 185 00:09:43,360 --> 00:09:48,720 So what's the running time for when you slide the window 186 00:09:48,720 --> 00:09:50,810 and we're doing this logic here. 187 00:09:50,810 --> 00:09:55,257 What's the running time if the hashes aren't the same? 188 00:09:55,257 --> 00:09:57,590 AUDIENCE: What's the running time of the hash function-- 189 00:09:57,590 --> 00:09:59,822 PROFESSOR: Of the whole matching algorithm. 190 00:09:59,822 --> 00:10:02,683 AUDIENCE: No, no no, of the hash function itself. 191 00:10:02,683 --> 00:10:05,930 Can we make any assumptions about that? 192 00:10:05,930 --> 00:10:07,270 PROFESSOR: Very good.
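To make the counting argument concrete, here is a small sketch that hunts for a collision by brute force. The hash `toy_hash`, the modulus 23, and the three-letter lowercase inputs are all assumptions chosen for illustration. Because there are 26 cubed inputs and only 23 possible outputs, the search has to succeed within the first 24 strings it tries.

```python
from itertools import product


def toy_hash(s: str, p: int = 23) -> int:
    """Toy hash (an assumption): characters as base-256 digits, mod p."""
    h = 0
    for ch in s:
        h = (h * 256 + ord(ch)) % p
    return h


def find_collision(p: int = 23):
    """Return two different strings x1, x2 with toy_hash(x1) == toy_hash(x2)."""
    seen = {}  # hash value -> first string that produced it
    for letters in product("abcdefghijklmnopqrstuvwxyz", repeat=3):
        s = "".join(letters)
        h = toy_hash(s, p)
        if h in seen:
            return seen[h], s  # more inputs than outputs forces a repeat
        seen[h] = s
    return None


x1, x2 = find_collision()
print(x1, x2, toy_hash(x1), toy_hash(x2))
```

The same argument applies with 2 to the 32 outputs; it just takes more inputs before a repeat is forced.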
193 00:10:07,270 --> 00:10:10,170 What's the running time of the hash function? 194 00:10:10,170 --> 00:10:11,861 So we're going to have to implement-- 195 00:10:11,861 --> 00:10:13,610 if we implement the hash function naively, 196 00:10:13,610 --> 00:10:17,430 then the running time for hashing a k-character string 197 00:10:17,430 --> 00:10:19,330 is order k. 198 00:10:19,330 --> 00:10:22,540 But we're going to come up with a magic way of doing it 199 00:10:22,540 --> 00:10:24,740 in order one time. 200 00:10:24,740 --> 00:10:28,420 So assume hashing is order 1. 201 00:10:28,420 --> 00:10:30,500 What's the running time for everything else? 202 00:10:30,500 --> 00:10:34,472 So if the hashes don't match, we know it's not a candidate. 203 00:10:34,472 --> 00:10:35,680 So we're going to keep going. 204 00:10:35,680 --> 00:10:39,400 So this is order 1. 205 00:10:39,400 --> 00:10:41,820 What if the hashes do match? 206 00:10:41,820 --> 00:10:44,700 AUDIENCE: [INAUDIBLE] characters. 207 00:10:44,700 --> 00:10:45,450 PROFESSOR: Order-- 208 00:10:45,450 --> 00:10:47,334 AUDIENCE: I mean, but it depends on how many ones match, 209 00:10:47,334 --> 00:10:48,242 but it will be-- 210 00:10:48,242 --> 00:10:49,700 PROFESSOR: So for one match, what's 211 00:10:49,700 --> 00:10:52,118 the running time for one match? 212 00:10:52,118 --> 00:10:54,980 AUDIENCE: Order k-- 213 00:10:54,980 --> 00:10:56,840 PROFESSOR: Order k. 214 00:10:56,840 --> 00:10:57,650 Excellent. 215 00:10:57,650 --> 00:11:03,090 So the total running time is the number of matches times 216 00:11:03,090 --> 00:11:07,430 order k plus the number of non matches times order 1. 217 00:11:07,430 --> 00:11:10,120 So as long as the number of false positives 218 00:11:10,120 --> 00:11:14,240 here is really tiny, the math is going 219 00:11:14,240 --> 00:11:17,055 to come out to be roughly order 1 per character. 220 00:11:20,589 --> 00:11:22,884 AUDIENCE: So the whole thing is order n.
221 00:11:22,884 --> 00:11:24,800 PROFESSOR: Everything should be order n, yeah, 222 00:11:24,800 --> 00:11:26,008 that's what we're hoping for. 223 00:11:32,070 --> 00:11:34,866 OK so let's talk about the magic because you asked me 224 00:11:34,866 --> 00:11:36,740 what's the running time for the hash function 225 00:11:36,740 --> 00:11:38,650 and this is the interesting part. 226 00:11:38,650 --> 00:11:41,110 How do I get to compute these hashes 227 00:11:41,110 --> 00:11:43,600 in order 1 instead of order k? 228 00:11:43,600 --> 00:11:47,120 We have this data structure called rolling hash. 229 00:11:52,680 --> 00:11:54,860 So rolling hash-- does anyone remember 230 00:11:54,860 --> 00:11:56,730 from lecture what it is? 231 00:11:56,730 --> 00:11:58,730 AUDIENCE: Isn't that what we're doing right now? 232 00:12:01,560 --> 00:12:04,390 PROFESSOR: So this is a sliding window. 233 00:12:04,390 --> 00:12:08,450 And the data structure will compute fast hashes 234 00:12:08,450 --> 00:12:10,700 for the strings inside the sliding window. 235 00:12:10,700 --> 00:12:12,910 So how does it work? 236 00:12:12,910 --> 00:12:16,390 I mean not how does it work functionally, what 237 00:12:16,390 --> 00:12:19,855 are the operations for a rolling hash, let's try that. 238 00:12:19,855 --> 00:12:23,547 AUDIENCE: Oh [INAUDIBLE]. 239 00:12:23,547 --> 00:12:25,130 PROFESSOR: OK, so we have two updates. 240 00:12:25,130 --> 00:12:26,250 One of them is pop. 241 00:12:26,250 --> 00:12:28,730 For some reason, our notes call it skip, 242 00:12:28,730 --> 00:12:30,900 but I like pop better, so I'm going 243 00:12:30,900 --> 00:12:34,360 to write skip and think pop. 244 00:12:34,360 --> 00:12:35,856 And the other one is? 245 00:12:35,856 --> 00:12:37,720 AUDIENCE: Always [INAUDIBLE]. 246 00:12:41,460 --> 00:12:45,200 PROFESSOR: Append a new character, OK? 247 00:12:45,200 --> 00:12:45,820 Cool. 248 00:12:45,820 --> 00:12:46,690 So these are the updates.
249 00:12:46,690 --> 00:12:48,273 Now what's the point of those updates? 250 00:12:48,273 --> 00:12:50,762 What's the query for a rolling hash? 251 00:12:50,762 --> 00:12:52,730 AUDIENCE: [INAUDIBLE]. 252 00:12:52,730 --> 00:12:54,910 You just grab the next character, 253 00:12:54,910 --> 00:12:57,580 append that, and then skip [INAUDIBLE]. 254 00:12:57,580 --> 00:13:00,840 PROFESSOR: OK, so this is how I update the rolling hash 255 00:13:00,840 --> 00:13:04,390 to reflect the contents of my sliding window. 256 00:13:04,390 --> 00:13:05,480 And what do I do after that? 257 00:13:05,480 --> 00:13:08,600 What's the reason for that? 258 00:13:08,600 --> 00:13:10,434 AUDIENCE: You skip your [INAUDIBLE]. 259 00:13:10,434 --> 00:13:12,100 PROFESSOR: So don't think too hard, it's 260 00:13:12,100 --> 00:13:13,790 a really easy question. 261 00:13:13,790 --> 00:13:15,890 I moved a sliding window here. 262 00:13:15,890 --> 00:13:17,599 What do I want to get? 263 00:13:17,599 --> 00:13:19,890 AUDIENCE: You want to get the hash of those characters. 264 00:13:19,890 --> 00:13:23,810 PROFESSOR: The hash of those characters, very good. 265 00:13:23,810 --> 00:13:25,630 So this is the query. 266 00:13:25,630 --> 00:13:29,410 So a rolling hash has a sequence of characters in it, right? 267 00:13:29,410 --> 00:13:30,890 Say t, h, e. 268 00:13:35,210 --> 00:13:42,250 And it allows us to append the character and pop a character. 269 00:13:42,250 --> 00:13:46,580 Append a character, pop a character. 270 00:13:46,580 --> 00:13:48,390 And then it promises that it's going 271 00:13:48,390 --> 00:13:50,920 to compute the hash of whatever's inside 272 00:13:50,920 --> 00:13:52,930 the rolling hash really fast. 273 00:13:56,370 --> 00:13:58,475 Append goes here, skip goes here. 274 00:14:01,500 --> 00:14:02,912 How fast do these operations need 275 00:14:02,912 --> 00:14:04,620 to be for my algorithm to work correctly? 276 00:14:08,612 --> 00:14:10,874 AUDIENCE: Order 1.
277 00:14:10,874 --> 00:14:12,540 PROFESSOR: I promised you that computing 278 00:14:12,540 --> 00:14:14,290 hash there is order 1, right? 279 00:14:14,290 --> 00:14:19,590 So I have to-- OK. 280 00:14:19,590 --> 00:14:21,880 Let's see how we're going to make this happen. 281 00:14:21,880 --> 00:14:22,910 So these are letters. 282 00:14:22,910 --> 00:14:24,870 These make sense when we're trying 283 00:14:24,870 --> 00:14:27,420 to understand string matching. 284 00:14:27,420 --> 00:14:30,360 But now we're going to switch to numbers, because after all, 285 00:14:30,360 --> 00:14:32,250 strings are sequences of characters, 286 00:14:32,250 --> 00:14:33,820 and characters are numbers. 287 00:14:33,820 --> 00:14:36,540 And because I know how to do math on numbers, 288 00:14:36,540 --> 00:14:38,760 I don't know how to do math on characters. 289 00:14:38,760 --> 00:14:41,640 So let's use this list. 290 00:14:41,640 --> 00:14:44,080 Let's say that instead of having numbers in base 256 291 00:14:44,080 --> 00:14:45,500 which is [INAUDIBLE], we're going 292 00:14:45,500 --> 00:14:49,020 to have numbers in base 100, because it's really easy to do 293 00:14:49,020 --> 00:14:53,050 operations in base 100 on paper. 294 00:14:53,050 --> 00:15:09,500 So 3, 14, 15, 92, 65, 35, 89, 79, 31. 295 00:15:09,500 --> 00:15:12,270 So these are all base 100 numbers. 296 00:15:12,270 --> 00:15:16,280 And say my rolling window is size 5. 297 00:15:16,280 --> 00:15:17,825 One, two, three, four, five. 298 00:15:21,980 --> 00:15:27,010 So I want to come up with a way so that I have the hash of this 299 00:15:27,010 --> 00:15:32,690 and then when I slide my window, I will get the hash of this. 300 00:15:35,280 --> 00:15:39,370 What hashing method do we use for rolling hashes? 301 00:15:39,370 --> 00:15:41,134 Does anyone remember? 302 00:15:41,134 --> 00:15:43,022 AUDIENCE: [INAUDIBLE]. 303 00:15:43,022 --> 00:15:47,590 PROFESSOR: Mod, you said-- I heard mod something.
304 00:15:47,590 --> 00:15:49,060 AUDIENCE: Yeah, that's what I said. 305 00:15:49,060 --> 00:15:51,020 PROFESSOR: OK, so? 306 00:15:51,020 --> 00:15:51,970 So? 307 00:15:51,970 --> 00:15:55,740 So the hash is? 308 00:15:55,740 --> 00:15:58,010 The hash of a key is? 309 00:15:58,010 --> 00:16:00,304 AUDIENCE: It's k mod m or m k. 310 00:16:00,304 --> 00:16:01,060 [INAUDIBLE] 311 00:16:01,060 --> 00:16:01,643 PROFESSOR: OK. 312 00:16:09,240 --> 00:16:10,760 I'm going to say k mod something, 313 00:16:10,760 --> 00:16:12,720 and I'm going to say that something 314 00:16:12,720 --> 00:16:17,650 has to be a prime number and we'll see why in a bit. 315 00:16:17,650 --> 00:16:19,500 Let's say our prime number is 23. 316 00:16:22,850 --> 00:16:27,540 So let's compute the value of the hash for the sliding window 317 00:16:27,540 --> 00:16:29,290 of the first sliding window and then we'll 318 00:16:29,290 --> 00:16:32,880 compute the hash for the second sliding window. 319 00:16:32,880 --> 00:16:35,900 Oh, there is someone at the computer, sweet. 320 00:16:35,900 --> 00:16:52,354 314159265 modulo 23 is how much? 321 00:16:52,354 --> 00:16:54,020 OK, while you're doing that, can someone 322 00:16:54,020 --> 00:16:55,610 tell me what computation will he need 323 00:16:55,610 --> 00:16:59,199 to do for the second sliding window? 324 00:16:59,199 --> 00:17:03,840 AUDIENCE: 1519265359. 325 00:17:03,840 --> 00:17:07,920 PROFESSOR: 159265359. 326 00:17:07,920 --> 00:17:09,405 AUDIENCE: That's a third sign. 327 00:17:09,405 --> 00:17:11,385 AUDIENCE: There's a 1-4 before that. 328 00:17:11,385 --> 00:17:12,870 AUDIENCE: The first one is 11. 329 00:17:16,349 --> 00:17:18,510 PROFESSOR: OK. 330 00:17:18,510 --> 00:17:19,593 And what's the second one? 331 00:17:19,593 --> 00:17:25,530 AUDIENCE: [INAUDIBLE] adding-- 332 00:17:25,530 --> 00:17:30,069 PROFESSOR: 1415926535 modulo 23. 333 00:17:30,069 --> 00:17:30,569 AUDIENCE: 5.
334 00:17:34,055 --> 00:17:36,400 PROFESSOR: I heard a 5 and a 7. 335 00:17:36,400 --> 00:17:37,115 OK. 336 00:17:37,115 --> 00:17:38,240 AUDIENCE: Hold on, hold on. 337 00:17:41,615 --> 00:17:43,490 PROFESSOR: I'll take the average of those two 338 00:17:43,490 --> 00:17:44,862 and we can move on, right? 339 00:17:44,862 --> 00:17:49,290 AUDIENCE: Three five, and arguably 6. 340 00:17:49,290 --> 00:17:53,140 PROFESSOR: All right, so let's implement 341 00:17:53,140 --> 00:17:54,405 an operation called slide. 342 00:17:58,860 --> 00:18:03,060 And slide will take the new number that I'm sliding in. 343 00:18:05,630 --> 00:18:08,010 And the old number that I'm sliding out, 344 00:18:08,010 --> 00:18:11,260 making my life really easy. 345 00:18:11,260 --> 00:18:15,625 So in this case, the numbers would be-- 346 00:18:15,625 --> 00:18:18,880 AUDIENCE: The new one is 35. 347 00:18:18,880 --> 00:18:20,695 PROFESSOR: And the old one? 348 00:18:20,695 --> 00:18:21,195 AUDIENCE: 3. 349 00:18:26,125 --> 00:18:27,000 PROFESSOR: Excellent. 350 00:18:27,000 --> 00:18:34,940 And I want to have an internal state called hash that has 11 351 00:18:34,940 --> 00:18:40,640 and I want to get 6 after I'm done running slide. 352 00:18:40,640 --> 00:18:43,260 This is still too hard for me, so before we figure out hash, 353 00:18:43,260 --> 00:18:47,320 let's say that we have an internal state called n. 354 00:18:47,320 --> 00:18:51,820 And n is this big number here. 355 00:18:51,820 --> 00:18:54,970 So I want to get from this big number to this big number. 356 00:18:54,970 --> 00:18:55,960 What am I going to do? 357 00:18:58,852 --> 00:19:06,580 AUDIENCE: Mod 3,000 [INAUDIBLE]. 358 00:19:06,580 --> 00:19:16,140 PROFESSOR: OK, so you want to take the big number 159265 359 00:19:16,140 --> 00:19:19,660 and mod it. 
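The two window hashes being worked out here can be checked with a few lines of Python. This is a sketch for verification only; the helper name `window_value` is an assumption, and the digit list uses 65 as the fifth entry so that the first window reads 314159265, as in the numbers spoken above.

```python
BASE = 100
P = 23
digits = [3, 14, 15, 92, 65, 35, 89, 79, 31]


def window_value(ds, base=BASE):
    """Read a list of base-100 digits as one big integer."""
    n = 0
    for d in ds:
        n = n * base + d
    return n


first = window_value(digits[0:5])   # 314159265
second = window_value(digits[1:6])  # 1415926535
print(first % P, second % P)        # 11 6
```

These are the values the slide operation has to reproduce: hash 11 before the slide and 6 after it.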
360 00:19:19,660 --> 00:19:20,535 AUDIENCE: [INAUDIBLE] 361 00:19:27,206 --> 00:19:29,080 PROFESSOR: So if I mod it by a big number, 362 00:19:29,080 --> 00:19:30,570 that's going to be too slow. 363 00:19:30,570 --> 00:19:31,860 So I can't mod it. 364 00:19:31,860 --> 00:19:34,645 AUDIENCE: Can't you just divide it? 365 00:19:34,645 --> 00:19:36,020 PROFESSOR: Division is also slow, 366 00:19:36,020 --> 00:19:37,103 I don't like the division. 367 00:19:37,103 --> 00:19:39,670 I like subtraction, someone said subtraction. 368 00:19:39,670 --> 00:19:44,320 So what I want to do is I want to get from here to a number 369 00:19:44,320 --> 00:19:45,930 that-- to this number, right? 370 00:19:45,930 --> 00:19:47,520 So I want to get rid of the 3 and I 371 00:19:47,520 --> 00:19:50,410 want to add 35 at the end. 372 00:19:50,410 --> 00:19:53,600 To get rid of the 3, what do I subtract? 373 00:19:53,600 --> 00:19:54,947 AUDIENCE: 3 with a bunch of 0s. 374 00:19:54,947 --> 00:19:56,280 PROFESSOR: 3 with a bunch of 0s. 375 00:19:56,280 --> 00:19:56,780 Excellent. 376 00:19:56,780 --> 00:19:59,880 1, 2, 3, 4, 5, 6, 7, 8. 377 00:19:59,880 --> 00:20:01,565 How many of them are there? 378 00:20:01,565 --> 00:20:02,990 AUDIENCE: 8. 379 00:20:02,990 --> 00:20:05,802 PROFESSOR: OK, how many digits in base 100? 380 00:20:05,802 --> 00:20:09,310 AUDIENCE: Oh, 2 right? 381 00:20:09,310 --> 00:20:10,840 AUDIENCE: 4. 382 00:20:10,840 --> 00:20:12,550 AUDIENCE: Oh, oh. 383 00:20:12,550 --> 00:20:14,502 PROFESSOR: 4. 384 00:20:14,502 --> 00:20:16,435 So base 100, so two numbers-- 385 00:20:16,435 --> 00:20:18,971 AUDIENCE: One base 100 number is two digits. 386 00:20:18,971 --> 00:20:19,470 Yep. 387 00:20:19,470 --> 00:20:20,820 PROFESSOR: So 8. 388 00:20:20,820 --> 00:20:22,810 yeah, OK, 4. 389 00:20:22,810 --> 00:20:25,000 Cool. 390 00:20:25,000 --> 00:20:29,940 So let's try to write this in a more abstract way.
391 00:20:29,940 --> 00:20:38,160 So n is the old n minus old, right, 392 00:20:38,160 --> 00:20:43,920 so that 3 is old times what do I have to multiply it by 393 00:20:43,920 --> 00:20:48,844 to get all those zeros? 394 00:20:48,844 --> 00:20:52,620 AUDIENCE: k minus 1? 395 00:20:52,620 --> 00:20:55,710 [INAUDIBLE] to that base whatever. 396 00:20:55,710 --> 00:21:01,980 PROFESSOR: OK, so-- base to the size to something. 397 00:21:01,980 --> 00:21:04,420 K minus 1. 398 00:21:04,420 --> 00:21:05,940 So K is 5 in this case, right? 399 00:21:05,940 --> 00:21:07,750 My window is 5. 400 00:21:07,750 --> 00:21:10,740 And I see a 4 there, so I'm going to add the minus 1 401 00:21:10,740 --> 00:21:13,620 just because that's what I need to do. 402 00:21:13,620 --> 00:21:19,030 OK, so then I get 14159265. 403 00:21:19,030 --> 00:21:23,488 What do I do to tack on a 35 at the end? 404 00:21:23,488 --> 00:21:25,204 AUDIENCE: [INAUDIBLE] 35. 405 00:21:25,204 --> 00:21:26,870 PROFESSOR: OK, times the base, so that's 406 00:21:26,870 --> 00:21:28,720 going to give me the zeroes. 407 00:21:28,720 --> 00:21:31,140 And then this is a minus here. 408 00:21:31,140 --> 00:21:33,910 And then I'm going to add 35. 409 00:21:33,910 --> 00:21:35,570 Right? 410 00:21:35,570 --> 00:21:40,420 1415926535. 411 00:21:40,420 --> 00:21:42,320 Look, it's right. 412 00:21:42,320 --> 00:21:45,770 So what do I write here? 413 00:21:45,770 --> 00:21:48,245 AUDIENCE: The base first. 414 00:21:48,245 --> 00:21:49,235 PROFESSOR: Good point. 415 00:21:56,011 --> 00:21:56,510 OK. 416 00:21:59,317 --> 00:22:01,650 Let me play with this a little bit before we go further. 417 00:22:04,317 --> 00:22:05,900 I'm going to distribute the base here. 418 00:22:05,900 --> 00:22:12,310 So this is n times base minus old times 419 00:22:12,310 --> 00:22:17,340 base to the k plus new. 420 00:22:17,340 --> 00:22:20,910 And let's rename k to size, to be the size of the window, 421 00:22:20,910 --> 00:22:23,850 I don't like k.
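The big-number version of the slide worked out above can be checked directly. This is a minimal sketch; the function names `slide_big` and `slide_big_distributed` are assumptions for illustration. It shows both the form with the leading digit removed first and the form with the base distributed.

```python
def slide_big(n: int, old: int, new: int, base: int = 100, size: int = 5) -> int:
    """Remove the leading digit, shift left one digit, then add the new digit."""
    return (n - old * base ** (size - 1)) * base + new


def slide_big_distributed(n: int, old: int, new: int,
                          base: int = 100, size: int = 5) -> int:
    """Same update with the base distributed: n*base - old*base^size + new."""
    return n * base - old * base ** size + new


print(slide_big(314159265, 3, 35))              # 1415926535
print(slide_big_distributed(314159265, 3, 35))  # 1415926535
```

Both forms still do arithmetic on numbers with `size` base-100 digits, which is the cost the modular version is meant to remove.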
422 00:22:23,850 --> 00:22:26,060 And I'm renaming it because later on we're 423 00:22:26,060 --> 00:22:30,560 going to break our slide into append and skip 424 00:22:30,560 --> 00:22:32,400 and the size won't be constant anymore. 425 00:22:35,550 --> 00:22:37,750 OK so does this make sense? 426 00:22:37,750 --> 00:22:38,360 It's all math. 427 00:22:38,360 --> 00:22:41,330 So this math here becomes abstract math here. 428 00:22:41,330 --> 00:22:42,380 But nothing else changes. 429 00:22:47,610 --> 00:22:51,380 OK, so now I want to get hash-- I 430 00:22:51,380 --> 00:22:55,960 want to get hash out of n, how do I do that? 431 00:22:55,960 --> 00:22:58,280 AUDIENCE: Mod 23. 432 00:22:58,280 --> 00:22:59,880 PROFESSOR: Mod 23, very good. 433 00:22:59,880 --> 00:23:04,130 So in a general way, I would say mod p. 434 00:23:07,040 --> 00:23:18,450 OK so hash is n times base minus old times 435 00:23:18,450 --> 00:23:26,227 base to the size plus new mod p. 436 00:23:26,227 --> 00:23:27,310 Now let's distribute this. 437 00:23:27,310 --> 00:23:29,490 I know I can distribute modulo across addition 438 00:23:29,490 --> 00:23:42,790 and subtraction, so I have n mod p times base minus old times 439 00:23:42,790 --> 00:23:48,550 base to the size mod p plus new. 440 00:23:48,550 --> 00:23:50,537 And everything still has to be mod p. 441 00:23:53,670 --> 00:24:00,268 So can someone tell me where did I add the mod p? 442 00:24:03,200 --> 00:24:06,005 Why did I put it here and here? 443 00:24:10,927 --> 00:24:14,380 AUDIENCE: [INAUDIBLE] the original? 444 00:24:14,380 --> 00:24:17,425 PROFESSOR: OK, n mod p is hash, let's do that. 445 00:24:23,150 --> 00:24:27,596 So what's true about both n and base to the size? 446 00:24:27,596 --> 00:24:29,370 AUDIENCE: Constant. 447 00:24:29,370 --> 00:24:31,952 PROFESSOR: Constant? 448 00:24:31,952 --> 00:24:34,850 AUDIENCE: Like can you please repeat it?
449 00:24:34,850 --> 00:24:37,442 AUDIENCE: You could [INAUDIBLE] base to the size but you 450 00:24:37,442 --> 00:24:39,600 can't [INAUDIBLE] hash, I mean [INAUDIBLE]-- 451 00:24:39,600 --> 00:24:39,770 PROFESSOR: Hm. 452 00:24:39,770 --> 00:24:41,853 OK, so keep this in mind that we can compute this, 453 00:24:41,853 --> 00:24:43,730 because we're going to want to do that later. 454 00:24:43,730 --> 00:24:45,990 But what I had in mind is the opposite of constant, 455 00:24:45,990 --> 00:24:48,130 because n is huge. 456 00:24:48,130 --> 00:24:48,890 Right? 457 00:24:48,890 --> 00:24:53,570 And base to the size is also huge, right? 458 00:24:53,570 --> 00:24:57,080 N is this number. 459 00:24:57,080 --> 00:24:59,670 Base to the size is this number here. 460 00:24:59,670 --> 00:25:02,910 1 followed by this many zeros, so these numbers are big. 461 00:25:02,910 --> 00:25:05,060 All the other numbers are small. 462 00:25:05,060 --> 00:25:10,285 Base is small, old is small, new is small, p is small. 463 00:25:10,285 --> 00:25:12,410 So I want to get rid of the big numbers, 464 00:25:12,410 --> 00:25:16,370 because math with big numbers is slow. 465 00:25:16,370 --> 00:25:18,270 So unless I get rid of the big numbers, 466 00:25:18,270 --> 00:25:21,390 I'm not going to get to an order 1 operation. 467 00:25:21,390 --> 00:25:24,140 So we already got rid of this one because it's hash 468 00:25:24,140 --> 00:25:26,515 and how do I get rid of this one? 469 00:25:26,515 --> 00:25:27,390 AUDIENCE: [INAUDIBLE] 470 00:25:34,242 --> 00:25:36,720 AUDIENCE: There's some 6.042 algorithm 471 00:25:36,720 --> 00:25:37,741 that does that quickly. 472 00:25:37,741 --> 00:25:40,157 AUDIENCE: Well, we definitely just went over this in class 473 00:25:40,157 --> 00:25:41,465 today. 474 00:25:41,465 --> 00:25:45,706 AUDIENCE: Which is why you needed the prime number, right? 475 00:25:45,706 --> 00:25:46,580 PROFESSOR: Not quite.
476 00:25:46,580 --> 00:25:48,420 There is an algorithm that does it quickly. 477 00:25:48,420 --> 00:25:50,810 That algorithm is called repeated squaring 478 00:25:50,810 --> 00:25:55,640 and the quickest-- wait, I'm not done, I promise I'm not done. 479 00:25:55,640 --> 00:25:58,110 So the quickest that this guy can 480 00:25:58,110 --> 00:26:05,520 run if you do everything right is order of [? log ?] size. 481 00:26:05,520 --> 00:26:07,970 If the window size is 1 megabyte, 482 00:26:07,970 --> 00:26:11,130 10 megabytes, if the window size keeps growing, 483 00:26:11,130 --> 00:26:13,650 if the window size is part of the input size, 484 00:26:13,650 --> 00:26:15,530 is this constant? 485 00:26:15,530 --> 00:26:16,090 Nope. 486 00:26:16,090 --> 00:26:17,610 So I can't do that. 487 00:26:17,610 --> 00:26:19,850 Someone else gave me the right answer before. 488 00:26:24,750 --> 00:26:26,649 What did you say before? 489 00:26:26,649 --> 00:26:27,690 AUDIENCE: Pre-compute it? 490 00:26:27,690 --> 00:26:28,280 PROFESSOR: OK. 491 00:26:28,280 --> 00:26:30,700 It's a constant, so why don't we pre-compute it? 492 00:26:30,700 --> 00:26:35,070 Take it out of here, compute it once, 493 00:26:35,070 --> 00:26:37,730 and after that, we can use it all the time. 494 00:26:37,730 --> 00:26:40,560 And unless someone has a better name for it, 495 00:26:40,560 --> 00:26:41,835 I'm going to call this magic. 496 00:26:44,679 --> 00:26:46,220 The name has to be short, by the way, 497 00:26:46,220 --> 00:26:47,928 because I'll be writing this a few times. 498 00:26:50,250 --> 00:26:53,520 OK, so now we have hash equals hash times base 499 00:26:53,520 --> 00:26:56,854 minus old times magic plus new modulo p. 500 00:26:56,854 --> 00:26:58,020 Doesn't look too bad, right? 501 00:26:58,020 --> 00:26:59,870 Pretty constant time. 
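The update just written on the board can be sketched in Python as follows. This is a minimal illustration rather than the course's reference code: the function name slide, the digit sequence, and the choice of base 10 with p = 23 are assumptions made for the example. The point it shows is that magic is computed once, before the loop, so each slide uses only small numbers.

```python
def slide(hash_value, old, new, base, magic, p):
    """One sliding-window step: drop `old` on the left, add `new` on the right."""
    # Python's % returns a value in 0..p-1 even if the left side is negative.
    return (hash_value * base - old * magic + new) % p

base, p, size = 10, 23, 3        # illustrative values (assumed)
magic = pow(base, size, p)       # base**size mod p, computed once up front

digits = [3, 1, 4, 1, 5]         # hypothetical "document" of decimal digits
h = 314 % p                      # hash of the first window: 3 1 4
window_hashes = [h]
for i in range(size, len(digits)):
    # digits[i - size] leaves the window, digits[i] enters it
    h = slide(h, digits[i - size], digits[i], base, magic, p)
    window_hashes.append(h)
```

Each entry of window_hashes should agree with reducing the corresponding three-digit window (314, 141, 415) modulo 23 directly.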
502 00:26:59,870 --> 00:27:02,720 Now let's write the pseudocode for the rolling hash, 503 00:27:02,720 --> 00:27:05,980 and let's break this out into an append 504 00:27:05,980 --> 00:27:08,362 and a skip at the same time. 505 00:27:08,362 --> 00:27:12,891 AUDIENCE: What if hash is bigger than your word size? 506 00:27:12,891 --> 00:27:15,390 PROFESSOR: So hash is always going to be something modulo p. 507 00:27:15,390 --> 00:27:16,870 AUDIENCE: Oh that's true, OK. 508 00:27:16,870 --> 00:27:18,890 PROFESSOR: So as long as p is decent, 509 00:27:18,890 --> 00:27:20,217 it's not going to get too big. 510 00:27:20,217 --> 00:27:21,050 AUDIENCE: All right. 511 00:27:21,050 --> 00:27:25,065 What if old and new [INAUDIBLE]-- 512 00:27:25,065 --> 00:27:26,190 PROFESSOR: So old and new-- 513 00:27:26,190 --> 00:27:27,672 AUDIENCE: P is a big number. 514 00:27:27,672 --> 00:27:32,494 314159269 is possibly bigger than your word size, right? 515 00:27:32,494 --> 00:27:33,410 PROFESSOR: Definitely. 516 00:27:33,410 --> 00:27:37,025 So that's why we're getting rid of it. 517 00:27:37,025 --> 00:27:39,000 AUDIENCE: That is true. [INAUDIBLE] 518 00:27:39,000 --> 00:27:42,810 PROFESSOR: So this is k digits in base b. 519 00:27:42,810 --> 00:27:43,390 Too much. 520 00:27:43,390 --> 00:27:44,880 Not going to deal with it. 521 00:27:44,880 --> 00:27:49,610 Hash is one digit in base p, because we're doing it mod p. 522 00:27:49,610 --> 00:27:53,270 Old and new are one digit in base b. 523 00:27:53,270 --> 00:27:56,170 So hopefully small numbers. 524 00:27:56,170 --> 00:27:58,240 OK, I haven't seen a constructor in CLRS, 525 00:27:58,240 --> 00:28:00,590 so I'm going to say that when you write pseudocode, 526 00:28:00,590 --> 00:28:02,330 the method name for a constructor 527 00:28:02,330 --> 00:28:05,099 is init because we've seen this before.
528 00:28:05,099 --> 00:28:07,140 And let's say our constructor for a rolling hash 529 00:28:07,140 --> 00:28:10,680 starts with the base that we're going to use. 530 00:28:10,680 --> 00:28:13,209 And it builds an empty rolling hash, 531 00:28:13,209 --> 00:28:14,500 so first there's nothing in it. 532 00:28:14,500 --> 00:28:16,823 And then you append and you skip and you can get the hash. 533 00:28:16,823 --> 00:28:17,781 AUDIENCE: What about p? 534 00:28:17,781 --> 00:28:18,860 Shouldn't you also do p? 535 00:28:18,860 --> 00:28:20,684 PROFESSOR: Sure. 536 00:28:20,684 --> 00:28:22,680 Do that. 537 00:28:22,680 --> 00:28:29,100 So let's say base and p are set, so something sets base and p. 538 00:28:29,100 --> 00:28:30,970 And we need to compute the initial values 539 00:28:30,970 --> 00:28:31,970 for hash and magic. 540 00:28:38,050 --> 00:28:40,290 What's hash? 541 00:28:40,290 --> 00:28:40,930 Zero. 542 00:28:40,930 --> 00:28:42,263 There's nothing in there, right? 543 00:28:42,263 --> 00:28:43,020 The number is 0. 544 00:28:43,020 --> 00:28:45,180 What's magic? 545 00:28:45,180 --> 00:28:47,180 AUDIENCE: [INAUDIBLE]. 546 00:28:47,180 --> 00:28:49,680 Well, I mean, you can calculate it, right? 547 00:28:49,680 --> 00:28:52,260 PROFESSOR: So magic is base to the size mod p. 548 00:28:52,260 --> 00:28:53,894 What size? 549 00:28:53,894 --> 00:28:56,254 AUDIENCE: [INAUDIBLE] 0. 550 00:28:56,254 --> 00:28:58,076 Just one mod p. 551 00:28:58,076 --> 00:28:58,700 PROFESSOR: Yep. 552 00:28:58,700 --> 00:29:02,340 So when I start, I have an empty sliding window. 553 00:29:02,340 --> 00:29:05,160 Nothing in there, size is 0, base to the size 554 00:29:05,160 --> 00:29:08,490 is 1, whatever the base is. 555 00:29:08,490 --> 00:29:09,600 Very good. 556 00:29:09,600 --> 00:29:10,410 Let's write append. 557 00:29:17,390 --> 00:29:20,310 Hash is? 558 00:29:20,310 --> 00:29:23,720 So here, we're doing both an append and the skip.
559 00:29:23,720 --> 00:29:26,056 We have to figure out which operation belongs 560 00:29:26,056 --> 00:29:28,180 to the append, which operations belong to the skip. 561 00:29:28,180 --> 00:29:29,967 So someone help me out. 562 00:29:33,630 --> 00:29:36,110 AUDIENCE: We know subtraction would [INAUDIBLE]-- 563 00:29:36,110 --> 00:29:39,610 AUDIENCE: Multiply mod base [INAUDIBLE]. 564 00:29:39,610 --> 00:29:40,310 PROFESSOR: Yup. 565 00:29:40,310 --> 00:29:41,943 So this is the append, right? 566 00:29:45,330 --> 00:29:47,300 And this is the skip. 567 00:29:51,010 --> 00:29:53,820 So hash equals hash. 568 00:29:57,500 --> 00:30:04,900 Times base plus new mod p. 569 00:30:04,900 --> 00:30:05,400 Very good. 570 00:30:05,400 --> 00:30:06,300 This is important. 571 00:30:06,300 --> 00:30:08,400 If you don't put this in, Python knows 572 00:30:08,400 --> 00:30:10,760 how to deal with big numbers. 573 00:30:10,760 --> 00:30:12,990 So it will take your code and it'll run it, 574 00:30:12,990 --> 00:30:14,820 and you'll get the correct output. 575 00:30:14,820 --> 00:30:17,060 But hash will keep growing and growing and growing 576 00:30:17,060 --> 00:30:19,830 because you're computing n instead of hash. 577 00:30:19,830 --> 00:30:22,710 And you'll wonder why the code is so slow. 578 00:30:22,710 --> 00:30:24,231 So don't forget this. 579 00:30:24,231 --> 00:30:25,480 What else do I need to update? 580 00:30:29,720 --> 00:30:32,010 OK, I don't have a constant for that, 581 00:30:32,010 --> 00:30:36,360 but I have a constant I for something else. 582 00:30:36,360 --> 00:30:38,460 Magic. 583 00:30:38,460 --> 00:30:40,980 So magic is base to the size mod p. 584 00:30:40,980 --> 00:30:42,549 So what happened to the window size? 585 00:30:42,549 --> 00:30:43,090 AUDIENCE: Oh. 586 00:30:43,090 --> 00:30:44,545 Times base [INAUDIBLE]. 587 00:30:44,545 --> 00:30:45,420 PROFESSOR: Excellent. 
588 00:30:45,420 --> 00:30:47,210 The window size grows by 1, therefore, 589 00:30:47,210 --> 00:30:50,670 I have to multiply this by base. 590 00:30:50,670 --> 00:30:56,110 Magic times base mod p. 591 00:30:56,110 --> 00:30:58,842 AUDIENCE: Does p always have to be less than the base, 592 00:30:58,842 --> 00:31:01,740 or can it be anything? 593 00:31:01,740 --> 00:31:03,560 PROFESSOR: It can be bigger than the base. 594 00:31:03,560 --> 00:31:08,080 So if I want to not have a lot of false positives, 595 00:31:08,080 --> 00:31:10,600 then suppose my base is 256, because that's 596 00:31:10,600 --> 00:31:11,650 an ASCII character. 597 00:31:14,350 --> 00:31:16,750 I was arguing earlier that the number of false positives 598 00:31:16,750 --> 00:31:20,620 that I have is 1/P basically. 599 00:31:20,620 --> 00:31:23,310 So I want p to be as close to the word size as possible. 600 00:31:23,310 --> 00:31:28,670 So p will be around 2 to the 32, about 4 billion. 601 00:31:28,670 --> 00:31:29,920 So definitely bigger. 602 00:31:29,920 --> 00:31:31,480 It can work either way. 603 00:31:31,480 --> 00:31:33,850 It's better if it's bigger for the algorithm 604 00:31:33,850 --> 00:31:35,510 that we're using there. 605 00:31:35,510 --> 00:31:38,690 All right, good question, thank you. 606 00:31:38,690 --> 00:31:40,110 Skip. 607 00:31:40,110 --> 00:31:40,990 Let's implement skip. 608 00:31:44,520 --> 00:31:46,556 Hash is? 609 00:31:46,556 --> 00:31:56,016 AUDIENCE: Hash minus old [INAUDIBLE] then comes magic 610 00:31:56,016 --> 00:31:56,516 [INAUDIBLE]. 611 00:32:05,510 --> 00:32:07,400 PROFESSOR: OK, can I write this in Python? 612 00:32:07,400 --> 00:32:10,130 What happens if I write this? 613 00:32:10,130 --> 00:32:16,929 AUDIENCE: [INAUDIBLE] magic is, [INAUDIBLE] We 614 00:32:16,929 --> 00:32:17,970 won't be able to find it. 615 00:32:17,970 --> 00:32:20,560 PROFESSOR: OK so-- sorry, not in Python.
616 00:32:20,560 --> 00:32:23,520 So assume all these are instance variables done the right way, 617 00:32:23,520 --> 00:32:26,960 but what happens if old times magic is bigger than hash? 618 00:32:29,600 --> 00:32:30,900 I get a negative number. 619 00:32:30,900 --> 00:32:34,090 And in math, people assume that if you 620 00:32:34,090 --> 00:32:40,150 do something like minus 3 modulo 23, you're going to get 20. 621 00:32:43,340 --> 00:32:49,190 So modulo is always positive in modular arithmetic, 622 00:32:49,190 --> 00:32:54,780 but in a programming language, if you do minus 3 modulo 23, 623 00:32:54,780 --> 00:32:59,500 I'm pretty sure you're going to get minus 3. 624 00:32:59,500 --> 00:33:00,580 And things will go bad. 625 00:33:00,580 --> 00:33:02,530 So we want to get to a positive number 626 00:33:02,530 --> 00:33:06,210 here so that the arithmetic modulo 627 00:33:06,210 --> 00:33:09,050 p will work just like in math. 628 00:33:09,050 --> 00:33:12,687 So we want to add something to make this whole thing positive. 629 00:33:12,687 --> 00:33:15,820 AUDIENCE: That's something times [INAUDIBLE]. 630 00:33:15,820 --> 00:33:18,100 PROFESSOR: OK, so if we're working modulo p 631 00:33:18,100 --> 00:33:22,190 then we can add anything to our number, any multiple of p, 632 00:33:22,190 --> 00:33:24,770 and the result modulo p doesn't change. 633 00:33:24,770 --> 00:33:29,220 For example, here to get from minus 3 to 20, I added 23. 634 00:33:29,220 --> 00:33:31,680 Right? 635 00:33:31,680 --> 00:33:35,280 OK, so I want to add a correction 636 00:33:35,280 --> 00:33:37,920 factor of p times something. 637 00:33:37,920 --> 00:33:38,900 So what should that be? 638 00:33:41,721 --> 00:33:43,970 I want to make sure that this whole thing is positive. 639 00:33:52,855 --> 00:33:53,730 AUDIENCE: [INAUDIBLE] 640 00:33:59,290 --> 00:34:00,290 PROFESSOR: So let's see. 641 00:34:00,290 --> 00:34:01,748 How big are these guys, by the way?
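The sign issue can be checked directly. As a hedged aside: Python's % operator already returns a non-negative result for a positive modulus, while a C-style truncated remainder (available in Python through math.fmod) keeps the sign of the left operand, which is the behavior being warned about here. The helper names truncated_mod and corrected_mod are hypothetical, introduced only for this sketch.

```python
import math

def truncated_mod(a, p):
    """Remainder with C-style truncation; the result takes the sign of `a`."""
    return int(math.fmod(a, p))

def corrected_mod(a, p, k):
    """Add k*p before reducing so the truncated remainder is non-negative.

    Assumes k is large enough that a + k*p >= 0.
    """
    return truncated_mod(a + k * p, p)

python_result = -3 % 23                   # Python floors the quotient
c_style_result = truncated_mod(-3, 23)    # truncation keeps the minus sign
fixed_result = corrected_mod(-3, 23, 1)   # adding one multiple of 23 first
```

Adding a multiple of p never changes the residue, so the corrected version agrees with the mathematical convention in either style of language.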
642 00:34:01,748 --> 00:34:04,220 Magic is something mod p, right? 643 00:34:04,220 --> 00:34:07,010 So it's definitely smaller than p. 644 00:34:07,010 --> 00:34:07,840 How about old? 645 00:34:10,822 --> 00:34:12,320 AUDIENCE: [INAUDIBLE] 646 00:34:12,320 --> 00:34:13,760 PROFESSOR: OK. 647 00:34:13,760 --> 00:34:16,083 So smaller than? 648 00:34:16,083 --> 00:34:16,949 AUDIENCE: Base. 649 00:34:16,949 --> 00:34:19,650 PROFESSOR: Base. 650 00:34:19,650 --> 00:34:20,395 Very good. 651 00:34:20,395 --> 00:34:21,770 So this whole thing is definitely 652 00:34:21,770 --> 00:34:25,552 going to be smaller than [INAUDIBLE]. 653 00:34:25,552 --> 00:34:28,010 So this is definitely going to be smaller than base times p, 654 00:34:28,010 --> 00:34:28,500 right? 655 00:34:28,500 --> 00:34:29,583 So let's put that in here. 656 00:34:32,620 --> 00:34:34,949 You can get fancy and say hey, this is smaller than p, 657 00:34:34,949 --> 00:34:37,989 and this is old, so you can put old here instead, same thing. 658 00:34:41,120 --> 00:34:43,710 OK so we have hash. 659 00:34:43,710 --> 00:34:45,019 Now what do we do to magic? 660 00:34:48,880 --> 00:35:00,219 AUDIENCE: [INAUDIBLE] divide it by the base and mod p. 661 00:35:00,219 --> 00:35:04,163 It seems base [? and p ?] don't share factors. 662 00:35:04,163 --> 00:35:05,642 You're allowed to do that? 663 00:35:08,150 --> 00:35:09,820 PROFESSOR: OK, so skip part two. 664 00:35:14,910 --> 00:35:33,940 Magic equals-- So what if my magic is something like 5 665 00:35:33,940 --> 00:35:35,480 and my base is 100? 666 00:35:35,480 --> 00:35:36,990 How is this going to work? 667 00:35:42,580 --> 00:35:45,260 This is where we use fancy math. 668 00:35:45,260 --> 00:35:47,670 And I call it fancy math because I 669 00:35:47,670 --> 00:35:49,430 didn't learn it in high school. 670 00:35:49,430 --> 00:35:53,190 So I'm assuming at least some of you do not know how this works.
671 00:35:53,190 --> 00:35:56,336 So if we're working modulo p, you 672 00:35:56,336 --> 00:35:58,750 can think of 23 if you prefer concrete numbers instead. 673 00:36:03,070 --> 00:36:10,270 For any number a between 1 and p minus 1, 674 00:36:10,270 --> 00:36:13,710 there's something called the multiplicative inverse, 675 00:36:13,710 --> 00:36:17,260 a to the minus 1, that also happens 676 00:36:17,260 --> 00:36:20,270 to be an integer between 1 and p minus 1. 677 00:36:20,270 --> 00:36:26,510 And if you multiply, say, a times b, that's another number. 678 00:36:26,510 --> 00:36:29,550 And then you multiply this by a to the minus 1, 679 00:36:29,550 --> 00:36:34,370 you're going to get to b modulo p. 680 00:36:34,370 --> 00:36:38,202 So a to the minus 1 cancels a in a multiplication. 681 00:36:38,202 --> 00:36:40,160 Now let's see if you guys are paying attention. 682 00:36:40,160 --> 00:36:43,107 What's a times a to the minus 1 modulo p? 683 00:36:43,107 --> 00:36:44,540 AUDIENCE: 1. 684 00:36:44,540 --> 00:36:45,170 PROFESSOR: OK. 685 00:36:45,170 --> 00:36:45,669 Sweet. 686 00:36:48,460 --> 00:36:53,770 So suppose I want to find the multiplicative inverse of 6. 687 00:36:53,770 --> 00:36:54,826 What is it? 688 00:36:54,826 --> 00:36:57,010 AUDIENCE: Is that the mod 23? 689 00:36:57,010 --> 00:36:59,186 PROFESSOR: Yeah. 690 00:36:59,186 --> 00:37:00,880 Can someone think of what it should be? 691 00:37:00,880 --> 00:37:01,680 AUDIENCE: 4. 692 00:37:01,680 --> 00:37:04,560 PROFESSOR: 4, wow, fast. 693 00:37:04,560 --> 00:37:12,230 So 6 times 4 equals 24, which is 1 modulo 23. 694 00:37:12,230 --> 00:37:16,520 Now let's see if this magic really works, this math magic. 695 00:37:16,520 --> 00:37:19,968 So 6 times 7 equals? 696 00:37:19,968 --> 00:37:21,330 AUDIENCE: 42. 697 00:37:21,330 --> 00:37:24,700 PROFESSOR: Which is what mod 23? 698 00:37:24,700 --> 00:37:25,796 Computer guys. 699 00:37:25,796 --> 00:37:26,990 AUDIENCE: Negative 4, so 5.
700 00:37:26,990 --> 00:37:29,815 Ah, just kidding. 701 00:37:29,815 --> 00:37:30,315 Yeah. 702 00:37:35,070 --> 00:37:39,180 PROFESSOR: OK now let's multiply 19 by 4. 703 00:37:39,180 --> 00:37:41,177 What is this? 704 00:37:41,177 --> 00:37:41,718 AUDIENCE: 76. 705 00:37:44,640 --> 00:37:46,190 PROFESSOR: All right, 76 modulo 23? 706 00:37:49,983 --> 00:37:54,210 AUDIENCE: 7 maybe. 707 00:37:54,210 --> 00:37:55,926 PROFESSOR: Are you kidding? 708 00:37:55,926 --> 00:37:57,735 Did you compute it, or did you use-- 709 00:37:57,735 --> 00:38:01,040 AUDIENCE: 69 [INAUDIBLE] 710 00:38:01,040 --> 00:38:02,240 PROFESSOR: OK. 711 00:38:02,240 --> 00:38:05,660 Started with 7, ended with 7. 712 00:38:05,660 --> 00:38:06,415 So this works. 713 00:38:10,640 --> 00:38:13,650 So as long as we're working modulo a prime number, 714 00:38:13,650 --> 00:38:17,450 we can always compute multiplicative inverses. 715 00:38:17,450 --> 00:38:19,490 And Python has a function for that, 716 00:38:19,490 --> 00:38:21,770 so I'll let you Google its standard library 717 00:38:21,770 --> 00:38:23,126 to find out what it is. 718 00:38:23,126 --> 00:38:24,750 But it can be done, that's what matters 719 00:38:24,750 --> 00:38:27,930 as far as we're concerned. 720 00:38:27,930 --> 00:38:35,880 So we're going to say that magic is magic times base to the minus 1 mod 721 00:38:35,880 --> 00:38:41,230 p, which is the multiplicative inverse, everything mod p. 722 00:38:41,230 --> 00:38:43,690 Now suppose this base to the minus 1 modulo p, 723 00:38:43,690 --> 00:38:48,090 this multiplicative inverse algorithm is really slow. 724 00:38:48,090 --> 00:38:54,820 What do we do to stay order 1? 725 00:38:54,820 --> 00:38:55,730 Precompute it. 726 00:38:55,730 --> 00:38:58,810 Base is not going to change. 727 00:38:58,810 --> 00:38:59,990 Very good. 728 00:38:59,990 --> 00:39:07,900 So the inverse of base, ibase, is base to the minus 1 mod p. 729 00:39:11,040 --> 00:39:15,580 So here I replace this with ibase.
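Putting the pieces so far together, a rolling hash with separate append and skip might look like the sketch below. It is an illustration under stated assumptions, not the recitation's board code: the class name RollingHash and attribute names are chosen here, and the inverse is obtained with the built-in pow(base, -1, p), which is available in Python 3.8 and later (the recitation does not say which function it had in mind). One deliberate choice in this sketch: skip updates magic first, so that magic equals base to the (size minus 1), the weight of the oldest digit, before old times magic is subtracted.

```python
class RollingHash:
    """Rolling hash over digits in `base`, reduced modulo a prime `p` (sketch)."""

    def __init__(self, base, p):
        self.base = base
        self.p = p
        self.hash = 0                    # empty window: the number is 0
        self.magic = 1                   # base**size mod p with size == 0
        self.ibase = pow(base, -1, p)    # inverse of base mod p, computed once

    def append(self, new):
        # The window grows by one digit on the right.
        self.hash = (self.hash * self.base + new) % self.p
        self.magic = (self.magic * self.base) % self.p

    def skip(self, old):
        # The window shrinks by one digit on the left.
        self.magic = (self.magic * self.ibase) % self.p
        # old < base and magic < p, so adding p * base keeps the value
        # non-negative before the reduction (not needed in Python itself,
        # but it matters in languages with truncating remainder).
        self.hash = (self.hash - old * self.magic + self.p * self.base) % self.p


# Usage: hash "abc" byte by byte, then drop the leading "a".
rh = RollingHash(256, 1000000007)        # base and prime are assumed values
for byte in b"abc":
    rh.append(byte)
rh.skip(ord("a"))
direct_bc = (ord("b") * 256 + ord("c")) % 1000000007
```

After the skip, rh.hash should match direct_bc, the hash of "bc" computed from scratch, and every stored value stays below p.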
730 00:39:18,180 --> 00:39:20,810 OK so skip part one is there, skip part two is here. 731 00:39:23,350 --> 00:39:24,580 Does this make sense so far? 732 00:39:29,050 --> 00:39:30,720 I see some confusion. 733 00:39:30,720 --> 00:39:32,910 AUDIENCE: A lot. 734 00:39:32,910 --> 00:39:34,410 PROFESSOR: A lot to take in at once? 735 00:39:34,410 --> 00:39:35,480 AUDIENCE: Yes. 736 00:39:35,480 --> 00:39:37,120 PROFESSOR: OK. 737 00:39:37,120 --> 00:39:41,670 So remember this concept. 738 00:39:41,670 --> 00:39:43,640 So this is where we started from. 739 00:39:43,640 --> 00:39:46,020 Then we computed n, then after n, we 740 00:39:46,020 --> 00:39:49,310 worked modulo p to get to hashes. 741 00:39:49,310 --> 00:39:52,914 So by working modulo p, we're able to get rid 742 00:39:52,914 --> 00:39:54,330 of all the big numbers and we only 743 00:39:54,330 --> 00:39:58,070 have small numbers in our rolling hash. 744 00:39:58,070 --> 00:40:01,050 And there's that curveball there, there 745 00:40:01,050 --> 00:40:03,090 is that inverse, multiplicative inverse, 746 00:40:03,090 --> 00:40:05,120 but Python computes it for you, so 747 00:40:05,120 --> 00:40:06,937 as long as it's in the initializer, 748 00:40:06,937 --> 00:40:08,520 here you don't need to worry about it, 749 00:40:08,520 --> 00:40:12,530 because it's not part of the rolling hash operations. 750 00:40:12,530 --> 00:40:15,270 By the way, what's the cost of the rolling hash operations? 751 00:40:15,270 --> 00:40:16,471 What's the cost of new? 752 00:40:20,226 --> 00:40:21,600 Sorry, what's the cost of append? 753 00:40:21,600 --> 00:40:22,480 Not thinking here. 754 00:40:26,750 --> 00:40:27,380 Constant. 755 00:40:27,380 --> 00:40:29,460 All these are small numbers, so the arithmetic 756 00:40:29,460 --> 00:40:31,880 is constant, right? 757 00:40:31,880 --> 00:40:34,965 What's the cost of skip? 758 00:40:34,965 --> 00:40:36,860 Skip part one here, skip part two there.
759 00:40:36,860 --> 00:40:39,440 What's the cost of skip? 760 00:40:39,440 --> 00:40:40,210 Constant. 761 00:40:40,210 --> 00:40:41,437 All the numbers are small. 762 00:40:41,437 --> 00:40:43,270 We went through a lot of effort to get that, 763 00:40:43,270 --> 00:40:46,740 so skip is order 1. 764 00:40:46,740 --> 00:40:47,770 We're missing hash. 765 00:40:47,770 --> 00:40:49,850 How would we implement the hash operation? 766 00:40:49,850 --> 00:40:50,850 A hash query. 767 00:40:55,610 --> 00:40:57,030 It's easy. 768 00:40:57,030 --> 00:40:57,851 Sorry? 769 00:40:57,851 --> 00:41:01,540 AUDIENCE: [INAUDIBLE] lookup [INAUDIBLE]. 770 00:41:01,540 --> 00:41:05,650 PROFESSOR: So a rolling hash has append, skip, and hash. 771 00:41:05,650 --> 00:41:09,890 I want to implement that hash function. 772 00:41:09,890 --> 00:41:11,610 Hash. 773 00:41:11,610 --> 00:41:15,470 We're computing hash all the time. 774 00:41:15,470 --> 00:41:16,080 Return. 775 00:41:16,080 --> 00:41:19,120 Sorry, I didn't understand what you meant by lookup. 776 00:41:19,120 --> 00:41:20,980 AUDIENCE: It's one of our states. 777 00:41:20,980 --> 00:41:21,650 PROFESSOR: Yeah. 778 00:41:21,650 --> 00:41:22,150 Exactly. 779 00:41:22,150 --> 00:41:25,690 So the hash function returns hash, right? 780 00:41:25,690 --> 00:41:26,700 What's the cost of that? 781 00:41:29,290 --> 00:41:32,240 Constant. 782 00:41:32,240 --> 00:41:37,290 So append is constant time, yes? 783 00:41:37,290 --> 00:41:38,740 Skip is constant time. 784 00:41:38,740 --> 00:41:40,140 Hash is constant time. 785 00:41:40,140 --> 00:41:40,880 We're done. 786 00:41:40,880 --> 00:41:41,730 This works. 787 00:41:41,730 --> 00:41:43,510 Any questions on rolling hashes before you 788 00:41:43,510 --> 00:41:46,804 have to implement one of your own? 789 00:41:46,804 --> 00:41:49,750 AUDIENCE: [INAUDIBLE] wouldn't it 790 00:41:49,750 --> 00:41:53,187 be easier to use a shift function?
791 00:41:53,187 --> 00:41:56,089 Then you don't have to think about plus and minus. 792 00:41:56,089 --> 00:41:57,255 PROFESSOR: A shift function. 793 00:41:57,255 --> 00:42:01,970 AUDIENCE: Well I mean like, you can shift bit-wise, right? 794 00:42:01,970 --> 00:42:02,877 PROFESSOR: OK. 795 00:42:02,877 --> 00:42:04,865 AUDIENCE: So you can just use shift 796 00:42:04,865 --> 00:42:06,853 instead of thinking about where to add this, 797 00:42:06,853 --> 00:42:08,841 where to subtract this. 798 00:42:08,841 --> 00:42:10,680 PROFESSOR: Well so I do bit operations 799 00:42:10,680 --> 00:42:13,028 if I'm willing to work with these big numbers. 800 00:42:13,028 --> 00:42:14,611 AUDIENCE: But then you have to compute 801 00:42:14,611 --> 00:42:16,798 the mod of some big number, right? 802 00:42:16,798 --> 00:42:18,742 Like just like that. 803 00:42:18,742 --> 00:42:21,334 For this one, you don't have to, because you 804 00:42:21,334 --> 00:42:23,116 have the original hash [INAUDIBLE]. 805 00:42:27,429 --> 00:42:28,970 AUDIENCE: Oh, you mean the big number 806 00:42:28,970 --> 00:42:31,970 being the actual word you're looking at? 807 00:42:31,970 --> 00:42:33,080 AUDIENCE: Yeah. 808 00:42:33,080 --> 00:42:35,390 PROFESSOR: So doing shift is equivalent to maintaining 809 00:42:35,390 --> 00:42:39,870 a list and pushing and popping things into the list. 810 00:42:39,870 --> 00:42:41,770 And then you have to do a hash, it's 811 00:42:41,770 --> 00:42:44,020 equivalent to looking over the entire list 812 00:42:44,020 --> 00:42:45,560 and computing the hash function. 813 00:42:45,560 --> 00:42:47,726 Because you'd have a big number and you have to take 814 00:42:47,726 --> 00:42:49,000 it modulo 23. 815 00:42:49,000 --> 00:42:51,886 And that's order of the size of the big number. 816 00:42:51,886 --> 00:42:53,260 But we're not allowed to do that. 817 00:42:53,260 --> 00:42:56,430 Hash has to be constant time, otherwise this thing is slow. 
818 00:43:00,705 --> 00:43:03,412 AUDIENCE: Why do we compute magic numbers then? 819 00:43:03,412 --> 00:43:04,870 PROFESSOR: Why do we compute magic? 820 00:43:04,870 --> 00:43:09,770 We compute magic because somewhere here, we 821 00:43:09,770 --> 00:43:14,200 had this base to the size mod p and this could get big. 822 00:43:16,750 --> 00:43:18,890 So I can't afford to keep it around and do math 823 00:43:18,890 --> 00:43:20,020 with it all the time. 824 00:43:20,020 --> 00:43:21,580 So I can't compute base to the size 825 00:43:21,580 --> 00:43:25,346 every time I want to do append. 826 00:43:31,115 --> 00:43:33,365 AUDIENCE: Would it be worth it if you're computing 100 827 00:43:33,365 --> 00:43:37,496 different values for matching and [INAUDIBLE], 828 00:43:37,496 --> 00:43:41,645 so all you'd have to do is, when reassigning magic, 829 00:43:41,645 --> 00:43:42,337 just look up-- 830 00:43:42,337 --> 00:43:43,920 PROFESSOR: So if you do that, then you 831 00:43:43,920 --> 00:43:49,860 have to compute values for all the sizes, right? 832 00:43:49,860 --> 00:43:51,040 For all the window sizes. 833 00:43:51,040 --> 00:43:52,460 AUDIENCE: Right. 834 00:43:52,460 --> 00:43:55,237 So if we assume that window sizes will be less than 100, 835 00:43:55,237 --> 00:43:56,320 it doesn't take very long. 836 00:43:56,320 --> 00:43:58,528 PROFESSOR: Well what if the window size is 1 million? 837 00:44:00,810 --> 00:44:02,350 What if I'm looking for a 1 million 838 00:44:02,350 --> 00:44:05,217 character in a 1 gigabyte string? 839 00:44:05,217 --> 00:44:07,550 AUDIENCE: But wouldn't after all, wouldn't the size just 840 00:44:07,550 --> 00:44:11,971 be around the string, like plus or minus the size of the base? 841 00:44:11,971 --> 00:44:12,471 So-- 842 00:44:12,471 --> 00:44:13,920 AUDIENCE: Only if [INAUDIBLE] 843 00:44:13,920 --> 00:44:16,335 So why would the size change again? 
844 00:44:16,335 --> 00:44:17,895 Why wouldn't it just be-- I mean, 845 00:44:17,895 --> 00:44:19,825 if you're looking at one character. 846 00:44:19,825 --> 00:44:21,950 PROFESSOR: So if I have a sliding window like this, 847 00:44:21,950 --> 00:44:22,908 then it doesn't change. 848 00:44:22,908 --> 00:44:24,790 But if I want to implement a rolling hash, 849 00:44:24,790 --> 00:44:28,960 that's a bit more general and that supports append and skip. 850 00:44:28,960 --> 00:44:30,856 Whenever I append, the size increases. 851 00:44:30,856 --> 00:44:32,355 Whenever I skip, the size decreases. 852 00:44:32,355 --> 00:44:34,688 AUDIENCE: Oh, you're not doing those at every time step. 853 00:44:34,688 --> 00:44:37,089 You're doing them as needed. 854 00:44:37,089 --> 00:44:38,630 PROFESSOR: So I'm trying to implement 855 00:44:38,630 --> 00:44:40,730 that, that can do them in any sequence. 856 00:44:40,730 --> 00:44:41,859 AUDIENCE: Oh. 857 00:44:41,859 --> 00:44:42,400 AUDIENCE: OK. 858 00:44:42,400 --> 00:44:45,240 I thought we were just doing sliding window. 859 00:44:45,240 --> 00:44:47,860 PROFESSOR: So if we're just doing sliding window, you can-- 860 00:44:47,860 --> 00:44:50,555 AUDIENCE: This is really more caterpillar hash instead 861 00:44:50,555 --> 00:44:53,124 of rolling hash, like it's more general. 862 00:44:53,124 --> 00:44:53,790 PROFESSOR: Yeah. 863 00:44:53,790 --> 00:44:55,070 It's a bit more general. 864 00:44:55,070 --> 00:44:57,390 So let's look at rolling hash for the window. 865 00:44:57,390 --> 00:45:00,270 And what you're saying is, hey, the window size 866 00:45:00,270 --> 00:45:02,151 is constant, so-- 867 00:45:02,151 --> 00:45:06,480 AUDIENCE: Why do we repeat magic [INAUDIBLE]? 868 00:45:06,480 --> 00:45:08,650 PROFESSOR: Yeah, if the window size is constant, 869 00:45:08,650 --> 00:45:10,040 then we wouldn't re compute it. 870 00:45:10,040 --> 00:45:10,831 It wouldn't change. 
871 00:45:14,040 --> 00:45:17,080 But with this thing, it's not. 872 00:45:17,080 --> 00:45:18,720 OK. 873 00:45:18,720 --> 00:45:21,502 AUDIENCE: But I guess it doesn't really matter, 874 00:45:21,502 --> 00:45:27,238 but even if you call these in the same order, 875 00:45:27,238 --> 00:45:30,626 then isn't that wasting a lot of computing cycles 876 00:45:30,626 --> 00:45:36,450 because just shrinking and then growing every single operation? 877 00:45:36,450 --> 00:45:39,400 PROFESSOR: Oh, well it turns out that a lot of computing cycles 878 00:45:39,400 --> 00:45:40,720 is still order one, right? 879 00:45:40,720 --> 00:45:41,720 Everything is order one. 880 00:45:41,720 --> 00:45:45,090 So as algorithms people, we don't care. 881 00:45:45,090 --> 00:45:47,380 If you're doing it in a system and you actually 882 00:45:47,380 --> 00:45:49,222 care about that, then OK. 883 00:45:49,222 --> 00:45:50,930 But you're still going to have to compute 884 00:45:50,930 --> 00:45:53,136 the initial value at some point. 885 00:45:53,136 --> 00:45:55,302 AUDIENCE: But if you know window's staying the same, 886 00:45:55,302 --> 00:45:58,320 you don't need to that computation every time? 887 00:45:58,320 --> 00:45:59,485 PROFESSOR: If you-- sorry? 888 00:45:59,485 --> 00:46:01,026 AUDIENCE: If you know you're actually 889 00:46:01,026 --> 00:46:02,730 just doing a window rolling hash-- 890 00:46:02,730 --> 00:46:03,355 PROFESSOR: Yup. 891 00:46:03,355 --> 00:46:05,400 So then you would initialize magic here 892 00:46:05,400 --> 00:46:09,690 to be whatever you want it to be, right? 893 00:46:09,690 --> 00:46:12,780 But then when you add the first few characters to the window, 894 00:46:12,780 --> 00:46:16,630 you have to figure out how to add them. 895 00:46:16,630 --> 00:46:17,840 So the code gets more messy. 896 00:46:17,840 --> 00:46:19,750 It turns out that this is actually 897 00:46:19,750 --> 00:46:24,070 simpler than doing it that way. 
898 00:46:24,070 --> 00:46:27,082 AUDIENCE: [INAUDIBLE] magic I guess 899 00:46:27,082 --> 00:46:29,540 I'm just confused because it seems like we're still working 900 00:46:29,540 --> 00:46:32,687 with the large numbers every time [INAUDIBLE]. 901 00:46:32,687 --> 00:46:33,270 PROFESSOR: Oh. 902 00:46:33,270 --> 00:46:35,020 Let's see. 903 00:46:35,020 --> 00:46:37,547 Mod p, mod p. 904 00:46:37,547 --> 00:46:40,130 AUDIENCE: That's not-- so even though you're still multiplying 905 00:46:40,130 --> 00:46:43,960 magic times base, it doesn't matter. 906 00:46:43,960 --> 00:46:46,960 PROFESSOR: After I'm doing that, I'm reducing it modulo p. 907 00:46:46,960 --> 00:46:47,619 Yeah. 908 00:46:47,619 --> 00:46:49,160 AUDIENCE: And then because we're only 909 00:46:49,160 --> 00:46:51,080 working with the smaller values. 910 00:46:51,080 --> 00:46:51,760 PROFESSOR: Yup. 911 00:46:51,760 --> 00:46:56,760 So everything here stays between 0 and base or 0 and p. 912 00:46:56,760 --> 00:47:02,513 Actually hash is between 0 and p and magic is between 0 and p. 913 00:47:02,513 --> 00:47:03,012 OK. 914 00:47:03,012 --> 00:47:06,026 AUDIENCE: How big does p usually get? 915 00:47:06,026 --> 00:47:08,450 PROFESSOR: How big does p usually get. 916 00:47:08,450 --> 00:47:11,445 So let me get back to this. 917 00:47:14,450 --> 00:47:16,860 So I was arguing that the number of false positives 918 00:47:16,860 --> 00:47:18,810 here is one over O, right? 919 00:47:23,390 --> 00:47:26,490 O is the number of values that the hash function can output. 920 00:47:26,490 --> 00:47:29,360 How many hash values can we output using a rolling hash? 921 00:47:32,258 --> 00:47:33,224 AUDIENCE: P. 922 00:47:33,224 --> 00:47:33,970 PROFESSOR: P. OK. 923 00:47:33,970 --> 00:47:37,100 So the number of false positives is 924 00:47:37,100 --> 00:47:39,232 1/P. So what do we want for p?
925 00:47:39,232 --> 00:47:41,690 AUDIENCE: We want p to be the word size, because but if p's 926 00:47:41,690 --> 00:47:43,726 the word size, then-- 927 00:47:43,726 --> 00:47:45,350 PROFESSOR: So p can't be the word size, 928 00:47:45,350 --> 00:47:46,930 because it has to be prime, right? 929 00:47:46,930 --> 00:47:49,700 But we want it to be big, because as p becomes bigger, 930 00:47:49,700 --> 00:47:51,800 1/P becomes smaller. 931 00:47:51,800 --> 00:47:55,840 So there are two constraints. 932 00:47:55,840 --> 00:47:57,440 We want p to be big so that we don't 933 00:47:57,440 --> 00:47:59,130 have a lot of false positives. 934 00:47:59,130 --> 00:48:03,230 And we want p to be small so that operations 935 00:48:03,230 --> 00:48:05,070 don't take a lot of time. 936 00:48:05,070 --> 00:48:07,660 So in engineering, this is how things work. 937 00:48:07,660 --> 00:48:09,410 We call it a tradeoff because there 938 00:48:09,410 --> 00:48:11,640 are forces pushing in opposite directions, 939 00:48:11,640 --> 00:48:14,170 and it turns out that a reasonable answer to the trade 940 00:48:14,170 --> 00:48:18,120 off is you make p fit in a word so that all those operations 941 00:48:18,120 --> 00:48:22,582 are still implementable by one CPU instruction. 942 00:48:22,582 --> 00:48:24,040 You can't have it be the word size. 943 00:48:24,040 --> 00:48:26,240 So if we're working on a 32-bit computer, 944 00:48:26,240 --> 00:48:29,390 I can't have this be 2 to the 32. 945 00:48:29,390 --> 00:48:33,330 But I can have a prime number that's just a little bit 946 00:48:33,330 --> 00:48:34,640 smaller than 2 to the 32. 947 00:48:34,640 --> 00:48:38,888 AUDIENCE: Wait, why can't it be the word size? 948 00:48:38,888 --> 00:48:41,250 Or why can't it be 2 to the 32? 949 00:48:41,250 --> 00:48:43,610 PROFESSOR: So if p would be this instead of a prime, 950 00:48:43,610 --> 00:48:45,047 then I can't do this. 
951 00:48:45,047 --> 00:48:47,130 AUDIENCE: Oh, right right right, yeah I knew that. 952 00:48:47,130 --> 00:48:48,330 PROFESSOR: There are a lot of moving parts here 953 00:48:48,330 --> 00:48:49,777 and they're all interconnected. 954 00:48:49,777 --> 00:48:52,110 AUDIENCE: You could do that for any prime number, right? 955 00:48:52,110 --> 00:48:52,845 PROFESSOR: Yup. 956 00:48:52,845 --> 00:48:54,470 So this works for prime numbers, but it 957 00:48:54,470 --> 00:48:57,186 doesn't work for non-prime numbers. 958 00:48:57,186 --> 00:48:59,310 AUDIENCE: You could find the multiplicative inverse 959 00:48:59,310 --> 00:49:03,714 for any prime number in base 32. 960 00:49:03,714 --> 00:49:04,545 Is that true? 961 00:49:04,545 --> 00:49:06,886 I mean any odd number is what I'm trying to say. 962 00:49:06,886 --> 00:49:08,107 No, that's not true. 963 00:49:08,107 --> 00:49:10,190 PROFESSOR: I refuse to answer hard math questions. 964 00:49:10,190 --> 00:49:11,981 AUDIENCE: They need to be relatively prime. 965 00:49:11,981 --> 00:49:13,810 They need to share no factors. 966 00:49:13,810 --> 00:49:15,840 PROFESSOR: Yes, it might be true. 967 00:49:15,840 --> 00:49:19,345 AUDIENCE: So an odd will not share a factor with 2 968 00:49:19,345 --> 00:49:21,040 to the 32? 969 00:49:21,040 --> 00:49:23,165 PROFESSOR: You're forcing me to remember hard math. 970 00:49:23,165 --> 00:49:25,206 AUDIENCE: Yeah, I totally just thought about this 971 00:49:25,206 --> 00:49:26,880 as [INAUDIBLE] number. 972 00:49:26,880 --> 00:49:29,890 PROFESSOR: So, no, it turns out that there's 973 00:49:29,890 --> 00:49:33,620 no-- if you're working modulo a non-prime base, then 974 00:49:33,620 --> 00:49:35,620 there's no multiplicative inverses. 975 00:49:35,620 --> 00:49:38,390 So some numbers have no multiplicative inverses, 976 00:49:38,390 --> 00:49:42,730 and other numbers have more than one multiplicative inverse. 977 00:49:42,730 --> 00:49:44,870 And then the whole thing doesn't work.
978 00:49:44,870 --> 00:49:47,970 So let me see if I can make this work 979 00:49:47,970 --> 00:49:50,020 without having an example by hand. 980 00:49:50,020 --> 00:49:53,290 Let's say we're working mod 8, right? 981 00:49:53,290 --> 00:49:53,790 Mod 8. 982 00:49:53,790 --> 00:49:58,390 So 2 to the minus 1 mod 8 is not going to exist, right? 983 00:49:58,390 --> 00:49:59,650 AUDIENCE: Right, but 3 will. 984 00:49:59,650 --> 00:50:00,191 PROFESSOR: 3. 985 00:50:00,191 --> 00:50:02,430 Let's see what do we use? 986 00:50:02,430 --> 00:50:04,670 3 times 3 is 9, right? 987 00:50:04,670 --> 00:50:06,580 So this is 1. 988 00:50:06,580 --> 00:50:08,180 How about 3 times 5? 989 00:50:13,040 --> 00:50:16,540 15 mod 8 is 7. 990 00:50:16,540 --> 00:50:19,051 So 3 and 5, and then-- 991 00:50:19,051 --> 00:50:19,592 AUDIENCE: 11. 992 00:50:24,207 --> 00:50:24,790 PROFESSOR: OK. 993 00:50:27,780 --> 00:50:30,250 So 3 times 7 would be 21, 5. 994 00:50:34,030 --> 00:50:40,390 OK so 3 and-- 3 is the multiplicative inverse 995 00:50:40,390 --> 00:50:43,281 of itself, and 5 and 7 are-- yeah. 996 00:50:43,281 --> 00:50:45,690 I have to build a more complicated example, 997 00:50:45,690 --> 00:50:48,888 but this breaks down in some cases. 998 00:50:52,100 --> 00:50:53,690 I'll have to get back to you. 999 00:50:53,690 --> 00:50:56,074 I will look at my notes for modular arithmetic 1000 00:50:56,074 --> 00:50:57,740 and I'll get back to you guys over email 1001 00:50:57,740 --> 00:50:59,560 for why and how that breaks. 1002 00:50:59,560 --> 00:51:00,225 Yes. 1003 00:51:00,225 --> 00:51:01,808 AUDIENCE: Sorry, can you tell me again 1004 00:51:01,808 --> 00:51:03,239 why we did the part 2 in skip? 1005 00:51:03,239 --> 00:51:06,578 Like why did we do that? 1006 00:51:06,578 --> 00:51:10,290 I'm not really sure [INAUDIBLE]. 
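[Editor's sketch, not part of the recitation] The mod 8 experiment on the board can be replayed by brute force. The helper names `inverses_mod` and `invertible` are hypothetical. The sketch leans on a standard fact from modular arithmetic: a number a has a multiplicative inverse modulo m exactly when gcd(a, m) = 1, and when an inverse exists it is unique modulo m.

```python
from math import gcd


def invertible(a, m):
    """True when a has a multiplicative inverse modulo m."""
    return gcd(a, m) == 1


def inverses_mod(m):
    """Map each a in 1..m-1 to its inverse modulo m, or None if it has none."""
    table = {}
    for a in range(1, m):
        # Brute force: look for b with a * b == 1 (mod m).
        table[a] = next((b for b in range(1, m) if (a * b) % m == 1), None)
    return table


mod8 = inverses_mod(8)   # composite modulus
mod7 = inverses_mod(7)   # prime modulus
```

Modulo 8, the odd residues 1, 3, 5 and 7 are each their own inverse and the even residues have none; modulo the prime 7, every nonzero residue has one. For the rolling hash the number that must be invertible is the base. With p = 2^32 and an even base such as 256 (an assumed value, chosen only for illustration), gcd(256, 2^32) = 256, so the base has no inverse and the magic update in skip cannot be carried out.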
1007 00:51:10,290 --> 00:51:13,120 PROFESSOR: So we started with magic 1 1008 00:51:13,120 --> 00:51:16,840 and then we-- in order for this to work, 1009 00:51:16,840 --> 00:51:20,279 we agree that magic will be base to the size modulo p 1010 00:51:20,279 --> 00:51:20,820 all the time. 1011 00:51:20,820 --> 00:51:23,900 So this has to be [INAUDIBLE] invariant for my rolling hash. 1012 00:51:23,900 --> 00:51:27,770 When I do an append, the size increases by 1. 1013 00:51:27,770 --> 00:51:31,010 And then I multiply by base modulo p. 1014 00:51:31,010 --> 00:51:35,610 When I do a skip, the size decreases by 1. 1015 00:51:35,610 --> 00:51:39,360 So I have to change magic, because magic is always 1016 00:51:39,360 --> 00:51:43,100 base to the size, so I have to update it. 1017 00:51:43,100 --> 00:51:44,740 So this is why this happened. 1018 00:51:44,740 --> 00:51:47,200 Because initially, I wanted to update 1019 00:51:47,200 --> 00:51:49,580 by dividing it by base, right? 1020 00:51:49,580 --> 00:51:51,750 Magic divided by base. 1021 00:51:51,750 --> 00:51:54,080 But if magic is 5 and base is 100, 1022 00:51:54,080 --> 00:51:56,450 we're not going to get an integer. 1023 00:51:56,450 --> 00:51:58,630 And we want to stay within integers, 1024 00:51:58,630 --> 00:52:02,740 so that's when I pulled out fancy math and-- OK. 1025 00:52:05,560 --> 00:52:06,670 OK. 1026 00:52:06,670 --> 00:52:08,800 So how are we doing with rolling hashes? 1027 00:52:08,800 --> 00:52:10,764 Good? 1028 00:52:10,764 --> 00:52:12,930 AUDIENCE: All this math will be in the notes, right? 1029 00:52:12,930 --> 00:52:13,846 PROFESSOR: Everything. 1030 00:52:13,846 --> 00:52:16,920 Oh, yeah, everything else will be in the notes. 1031 00:52:16,920 --> 00:52:22,880 Before we close out, I want to show you one cute thing. 1032 00:52:22,880 --> 00:52:24,464 Who remembers amortized analysis? 1033 00:52:24,464 --> 00:52:26,630 I know there's one person that said they understood.
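[Editor's sketch, not part of the recitation] A minimal rolling hash that keeps the invariant described above, magic = base to the size modulo p. The names `hash`, `magic`, `append` and `skip` follow the discussion; the class name `RollingHash`, the field `ibase`, the function `find`, and the constants `BASE = 256` and `P = 2**32 - 5` (a prime just below 2 to the 32) are assumptions made for illustration. The inverse of base is computed with Fermat's little theorem, which needs p to be prime and base not divisible by p; the recitation's own code may compute it differently.

```python
BASE = 256            # assumed alphabet size
P = 4294967291        # 2**32 - 5, a prime that still fits in a 32-bit word


class RollingHash:
    def __init__(self, base=BASE, p=P):
        self.base, self.p = base, p
        # Inverse of base modulo the prime p (Fermat's little theorem).
        self.ibase = pow(base, p - 2, p)
        self.hash = 0
        self.magic = 1    # invariant: magic == base ** size % p
        self.size = 0

    def append(self, new):
        """Slide a new character code in on the right."""
        self.hash = (self.hash * self.base + new) % self.p
        self.magic = (self.magic * self.base) % self.p
        self.size += 1

    def skip(self, old):
        """Drop the leftmost character code."""
        # "Dividing" magic by base means multiplying by the inverse of base.
        self.magic = (self.magic * self.ibase) % self.p
        self.hash = (self.hash - old * self.magic) % self.p
        self.size -= 1


def find(key, doc):
    """Index of the first occurrence of a non-empty key in doc, or -1."""
    target, window = RollingHash(), RollingHash()
    for ch in key:
        target.append(ord(ch))
    for i, ch in enumerate(doc):
        window.append(ord(ch))
        if window.size > len(key):
            window.skip(ord(doc[i - len(key)]))
        if window.size == len(key) and window.hash == target.hash:
            start = i - len(key) + 1
            if doc[start:i + 1] == key:   # rule out a false positive
                return start
    return -1
```

Every intermediate value is reduced modulo p, so hash and magic stay below p, which is the point made in the exchange about large numbers. The string comparison runs only when the two hashes agree.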
1034 00:52:30,490 --> 00:52:31,925 All 1035 00:52:31,925 --> 00:52:33,550 PROFESSOR: The growing, shrinking thing 1036 00:52:33,550 --> 00:52:34,633 is what we did in lecture. 1037 00:52:34,633 --> 00:52:36,290 I want to show something else. 1038 00:52:36,290 --> 00:52:37,954 I want to show you a binary tree. 1039 00:52:37,954 --> 00:52:40,370 A binary search tree, because you've seen this on the p set 1040 00:52:40,370 --> 00:52:43,300 and you already hate it. 1041 00:52:43,300 --> 00:52:45,180 AUDIENCE: Why'd they call it amortization? 1042 00:52:45,180 --> 00:52:47,640 Because I looked it up online, it means to kill, 1043 00:52:47,640 --> 00:52:51,760 and so I'm like, why not say like, attrition or something 1044 00:52:51,760 --> 00:52:53,270 else that's a little bit less-- 1045 00:52:53,270 --> 00:52:58,410 PROFESSOR: Amortization is also used in accounting to mean 1046 00:52:58,410 --> 00:52:59,260 you're-- 1047 00:52:59,260 --> 00:53:00,093 [INTERPOSING VOICES] 1048 00:53:05,594 --> 00:53:07,510 PROFESSOR: Let's use the growing hash example, 1049 00:53:07,510 --> 00:53:10,300 because that's good for why this is the case. 1050 00:53:10,300 --> 00:53:13,470 So when you're growing your table, you're inserting. 1051 00:53:13,470 --> 00:53:15,230 If you still have space, that's order one. 1052 00:53:15,230 --> 00:53:18,540 If not, you have to grow your table to insert. 1053 00:53:18,540 --> 00:53:20,710 And that is more expensive. 1054 00:53:20,710 --> 00:53:24,220 That's order n where n is how many elements you had before. 1055 00:53:24,220 --> 00:53:27,700 So if you graph these costs, if you start off 1056 00:53:27,700 --> 00:53:30,560 with a table of size one, you can insert the first element 1057 00:53:30,560 --> 00:53:32,482 for a cost of one. 1058 00:53:32,482 --> 00:53:34,690 For the second element, you have to resize the table, 1059 00:53:34,690 --> 00:53:35,720 so it's a cost of two.
1060 00:53:38,495 --> 00:53:40,620 Now when you're trying to insert the third element, 1061 00:53:40,620 --> 00:53:43,600 you have to resize the table again to a size of 4. 1062 00:53:43,600 --> 00:53:46,460 But when you insert the fourth element, it's free. 1063 00:53:46,460 --> 00:53:48,400 Well, cost of one. 1064 00:53:48,400 --> 00:53:50,690 When you insert the fifth element, 1065 00:53:50,690 --> 00:53:53,490 you have to resize a table to the size of eight, right? 1066 00:53:53,490 --> 00:53:57,540 So the table size is one, two, four, four, and now it's eight. 1067 00:53:57,540 --> 00:53:59,560 But because they resized this to eight, 1068 00:53:59,560 --> 00:54:02,730 the next three insertions are going to be order one. 1069 00:54:02,730 --> 00:54:07,350 And then the one after that is going to make the table be 16. 1070 00:54:07,350 --> 00:54:11,765 So I can do seven insertions for free 1071 00:54:11,765 --> 00:54:13,140 and then I'm going to have to pay 1072 00:54:13,140 --> 00:54:17,450 a lot more for the next one. 1073 00:54:17,450 --> 00:54:18,520 Someone said dampening. 1074 00:54:18,520 --> 00:54:21,170 I like dampening because the idea behind amortization 1075 00:54:21,170 --> 00:54:25,160 is that you can take-- you have these big costs 1076 00:54:25,160 --> 00:54:27,030 and they don't occur very often. 1077 00:54:27,030 --> 00:54:30,190 So you can think of it as taking these big costs 1078 00:54:30,190 --> 00:54:31,690 and chopping them up. 1079 00:54:31,690 --> 00:54:35,170 For example, I'm going to chop this up into four 1080 00:54:35,170 --> 00:54:38,010 and I'm going to take this piece and put it here. 1081 00:54:38,010 --> 00:54:39,940 This piece and put it here. 1082 00:54:39,940 --> 00:54:42,000 This piece and put it here. 1083 00:54:42,000 --> 00:54:43,933 And then I'm going to chop this guy into two, 1084 00:54:43,933 --> 00:54:46,925 and then take this piece and put it here.
1085 00:54:46,925 --> 00:54:48,550 And the beginning's a little bit weird, 1086 00:54:48,550 --> 00:54:51,130 let's not worry about that but this guy, 1087 00:54:51,130 --> 00:54:55,350 if I chop this guy up into eight, it's going to happen, 1088 00:54:55,350 --> 00:54:57,600 is it? 1089 00:54:57,600 --> 00:55:04,180 Well we can put-- so this guy grows exponentially, right? 1090 00:55:04,180 --> 00:55:06,070 Every time it's multiplied by 2. 1091 00:55:06,070 --> 00:55:09,680 But the gap size here also is multiplied by 2. 1092 00:55:09,680 --> 00:55:13,230 So when I chop this up and I redistribute the pieces, 1093 00:55:13,230 --> 00:55:17,040 it turns out that the pieces are the same size. 1094 00:55:17,040 --> 00:55:21,254 So if I apply a dampening function that does this, 1095 00:55:21,254 --> 00:55:23,170 then the costs are going to look-- they're not 1096 00:55:23,170 --> 00:55:24,280 going to be order one, they are 1097 00:55:24,280 --> 00:55:25,571 going to be three or something. 1098 00:55:28,690 --> 00:55:30,896 And they look like this. 1099 00:55:30,896 --> 00:55:33,020 Now, my CPU time is going to look like this, right? 1100 00:55:33,020 --> 00:55:34,770 That's not going to change, because that's 1101 00:55:34,770 --> 00:55:36,370 what's really happening. 1102 00:55:36,370 --> 00:55:38,380 But what I can argue for is that if I 1103 00:55:38,380 --> 00:55:44,060 look for a bunch of operations, say if I look at the first 16 1104 00:55:44,060 --> 00:55:50,830 insertions, the cost of those is the sum of these guys. 1105 00:55:50,830 --> 00:55:52,770 So it's not n squared, which is 1106 00:55:52,770 --> 00:55:55,520 what you would get if you look at the worst cases here, 1107 00:55:55,520 --> 00:55:56,900 but it's order n. 1108 00:55:56,900 --> 00:55:58,390 So this is what's being dampened, 1109 00:55:58,390 --> 00:56:00,692 the amount of time an operation takes. 1110 00:56:04,857 --> 00:56:05,940 Does this make some sense?
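[Editor's sketch, not part of the recitation] The cost graph described above can be simulated directly. The cost model is an assumption chosen to match the board: writing an element costs 1, and an insertion that finds the table full first doubles the capacity and pays 1 for each element it copies. The function name `insertion_costs` is hypothetical.

```python
def insertion_costs(n):
    """Per-insertion costs for n insertions into a table that doubles when full."""
    capacity, size, costs = 1, 0, []
    for _ in range(n):
        cost = 1                  # writing the new element
        if size == capacity:
            cost += size          # copy every existing element to the new table
            capacity *= 2
        size += 1
        costs.append(cost)
    return costs


costs16 = insertion_costs(16)
total16 = sum(costs16)
```

For the first 16 insertions this gives the spikes at the second, third, fifth and ninth insertion, and a total of 31, which is below 3 times 16. Under this model the total for n insertions is n plus the sum of the powers of two below n, so it stays under 3n, which is one way to read the "three or something" per operation.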
1111 00:56:09,430 --> 00:56:12,870 All right, I want to show you a cute example for amortization. 1112 00:56:12,870 --> 00:56:14,560 And I'll try to make it quick. 1113 00:56:14,560 --> 00:56:18,308 So how do you list the keys in a binary search tree in order? 1114 00:56:18,308 --> 00:56:20,940 AUDIENCE: [INAUDIBLE] 1115 00:56:20,940 --> 00:56:23,200 PROFESSOR: In-order traversal, right? 1116 00:56:23,200 --> 00:56:25,170 OK, there's another way of doing it 1117 00:56:25,170 --> 00:56:27,630 that makes perfect intuitive sense. 1118 00:56:27,630 --> 00:56:30,970 Get the minimum key, right? 1119 00:56:30,970 --> 00:56:35,980 And then output the minimum key, then 1120 00:56:35,980 --> 00:56:39,282 while you can get the next largest, which 1121 00:56:39,282 --> 00:56:44,660 is the successor-- so while this is not None, 1122 00:56:44,660 --> 00:56:48,379 output that key, right? 1123 00:56:48,379 --> 00:56:50,170 If you do the thing with in-order traversal, 1124 00:56:50,170 --> 00:56:51,580 you get order n running time. 1125 00:56:51,580 --> 00:56:54,352 What's the running time for this? 1126 00:56:54,352 --> 00:56:55,018 AUDIENCE: Order n. 1127 00:56:58,006 --> 00:57:00,452 You're going through all the keys, too. 1128 00:57:00,452 --> 00:57:01,910 PROFESSOR: Yeah, but next largest-- 1129 00:57:01,910 --> 00:57:04,500 what's the running time for next largest? 1130 00:57:04,500 --> 00:57:06,000 AUDIENCE: Log n. 1131 00:57:06,000 --> 00:57:07,660 PROFESSOR: So this guy's log n, right? 1132 00:57:10,240 --> 00:57:17,640 So I have n keys, so this whole thing is O of n logn. 1133 00:57:17,640 --> 00:57:21,140 So it's definitely not bigger than n logn. 1134 00:57:21,140 --> 00:57:24,630 But now, let's look at what happens using the tree. 1135 00:57:24,630 --> 00:57:29,190 When I call min, I go down on each edge. 1136 00:57:29,190 --> 00:57:32,600 And then I call successor and it outputs this guy. 1137 00:57:32,600 --> 00:57:35,460 Then I call successor and it goes here.
1138 00:57:35,460 --> 00:57:38,190 Then I call successor and it goes up here and here 1139 00:57:38,190 --> 00:57:39,370 and outputs this guy. 1140 00:57:39,370 --> 00:57:41,230 Successor goes here. 1141 00:57:41,230 --> 00:57:44,850 Successor goes here. 1142 00:57:44,850 --> 00:57:51,010 Successor goes all the way down here, successor goes up here, 1143 00:57:51,010 --> 00:57:54,280 successor goes here, and then successor 1144 00:57:54,280 --> 00:57:59,045 goes all the way up to the root and gives up. 1145 00:57:59,045 --> 00:57:59,920 AUDIENCE: [INAUDIBLE] 1146 00:58:03,680 --> 00:58:07,720 PROFESSOR: So how many times do I traverse each edge? 1147 00:58:07,720 --> 00:58:08,740 Exactly twice, right? 1148 00:58:08,740 --> 00:58:11,520 How many edges in the tree? 1149 00:58:11,520 --> 00:58:15,660 If I have n nodes, how many lines do I use to connect them? 1150 00:58:15,660 --> 00:58:17,210 [INAUDIBLE] 1151 00:58:17,210 --> 00:58:20,020 So 1 node, zero lines. 1152 00:58:20,020 --> 00:58:21,370 2 nodes, one line. 1153 00:58:21,370 --> 00:58:22,860 Three nodes, two lines. 1154 00:58:22,860 --> 00:58:25,964 So n nodes, n minus one. 1155 00:58:25,964 --> 00:58:27,395 N asymptotically, good. 1156 00:58:27,395 --> 00:58:28,440 Good answer. 1157 00:58:28,440 --> 00:58:30,250 Order n edges. 1158 00:58:30,250 --> 00:58:33,840 Right, each edge gets traversed exactly twice. 1159 00:58:33,840 --> 00:58:41,760 So amortized cost for n next largest operations is order n. 1160 00:58:41,760 --> 00:58:44,230 So you can do this instead. 1161 00:58:44,230 --> 00:58:48,440 This code makes a lot more sense than in order traversal. 1162 00:58:48,440 --> 00:58:51,876 OK, and the last part is remember that list 1163 00:58:51,876 --> 00:58:53,000 query that was on the p set?
1164 00:58:58,140 --> 00:59:01,840 Turns out you can do a find for the lowest element 1165 00:59:01,840 --> 00:59:04,160 and then call successor until you see the highest 1166 00:59:04,160 --> 00:59:11,070 element for the same argument. 1167 00:59:11,070 --> 00:59:13,100 Well, I couldn't tell you this for the p set 1168 00:59:13,100 --> 00:59:14,975 because we hadn't learned amortized analysis, 1169 00:59:14,975 --> 00:59:18,430 so you wouldn't be able to prove that your code is fast. 1170 00:59:18,430 --> 00:59:19,950 But now if you get the intuition, 1171 00:59:19,950 --> 00:59:21,910 you can write it that way. 1172 00:59:21,910 --> 00:59:23,650 And your code will still be fast. 1173 00:59:23,650 --> 00:59:25,090 Same running time. 1174 00:59:27,970 --> 00:59:30,620 So the intuition for that is a bit more complicated. 1175 00:59:30,620 --> 00:59:31,960 The proof is more complicated. 1176 00:59:31,960 --> 00:59:36,280 But the intuition is that say this is l and this is h. 1177 00:59:36,280 --> 00:59:39,720 Then I'm going to go in this tree here. 1178 00:59:39,720 --> 00:59:42,460 So the same edge magic is going to happen, 1179 00:59:42,460 --> 00:59:46,060 except there will be logn edges that are unmatched here 1180 00:59:46,060 --> 00:59:48,790 and logn edges that are unmatched here. 1181 00:59:48,790 --> 00:59:50,980 Because once I find the node that's next to h, 1182 00:59:50,980 --> 00:59:51,970 I'll stop, right? 1183 00:59:51,970 --> 00:59:55,140 So some edges will not be matched. 1184 00:59:55,140 --> 00:59:59,430 So then I'll say that the total running time is logn plus i. 1185 01:00:01,669 --> 01:00:04,210 AUDIENCE: i being the number of elements you pull out, right? 1186 01:00:04,210 --> 01:00:06,410 PROFESSOR: Yup. 1187 01:00:06,410 --> 01:00:07,660 So this is amortized analysis. 1188 01:00:10,330 --> 01:00:11,960 The list is hard. 1189 01:00:11,960 --> 01:00:13,060 The traversal is easy. 1190 01:00:13,060 --> 01:00:15,730 Remember the traversal.
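[Editor's sketch, not part of the recitation] The edge-counting argument can be checked on a small tree. Everything here is an illustrative assumption: an unbalanced binary search tree with parent pointers, hypothetical helpers `insert`, `minimum`, `successor`, `in_order` and `list_range`, and a one-element list `hops` that counts every edge crossed.

```python
class Node:
    def __init__(self, key, parent=None):
        self.key, self.parent = key, parent
        self.left = self.right = None


def insert(root, key):
    """Plain (unbalanced) BST insert; returns the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key, node))
            return root
        node = child


def minimum(node, hops):
    while node.left is not None:
        node = node.left
        hops[0] += 1                  # one edge downward
    return node


def successor(node, hops):
    if node.right is not None:
        hops[0] += 1                  # step into the right subtree
        return minimum(node.right, hops)
    parent = node.parent
    while parent is not None and node is parent.right:
        node, parent = parent, parent.parent
        hops[0] += 1                  # climb while we are a right child
    if parent is not None:
        hops[0] += 1                  # final step up to the successor
    return parent


def in_order(root):
    """List all keys with min + repeated successor; also return edges crossed."""
    hops, keys = [0], []
    node = minimum(root, hops)
    while node is not None:
        keys.append(node.key)
        node = successor(node, hops)
    return keys, hops[0]


def list_range(root, low, high):
    """Keys k with low <= k <= high: find the first one, then follow successors."""
    node, start = root, None
    while node is not None:           # smallest key >= low
        if node.key >= low:
            start, node = node, node.left
        else:
            node = node.right
    keys, hops = [], [0]
    while start is not None and start.key <= high:
        keys.append(start.key)
        start = successor(start, hops)
    return keys


root = None
for k in [8, 3, 10, 1, 6, 14, 4, 7, 13]:
    root = insert(root, k)
keys, edges_crossed = in_order(root)
```

On this 9-node tree the full listing crosses 16 edges, twice the 8 edges in the tree, even though a single successor call can climb several levels. The log n term in the list bound is the cost of the initial find plus the unmatched edges, and it holds when the tree height is O(log n), as in a balanced tree; the sketch above does no balancing.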
1191 01:00:15,730 --> 01:00:17,980 That's easy to reason about, so that's good. 1192 01:00:17,980 --> 01:00:18,793 OK. 1193 01:00:18,793 --> 01:00:20,292 Any questions on amortized analysis? 1194 01:00:23,460 --> 01:00:25,660 So the idea is that you look at all the operations, 1195 01:00:25,660 --> 01:00:27,810 you don't look at one operation at a time. 1196 01:00:27,810 --> 01:00:31,310 And you're trying to see if I look at everything, 1197 01:00:31,310 --> 01:00:33,780 is it the case that I have some really fast operations 1198 01:00:33,780 --> 01:00:35,490 and the slow operations don't happen 1199 01:00:35,490 --> 01:00:37,580 too much, because if that's the case, 1200 01:00:37,580 --> 01:00:40,990 then I can make an argument for the average cost, which 1201 01:00:40,990 --> 01:00:44,780 is better than the argument that says this is the worst 1202 01:00:44,780 --> 01:00:46,820 case of an operation, I'm doing n operations, 1203 01:00:46,820 --> 01:00:49,254 the total cost is n times the worst cost. 1204 01:00:52,740 --> 01:00:54,910 Make some sense? 1205 01:00:54,910 --> 01:00:55,580 OK. 1206 01:00:55,580 --> 01:00:57,630 Cool. 1207 01:00:57,630 --> 01:00:58,130 All right. 1208 01:00:58,130 --> 01:01:00,580 Have fun at the next p set.