The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: A trilogy, if you will, on hashing. We did a lot of cool hashing stuff. In some sense, we already have what we want with hashing. Hashing with chaining, we can do constant expected time-- I should say, constant as long as-- yeah. If we're doing insert, delete, and exact search. Is this key in there? If so, return the item. Otherwise, say no. And we do that with hashing with chaining. And the analysis we did was with simple uniform hashing. An alternative is to use universal hashing, which is not really in this class. But if you find this weird, then this is less weird.

And hashing with chaining, the idea was we had this giant universe of all keys-- could be actually all integers, so it's infinite. But then what we actually are storing in our structure is some finite set of n keys. Here, I'm labeling them k1 through k4; n is four. But in general, you don't know what they're going to be. We reduce that to a table of size m by this hash function h-- stuff drawn in red. And so here I have a three-way collision. These three keys all map to one, and so I store a linked list of k1, k4, and k2. They're in no particular order. That's the point of that picture. Here k3 happens to map to its own slot. And the other slots are empty, so they just have a null saying there's an empty linked list there.

Total size of this structure is n plus m. There's m to store the table. There's n because the sum of the lengths of all the lists is going to be n. And then we said the expected chain length-- if everything's uniform, then the probability of a particular key going to a particular slot is 1/m.
And if everything's nice and independent, or if you use universal hashing, you can show that the expected chain length is n/m-- n independent trials, each with probability 1/m of falling here. And we call that alpha, the load factor. And we concluded that the operation time to do an insert, delete, or search was order 1 plus alpha. So that's in expectation.

So that was hashing with chaining. This is good news. As long as alpha is a constant, we get constant time. And just for recollection, today we're not really going to be thinking too much about what the hash function is, but just remember two of them I talked about. This one we actually will use today, where you just take the key and take it modulo m. That's one easy way of mapping all integers into the space zero through m minus 1. That's called the division method. The multiplication method is more fancy. You multiply by a random integer, and then you look at the middle of that multiplication. And that's where lots of copies of the key k get mixed up together, and that's sort of where the name "hashing" comes from. And that's a better hash function in the real world.

So that's hashing with chaining. Cool? Now, it seemed like a complete picture, but there's one crucial thing that we're missing here. Any suggestions? If I went to implement this data structure, what don't I know how to do? And one answer could be the hash function, but we're going to ignore that. I know you know the answer. Does anyone else know the answer? Yeah.

AUDIENCE: Grow the table.

PROFESSOR: Grow the table. Yeah. The question is, what should m be? OK, we have to create a table of size m, and we put our keys into it. We know we'd like m to be about the same as n. But the trouble is we don't really know n, because insertions come along, and then we might have to grow the table.
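To make the recap concrete, here is a minimal sketch of hashing with chaining using the division method. The class name and details are illustrative, not the staff code, and the table size m is fixed at construction, which is exactly the gap the rest of the lecture fills in.

    class ChainedHashTable:
        """Minimal sketch of hashing with chaining using the division method.
        m is fixed here; choosing and resizing m is the topic that follows."""
        def __init__(self, m=8):
            self.m = m
            self.slots = [[] for _ in range(m)]    # one (possibly empty) chain per slot

        def _h(self, key):
            return key % self.m                    # division method: h(k) = k mod m

        def insert(self, key, value):
            chain = self.slots[self._h(key)]
            for i, (k, _) in enumerate(chain):     # replace if the key is already present
                if k == key:
                    chain[i] = (key, value)
                    return
            chain.append((key, value))

        def search(self, key):
            for k, v in self.slots[self._h(key)]:  # scan one chain: expected length alpha = n/m
                if k == key:
                    return v
            return None

        def delete(self, key):
            chain = self.slots[self._h(key)]
            self.slots[self._h(key)] = [(k, v) for (k, v) in chain if k != key]

For the multiplication method mentioned above, _h would instead look something like ((a * key) % 2**w) >> (w - r) for a fixed random odd w-bit integer a and a table size m = 2**r; that detail isn't needed for anything below.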
If n gets really big relative to m, we're in trouble, because this factor will go up and it will no longer be constant time. On the other hand, if we set m to be really big, we're also kind of wasteful. The whole point of this structure was to avoid having one slot for every possible key, because that was giant. We want to save space. So we want m to be big enough that our structure is fast, but small enough that it's not wasteful in space. And so that's the remaining question.

We want m to be Theta of n. We want it to be Omega of n-- so we want it to be at least some constant times n, in order to make alpha be a constant. And we want it to be big O of n in order to make the space linear. And the way we're going to do this, as was suggested, is to grow the table. We're going to start with m equals some constant. Pick your favorite constant. That's 20. My favorite constant's 7. Probably want it to be a power of two, but what the hell? And then we're going to grow and shrink as necessary.

This is a pretty obvious idea. The interesting part is to get it to work. And it's going to introduce a whole new concept, which is amortization. So it's going to be cool. Trust me. Not only are we going to solve this problem of how to choose m, we're also going to figure out how the Python data structure called list, also known as array, is implemented. It's exactly the same problem. I'll get to that in a moment.

So for example, let's say that we-- I said m should be Theta of n. Let's say we want m to be at least n at all times. So what happens? We start with m equals 8. And so, let's say we start with an empty hash table, an empty dictionary. And then I insert eight things. And then I go to insert the ninth thing. And I say, oh, now n is bigger than m. What should I do? So this would be like at the end of an insertion algorithm.
After I insert something, I say, oh, if n is greater than m, then I'm getting worried that n is getting too big relative to m. So I'd like to grow the table. OK? Let's take a little diversion to what "grow a table" means.

So maybe I have current size m and I'd like to go to a new size, m prime. This would actually work if you're growing or shrinking, so m could be bigger or smaller than m prime. What should I do-- what do I need to do in order to build a new table of this size? Easy warm up. Yeah?

AUDIENCE: Allocate the memory and then rehash [INAUDIBLE].

PROFESSOR: Yeah. Allocate the memory and rehash. So we have all these keys. They're stored with some hash function in here, in a table of size m. I need to build an entirely new table of size m prime, and then I need to rehash everything. One way to think of this is: for each item in the old table, insert into the new table, T prime. I think that's worth a cushion. You got one? You don't want to get hit. It's fine. We're not burning through these questions fast enough, so answer more questions.

OK. So how much time does this take? That's the main point of this exercise. Yeah?

AUDIENCE: Order n.

PROFESSOR: Order n. Yeah, I think as long as m and m prime are Theta of n, this is order n. In general, it's going to be n plus m plus m prime, but you're right-- most of the time, in the situation we're going to construct, this will be Theta of n. But in general, there's this issue that, for example, to iterate over every item in the table you have to look at every slot. And so you have to pay order m just to visit every slot, order n to visit all those lists, and order m prime just to build the new table of size m prime and initialize it all to nil. Good.
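As a sketch of that rebuild step, building on the chaining table above (the function name and the lambda for the new hash function are illustrative):

    def rebuild(old_slots, m_prime):
        """Grow or shrink a chaining hash table to m_prime slots.
        Cost: Theta(m) to scan the old slots, Theta(n) to rehash all items,
        Theta(m') to allocate and initialize the new table."""
        h_prime = lambda key: key % m_prime          # new hash function for the new size
        new_slots = [[] for _ in range(m_prime)]     # m' work: initialize every slot to empty
        for chain in old_slots:                      # m slots to visit
            for key, value in chain:                 # n items total across all chains
                new_slots[h_prime(key)].append((key, value))
        return new_slots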
I guess another main point here is that we have to build a new hash function. Why do we need to build a new hash function? Because the hash function-- why did I call it f prime? Calling it h prime. The hash function is all about mapping the universe of keys to a table of size m. So if m changes, we definitely need a new hash function. If you use the old hash function, you would just use the beginning of the table. If you add more slots down here, you're not going to use them. For every key you've got to rehash it, figure out where it goes. I think I've drilled that home enough times.

So the question becomes: when we see that our table is too small, we need to make it bigger. But how much bigger? Suggestions? Yeah?

AUDIENCE: 2x.

PROFESSOR: 2x. Twice m. Good suggestion. Any other suggestions? 3x? OK. m prime equals 2m is the correct answer. But for fun, or for pain I guess, let's think about the wrong answer, which would be: just make it one bigger. That'll make m equal to n again, so it's at least safe. It will maintain my invariant that m is at least n. I wrote this the wrong way-- sorry, that's the wrong way around. The trigger is n greater than m; I want m to be greater than or equal to n.

So if we just incremented our table size, then the question becomes, what is the cost of n insertions? So say we start with an empty table-- it has size eight or whatever, some constant-- and we insert n times. Then after eight insertions, when we insert we have to rebuild our entire table. That takes linear time. After we insert one more, we have to rebuild. That takes linear time. And so the cost is going to be something like, after you get to 8, it's going to be 1 plus 2 plus 3 plus 4, and so on-- a triangular number. Every time we insert, we have to rebuild everything. So this is quadratic. This is bad.
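Spelled out, the sum that triangular number comes from, starting from a constant-size table (say 8) and rebuilding on every insert after that:

    % Each insert past the initial constant size rebuilds the whole table:
    \Theta(8) + \Theta(9) + \Theta(10) + \cdots + \Theta(n)
      \;=\; \Theta\!\Big(\sum_{i=1}^{n} i\Big) \;=\; \Theta(n^2)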
Fortunately, if all we do is double m, we're golden. And this is sort of the point of why it's called table-- I called it table resizing there, to not give it away, but this is a technique called table doubling.

And let's just think of the cost of n insertions. There's also deletions, but if we just, again, start with an empty table and we repeatedly insert, then the cost we get-- if we double each time and we're inserting-- after we get to 8, we insert, we double to 16. Then we insert eight more times, then we double to 32. Then we insert 16 times, then we double to 64. All these numbers are roughly the same. They're within a factor of two of each other. Every time we're rebuilding in linear time, but we're only doing it like log n times. If we're going from 1 to n, there are log n doublings that we have to do. So you might think, oh, it's n log n. But we don't want n log n. That would be binary search trees. We want to do better than n log n.

If you think about the costs here, the cost to rebuild the first time is constant, like 8. And then the cost to rebuild the second time is 16, so twice that. The cost to rebuild the next time is 32, then 64. So these go up geometrically. You've got to get from 1 to n with log n steps. The natural way to do it is by doubling, and you can prove that indeed this is the case. So this is a geometric series-- didn't mean to cross it out there-- and so this is Theta of n.

Now, it's a little strange to be talking about Theta of n. This is a data structure that's supposed to be constant time per operation. This data structure is not constant time per operation. Even ignoring all the hashing business, all you're trying to do is grow a table, and it takes more than constant time for some operations. Near the end, when you have to rebuild the last time, you're restructuring the entire table. That takes linear time for one operation. You might say that's bad.
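To pin down the geometric series from a moment ago, for n inserts with doubling the rebuilds happen only at sizes 8, 16, 32, and so on up to about n:

    % Total rebuild cost under table doubling:
    \Theta(8 + 16 + 32 + \cdots + n) \;=\; \Theta(2n) \;=\; \Theta(n)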
But the comforting thing is that there are only a few operations-- log n of them-- that are really expensive. The rest are all constant time. You don't do anything; you just add into the table. So this is an idea we call amortization. Maybe I should write here-- we call this table doubling.

So the idea with amortization-- let me give you a definition. Actually, I'm going to be a little bit vague here and just say T of n. Let me see what it says in my notes. Yeah, I say T of n. So we're going to use a concept of-- usually we say the running time is T of n. And we started saying the expected running time is some T of n, plus alpha or whatever. Now, we're going to be able to say the amortized running time is T of n, or the running time is T of n amortized. That's what this is saying. And what that means is that it's not any statement about the individual running time of the operations. It's saying if you do a whole bunch of operations, k of them, then the total running time is, at most, k times T of n.

This is a way to amortize-- this is in the economic sense of amortize, I guess. You spread out the high costs so that it's cheap on average all the time. It's kind of like-- normally, we pay rent every month. But you could think of it instead as you're only paying $50 a day or something for your monthly rent. If you want to smooth things out, that would be a nice way to think about paying rent, or every second you're paying a penny or something. It's close, actually. Little bit off, factor of two. Anyway, so that's the idea.

So you can think of this as kind of like saying that the running time of an operation is T of n "on average"-- but put that in quotes; we don't usually use that terminology. Maybe put a tilde here. Where the average is taken over all the operations.
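Written as a formula (the t_i notation is just for this note, not from the board): an operation runs in T(n) amortized time when, for any sequence of k operations with actual costs t_1 through t_k,

    % total actual cost of any k operations is at most k * T(n)
    \sum_{i=1}^{k} t_i \;\le\; k \cdot T(n)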
So this is something that only makes sense for data structures. Data structures are things that have lots of operations on them over time. And instead of counting individual operation times and then adding them up, if you add them up and then divide by the number of operations, that's your amortized running time.

So the point is, in table doubling, the amortized running time is Theta of 1, because it's Theta of n in total-- at this point we've only analyzed insertions; we haven't talked about deletions. So k inserts, if we're just doing insertions, take Theta of k time in total. So this means constant amortized per insert.

OK, it's a simple idea, but a useful one, because typically-- unless you're in like a real-time system-- you typically only care about the overall running time of your algorithm, which might use a data structure as a subroutine. You don't care if individual operations are expensive as long as all the operations together are cheap. You're using hashing to solve some other problem, like counting duplicate words in doc dist. You just care about the running time of counting duplicate words. You don't care about how long each step of the for loop takes, just the aggregate. So this is good most of the time.

And we've proved it for insertions. It's also true when you have deletions. If you have k inserts and deletes, they certainly take order k time. Actually, this is easy to prove at this point because we haven't changed delete. What delete does is it just deletes something from the table and leaves the table the same size. And so it actually makes life better for us, because it decreases n, so in order to make n big again, you have to do more insertions than you would have had to before. And the only extra cost we're thinking about here is the growing-- the rebuild cost from n getting too big. And so this is still true. Deletions only help us. If you have k total inserts and deletes, it's still order k.
383 00:21:19,022 --> 00:21:20,355 So still get constant amortized. 384 00:21:23,956 --> 00:21:26,980 But this is not totally satisfying 385 00:21:26,980 --> 00:21:30,330 because of table might get big again. 386 00:21:30,330 --> 00:21:32,800 m might become much larger than n. 387 00:21:32,800 --> 00:21:35,710 For example, suppose I do n inserts 388 00:21:35,710 --> 00:21:37,820 and then I do n deletes. 389 00:21:37,820 --> 00:21:41,070 So now I have an empty table, n equals 0, 390 00:21:41,070 --> 00:21:44,990 but m is going to be around the original value of n, 391 00:21:44,990 --> 00:21:47,280 or the maximum value of n over time. 392 00:21:50,050 --> 00:21:54,710 So we can fix that. 393 00:21:54,710 --> 00:21:56,160 Suggestions on how to fix that? 394 00:22:00,860 --> 00:22:03,040 This is a little more subtle. 395 00:22:03,040 --> 00:22:04,550 There's two obvious answers. 396 00:22:04,550 --> 00:22:08,460 One is correct and the other is incorrect. 397 00:22:08,460 --> 00:22:09,025 Yeah? 398 00:22:09,025 --> 00:22:09,900 AUDIENCE: [INAUDIBLE] 399 00:22:14,220 --> 00:22:15,660 PROFESSOR: Good. 400 00:22:15,660 --> 00:22:23,440 So option one is if the table becomes half the size, 401 00:22:23,440 --> 00:22:30,980 then shrink-- to half the size? 402 00:22:30,980 --> 00:22:31,480 Sure. 403 00:22:37,390 --> 00:22:38,507 OK. 404 00:22:38,507 --> 00:22:39,590 That's on the right track. 405 00:22:39,590 --> 00:22:42,288 Anyone see a problem with that? 406 00:22:42,288 --> 00:22:43,240 Yeah? 407 00:22:43,240 --> 00:22:45,790 AUDIENCE: [INAUDIBLE] when you're going from like 8 to 9, 408 00:22:45,790 --> 00:22:47,623 you can go from 8 to 9, 9 to 8, [INAUDIBLE]. 409 00:22:47,623 --> 00:22:48,710 PROFESSOR: Good. 410 00:22:48,710 --> 00:22:57,150 So if you're sizing and say you have eight items in your table, 411 00:22:57,150 --> 00:23:01,390 you add a ninth item and so you double to 16. 412 00:23:01,390 --> 00:23:03,820 Then you delete that ninth item, you're back to eight. 413 00:23:03,820 --> 00:23:06,440 And then you say oh, now m equals n/2, 414 00:23:06,440 --> 00:23:08,620 so I'm going to shrink to half the size. 415 00:23:08,620 --> 00:23:10,960 And if I insert again-- delete, insert, delete, 416 00:23:10,960 --> 00:23:15,014 insert-- I spend linear time for every operation. 417 00:23:15,014 --> 00:23:15,930 So that's the problem. 418 00:23:18,810 --> 00:23:20,133 This is slow. 419 00:23:22,690 --> 00:23:27,830 If we go from 2 to the k to 2 to the k plus 1, 420 00:23:27,830 --> 00:23:31,950 we go this way via-- oh sorry, 2 to the k plus 1. 421 00:23:31,950 --> 00:23:36,010 Then, I said it right, insert to go to the right, 422 00:23:36,010 --> 00:23:37,560 delete to go to the left. 423 00:23:37,560 --> 00:23:39,630 Then we'll get linear time for operation. 424 00:23:44,550 --> 00:23:46,820 That is that. 425 00:23:46,820 --> 00:23:48,948 So, how do we fix this? 426 00:23:48,948 --> 00:23:50,310 Yeah. 427 00:23:50,310 --> 00:23:52,580 AUDIENCE: Maybe m equal m/3 or something? 428 00:23:52,580 --> 00:23:53,910 PROFESSOR: M equals n over 3. 429 00:23:53,910 --> 00:23:54,783 Yep. 430 00:23:54,783 --> 00:23:56,699 AUDIENCE: And then still leave it [INAUDIBLE]. 431 00:24:04,926 --> 00:24:05,819 PROFESSOR: Good. 432 00:24:05,819 --> 00:24:07,360 I'm going to do 4, if you don't mind. 433 00:24:07,360 --> 00:24:08,680 I'll keep it powers of 2. 434 00:24:08,680 --> 00:24:10,240 Any number bigger than 3 will work-- 435 00:24:10,240 --> 00:24:13,970 or any number bigger than 2 will work here. 
But it's kind of nice to stick to powers of two, just for fun. I mean, it doesn't really matter, because, as you say, we're still going to shrink to half the size, but we're only going to trigger it when we are 3/4 empty-- we're only using a quarter of the space. Then, it turns out, you can afford to shrink to half the size, because in order to need to grow again, you still have to insert another half-a-table's worth of items, because the new table is only half full. So when you're only a quarter full, you shrink so that you become half full, because then to grow again requires a lot of insertions.

I haven't proved anything here, but it turns out if you do this, the amortized time becomes constant. For k insertions and deletions, in an arbitrary combination, you'll maintain linear size because of these two rules-- because you're maintaining the invariant that m is between n and 4n. You maintain that invariant; that's easy to check. So you always have linear size. And the amortized running time becomes constant. We don't really have time to prove that in this class. It's a little bit tricky. Read the textbook if you want to know it.
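Here is a small sketch of the resulting policy, written as a resizable array rather than a hash table so the resizing logic stands on its own. The class name and the initial size 8 are illustrative: grow to 2m when full, shrink to m/2 only when down to m/4 items, which keeps m between n and 4n.

    class ResizableArray:
        """Sketch of table doubling with the shrink-at-a-quarter rule."""
        def __init__(self):
            self.m = 8                        # initial capacity: any constant works
            self.n = 0                        # number of items currently stored
            self.data = [None] * self.m

        def _resize(self, m_prime):
            # Theta(n + m + m') work: copy the items into a fresh table of size m'
            items = self.data[:self.n]
            self.m = m_prime
            self.data = items + [None] * (m_prime - self.n)

        def append(self, item):
            if self.n == self.m:              # full: double
                self._resize(2 * self.m)
            self.data[self.n] = item
            self.n += 1

        def pop(self):
            self.n -= 1
            item, self.data[self.n] = self.data[self.n], None
            if self.m > 8 and self.n <= self.m // 4:   # only a quarter full: halve
                self._resize(self.m // 2)
            return item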
That's table doubling. Questions? All right. Boring. No. It's cool, because not only can we solve the hashing problem of how we set m in order to keep alpha a constant, we can also solve Python lists. Python lists are also known as resizable arrays. You may have wondered how they work. Because they offer random access, we can go to the ith item in constant time and modify it or get the value. We can add a new item at the end in constant time-- that's append, list.append. And we can delete the last item in constant time. One version is list.pop. It's also del list[-1]. You should know that deleting the first item is not constant time.
That takes linear time, because what it does is copy all the values over. Python lists are implemented by arrays. But how do you support this dynamic-ness, where you can increase the length and decrease the length, and still keep linear space? Well, you do table doubling. And I don't know whether Python uses two or some other constant, but any constant will do, as long as the deletion constant is smaller than the insertion constant. And that's how they work. So in fact, list.append and list.pop are constant amortized. Before, we just said for simplicity they're constant time, and for the most part you can just think of them as constant time. But in reality, they are constant amortized.

Now for fun, just in case you're curious, you can do all of this stuff in constant worst-case time per operation. Maybe a fun exercise. Do you want to know how? Yeah? The rough idea is: when you realize that you're getting kind of full, you start building on the side a new table of twice the size. And every time you insert into the actual table, you move like five of the items over to the new table, or some constant-- it needs to be a big enough constant-- so that by the time you're full, you just switch over immediately to the other structure. It's kind of cool. It's very tricky to actually get that to work. But if you're in a real-time system, you might care to know that. For the most part, people don't implement those things because they're complicated, but it is possible to get rid of all these amortized bounds.

Cool. Let's move on to the next topic, which is more hashing related. This was sort of general data structures in order to implement hashing with chaining, but it didn't really care about hashing per se. We assumed here that we can evaluate the hash function in constant time and that we can do insertion in constant time, but that's the name of the game here.
But otherwise, we didn't really care-- as long as the rebuilding was linear time, this technique works.

Now we're going to look at a new problem that has lots of practical applications. I mentioned some of these problems in the last class: string matching. This is essentially the problem. How many people have used grep in their life? OK, most of you. How many people have used Find in a text editor? OK, the rest of you. And so these are the same sorts of problems. You want to search for a pattern, which is just going to be a substring, in some giant string which is your document, your file, if you will.

To state this formally: given two strings, s and t, you want to know, does s occur as a substring of t? So for example, maybe s is the string "6006" and t is your entire-- the mail that you've ever received in your life, or your inbox, or something. So t is big, typically, and s is small. It's what you type, usually. Maybe you're searching for all email from Piazza, so you put in the Piazza "From" string or whatever. You're searching for that in this giant thing and you'd like to do that quickly.

Another application: s is what you type into Google, and t is the entire web. That's what Google does. It searches for the string in the entire web. I'm not joking. OK? Fine. So we'd like to do that. What's the obvious way to search for a substring in a giant string? Yeah?

AUDIENCE: Check each substring of that length.

PROFESSOR: Just check each substring of the right length. So it's got to be the length of s. And there's only a linear number of them, so check each one. Let's analyze that. So, a simple algorithm-- actually, just for fun, I have pseudocode for it. I have Python code for it. Even more cool. OK. I don't know if you know all these Python features, but you should. They're super cool.
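The code being described is, up to variable names, something like this sketch (a reconstruction, not the slide itself):

    def naive_match(s, t):
        """Check every shift of s against t. Each comparison s == t[i:i+len(s)]
        walks the characters until a mismatch, so the total cost is O(|s| * |t|)."""
        return any(s == t[i:i + len(s)] for i in range(len(t) - len(s) + 1))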
This is string slicing. So we're looking in t-- let me draw the picture. Here we have s, here we have t. Think of it as a big string. We'd like to compare s like that, and then we'd like to compare s shifted over one, to see whether all of the characters match there. And then shifted over one more, and so on. And so we're looking at a substring of t from position i to position i plus the length of s, not including the last one. So that's of length exactly the length of s. This is s; this is t. And so each of these looks like that pattern. We compare s to t.

What this comparison operation does in Python is it checks the first characters to see if they're equal. If they are, keep going until they find a mismatch. If there's no mismatch, then you return true. Otherwise, you return false. And then we do this roughly length-of-t times, because that's how many shifts there are, except at the end we run out of room. We don't care if we shift beyond the right, because that's clearly not going to match. And so it's actually length of t minus length of s-- that's the number of iterations. Hopefully I got all the index arithmetic right, and there's no plus ones or minus ones. I think this is correct. We want to know whether any of these match. If so, the answer is yes, s occurs as a substring of t. Of course, in reality you want to know not just "do any match," but "show them to me," things like that. But you can change that. Same amount of time.

So what's the running time of this algorithm? My relevant things are the length of s and the length of t. What's the running time?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Sorry?

AUDIENCE: [INAUDIBLE]

PROFESSOR: t multiplied by s, yeah. Exactly. Technically, it's length of s times, length of t minus length of s. But typically, this is just s times t.
And it's always at most s times t, and it's usually the same thing, because s is usually smaller than t by at least a constant factor. This is kind of slow. If you're searching for a big string, it's not so great. I mean, certainly you need s plus t-- you've got to look at the strings. But s times t is kind of-- it could be quadratic, if you're searching for a really long string in another string. So what we'd like to do today is use hashing to get this down to linear time. So, ideas? How could we do that? Using hashing. Subtle hint. Yeah?

AUDIENCE: If we take something into account [INAUDIBLE].

PROFESSOR: OK, so you want to decompose your string into words and use the fact that there are fewer words than characters. You could probably get something out of that, and old search engines used to do that. But it's not necessary, it turns out. And it would also depend on what your average word length is. In the end, today, we're not going to analyze it fully, but we are going to get an algorithm that runs in this time guaranteed-- in expectation, because of a randomized-- yeah?

AUDIENCE: If we were to hash [INAUDIBLE] size s, that would [INAUDIBLE] and then we would check the hash [INAUDIBLE].

PROFESSOR: Good. So the idea is-- what we're looking at is a rolling window of t, always of size equal to the length of s. And at each time we want to know, is it the same as s? Now, it's expensive to check whether a string is equal to a string. There's no way of getting around that. Well, there are ways, but there isn't a way for just given two strings. But if somehow, instead of checking the strings, we could check a hash function of the strings-- because strings are big, potentially. We don't know how big s is. And so the universe of strings of length s is potentially very big. It's expensive to compare things.
If we could just hash it down to some reasonable size, to something that fits in a word, then we can compare whether those two words are equal-- whether those two hash values are equal, whether there's a collision in the table. That would make things go faster. We could do that in constant time per operation. How could we do that? That's the tricky part, but that is exactly the right idea.

So-- make some space. I think I'm going to do things a little out of order from what I have in my notes, and tell you about something called rolling hashes. And then we'll see how they're used. So shelve that idea. We're going to come back to it. We need a data structure to help us do this. Because if we just compute the hash function of this thing, compare it to the hash function of this thing, and then compute the hash function of the shifted value of t and compare it-- we don't have to recompute the hash of s; that's going to be free once you do it once. But computing the hash function of this, and then the hash function of this, and the hash function of this-- usually computing each of those hash functions would take length-of-s time. And so we're not saving any time. Somehow, if we have the hash function of this, the first substring of length s, we'd like to very quickly compute the hash function of the next substring, in constant time. Yeah?

AUDIENCE: You already have, like, s minus 1 of the characters of the--

PROFESSOR: Yeah. If you look at this window of t and the next window of t, they share s minus 1 of the characters. Just one character different. The first one gets deleted, the last character gets added. So here's what we want. Given a hash value-- maybe I should call this r. It's not the hash function; call it a rolling hash. You might say, I'd like to be able to append a character. I should say, r maintains a string. There's some string, let's call it x.
And what r.append(c) does is add character c to the end of x. And then we also want an operation which is-- you might call it popleft in Python. I'm going to call it skip. Shorter. Delete the first character of x, assuming it's c. So we can do this, because over here, what we want to do is add this character, which is like t of length of s, and we want to delete this character from the front, which is t of 0. Then we will get the next string. And at all times, r-- what's the point of this r? You can say r, open paren, close paren-- this will give you a hash value of the current string. So this is basically h of x for some hash function h, some reasonable hash function. If we could do this, and we could do each of these operations in constant time, then we can do string matching. Let me tell you how. This is called the Karp-Rabin string matching algorithm. And if it's not clear exactly what's allowed here, you'll see it as we use it.

First thing I'd like to do is compute the hash function of s. I only need to do that once, so I'll do it. In this data structure, the only thing you're allowed to do is add characters. Initially you have an empty string. And so for each character in s I'll just append it, and now rs gives me a hash value of s. OK? Now, I'd like to get started and compute the hash function of the first length-of-s characters of t. So this would be t up to length of s. And I'm going to call this thing rt-- that's my rolling hash for t-- and append those characters. So now rs is a rolling hash of s; rt is a rolling hash of the first length-of-s characters in t. So I should check whether they're equal. If they're not, shift over by one: add one character at the end, delete a character from the beginning. I'm going to have to do this many times. So I guess technically, I need to check whether these are equal first.
821 00:45:52,760 --> 00:46:02,570 In fact, the proper thing would be that you pay length of s plus length of t, 822 00:46:02,570 --> 00:46:07,370 and then you also pay-- for each match that you want to report, 823 00:46:07,370 --> 00:46:08,730 you pay length of s. 824 00:46:10,947 --> 00:46:13,030 I'm not sure whether you can get rid of that term. 825 00:46:13,030 --> 00:46:15,196 But in particular, if you just care about one match, 826 00:46:15,196 --> 00:46:16,080 this is linear time. 827 00:46:19,167 --> 00:46:20,250 It's pretty cool. 828 00:46:23,670 --> 00:46:25,280 There's one remaining question, which 829 00:46:25,280 --> 00:46:28,400 is how do you build this data structure? 830 00:46:28,400 --> 00:46:30,700 Is the algorithm clear though? 831 00:46:30,700 --> 00:46:32,270 I mean, I wrote it out in gory detail 832 00:46:32,270 --> 00:46:33,895 so you can really see what's happening, 833 00:46:33,895 --> 00:46:36,070 also because you need to do it in your problem set, 834 00:46:36,070 --> 00:46:39,660 so I give you as much code to work from as possible. 835 00:46:39,660 --> 00:46:40,800 Question? 836 00:46:40,800 --> 00:46:42,300 AUDIENCE: What is rs? 837 00:46:42,300 --> 00:46:49,100 PROFESSOR: rs is going to represent a hash value of s. 838 00:46:49,100 --> 00:46:50,770 You could just say h of s. 839 00:46:50,770 --> 00:46:54,090 But what I like to show is that all you need 840 00:46:54,090 --> 00:46:55,644 are these operations. 841 00:46:55,644 --> 00:46:57,060 And so given a data structure that 842 00:46:57,060 --> 00:47:01,620 will compute a hash function, given the append operation, 843 00:47:01,620 --> 00:47:06,330 what I did up here was just append every letter of s 844 00:47:06,330 --> 00:47:08,970 into this thing, and then rs open paren, 845 00:47:08,970 --> 00:47:11,673 close paren gives me the hash function of s. 846 00:47:11,673 --> 00:47:13,714 AUDIENCE: You said you can do r.append over here, 847 00:47:13,714 --> 00:47:15,690 but then you said rs-- 848 00:47:15,690 --> 00:47:16,580 PROFESSOR: Yeah. 849 00:47:16,580 --> 00:47:18,140 So there are two rolling hashes. 850 00:47:18,140 --> 00:47:22,500 One's called rs and one's called rt. 851 00:47:22,500 --> 00:47:26,040 This was an ADT, and I didn't say it at the beginning-- in line 852 00:47:26,040 --> 00:47:28,640 one I should say rs equals a new rolling hash, rt equals 853 00:47:28,640 --> 00:47:29,890 a new rolling hash. 854 00:47:29,890 --> 00:47:32,574 Sorry, I should bind my variables. 855 00:47:32,574 --> 00:47:33,990 So I'm using two of them because I 856 00:47:33,990 --> 00:47:36,700 want to compare their values, like this. 857 00:47:39,390 --> 00:47:42,090 Other questions? 858 00:47:42,090 --> 00:47:43,450 It's actually a pretty big idea. 859 00:47:43,450 --> 00:47:48,570 This is an algorithm from the late 1980s, so it's fairly recent. 860 00:47:51,769 --> 00:47:54,290 And it's one of the first examples 861 00:47:54,290 --> 00:47:58,120 of really using randomization in a super cool way, other 862 00:47:58,120 --> 00:48:00,070 than just hashing as a data structure. 863 00:48:04,240 --> 00:48:04,790 All right. 864 00:48:04,790 --> 00:48:08,400 So the remaining thing to do is figure out 865 00:48:08,400 --> 00:48:09,932 how to build this ADT. 866 00:48:09,932 --> 00:48:11,640 What's the data structure that implements 867 00:48:11,640 --> 00:48:15,015 this, spending constant time for each of these operations?
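To make the answer to that question concrete, here is a short hypothetical usage snippet, assuming the NaiveRollingHash sketch from earlier: two independent rolling hashes are bound up front, and the matching loop just compares their current values.

rs = NaiveRollingHash()       # one rolling hash for the pattern s
rt = NaiveRollingHash()       # a second, independent one for the sliding window of t
for c in "abc":
    rs.append(c)
for c in "xab":
    rt.append(c)
print(rs() == rt())           # False: the two objects currently hold different strings
rt.skip("x")
rt.append("c")                # rt's string is now "abc", the same as rs's
print(rs() == rt())           # True here; in general, equal hashes only suggest a match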
868 00:48:24,370 --> 00:48:25,820 Now, to tell you the truth, doing 869 00:48:25,820 --> 00:48:28,750 it depends on which hashing method you use, which hash 870 00:48:28,750 --> 00:48:30,916 function you want to use. 871 00:48:30,916 --> 00:48:32,540 I just erased the multiplication method 872 00:48:32,540 --> 00:48:34,748 because it's a pain to use the multiplication method. 873 00:48:40,360 --> 00:48:42,900 Though I'll bet you could use it, actually. 874 00:48:42,900 --> 00:48:45,251 That's an exercise for you to think about. 875 00:48:45,251 --> 00:48:46,750 I'm going to use the division method 876 00:48:46,750 --> 00:48:48,660 because it's the simplest hash function. 877 00:48:48,660 --> 00:48:50,790 And it turns out, in this setting it does work. 878 00:48:50,790 --> 00:48:53,760 We're not going to prove that this is true. 879 00:48:53,760 --> 00:48:56,860 This is going to be true in expectation. 880 00:48:56,860 --> 00:48:57,635 Expected time. 881 00:49:02,110 --> 00:49:06,270 But Karp and Rabin proved that this running time 882 00:49:06,270 --> 00:49:09,040 holds, even if you just use a simple hash 883 00:49:09,040 --> 00:49:11,570 function, the division method, where 884 00:49:11,570 --> 00:49:13,755 m is chosen to be a random prime. 885 00:49:18,800 --> 00:49:22,170 Let's say about as big as-- let's say at least as 886 00:49:22,170 --> 00:49:26,050 big as the length of s. 887 00:49:26,050 --> 00:49:28,340 The bigger you make it, the higher probability this 888 00:49:28,340 --> 00:49:29,640 is going to be true. 889 00:49:29,640 --> 00:49:34,484 But length of s will give you this on average. 890 00:49:34,484 --> 00:49:36,400 So we're not going to talk about in this class 891 00:49:36,400 --> 00:49:39,620 how to find a random prime, but the algorithm 892 00:49:39,620 --> 00:49:42,650 is choose a random number of about the right size 893 00:49:42,650 --> 00:49:44,030 and check whether it's prime. 894 00:49:44,030 --> 00:49:46,030 If it's not, do it again. 895 00:49:46,030 --> 00:49:50,100 And by the prime number theorem, after about log n trials 896 00:49:50,100 --> 00:49:51,779 you will find a prime. 897 00:49:51,779 --> 00:49:53,320 And we're not going to talk about how 898 00:49:53,320 --> 00:49:57,870 to check whether a number's prime, but it can be done. 899 00:49:57,870 --> 00:49:58,480 All right. 900 00:49:58,480 --> 00:50:02,220 So we're basically done. 901 00:50:02,220 --> 00:50:10,630 The point is to look at-- if you look at an append operation 902 00:50:10,630 --> 00:50:15,380 and you think about how this hash function changes 903 00:50:15,380 --> 00:50:17,350 when you add a single character. 904 00:50:17,350 --> 00:50:20,470 Oh, I should tell you. 905 00:50:20,470 --> 00:50:25,950 We're going to treat the string x as a multi-digit number. 906 00:50:29,650 --> 00:50:31,220 This is the sort of prehash function. 907 00:50:36,480 --> 00:50:39,365 And the base is the size of your alphabet. 908 00:50:42,750 --> 00:50:45,920 So if you're using ASCII, it's 256. 909 00:50:45,920 --> 00:50:48,860 If you're using Unicode, it might be larger. 910 00:50:48,860 --> 00:50:52,750 But whatever the size of your characters in your string, 911 00:50:52,750 --> 00:50:56,950 then when I add a character, this is like taking my number, 912 00:50:56,950 --> 00:51:00,660 shifting it over by one, and then adding a new value. 913 00:51:00,660 --> 00:51:02,390 So how do I shift over by one? 914 00:51:02,390 --> 00:51:04,630 I multiply by a.
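The prime-finding procedure just described is not covered in this class, but a sketch of it might look like the following. The Miller-Rabin test used here is a standard probabilistic primality check; the function names are illustrative, not from the lecture.

import random

def is_probably_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:            # write n - 1 = d * 2**r with d odd
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False         # a witnesses that n is composite
    return True

def random_prime(lower):
    """Pick random candidates of about the right size until one tests prime.
    By the prime number theorem this takes O(log lower) tries on average."""
    while True:
        m = random.randrange(lower, 2 * lower)
        if is_probably_prime(m):
            return m

# e.g. m = random_prime(len(s)), a prime at least as big as the pattern length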
915 00:51:04,630 --> 00:51:10,330 So if I have some value, some current hash value u, 916 00:51:10,330 --> 00:51:13,440 it changes to u times a-- or sorry, 917 00:51:13,440 --> 00:51:17,890 this is the number represented by the string. 918 00:51:17,890 --> 00:51:20,460 I multiply by a and then I add on the character. 919 00:51:20,460 --> 00:51:23,620 Or, in Python you'd write ord of the character. 920 00:51:23,620 --> 00:51:27,860 That's the number associated with that character. 921 00:51:27,860 --> 00:51:29,160 That gives me the new string. 922 00:51:29,160 --> 00:51:29,770 Very easy. 923 00:51:29,770 --> 00:51:33,720 If what I want to do is skip, it's slightly more annoying. 924 00:51:33,720 --> 00:51:37,290 But skip means just annihilate this value. 925 00:51:37,290 --> 00:51:45,310 And so it's like u goes to u minus the character times a 926 00:51:45,310 --> 00:51:48,980 to the power size of u minus 1. 927 00:51:48,980 --> 00:51:52,080 I have to shift this character over to that position 928 00:51:52,080 --> 00:51:53,830 and then annihilate it with a minus sign. 929 00:51:53,830 --> 00:51:56,340 You could also do XOR. 930 00:51:56,340 --> 00:51:58,820 And when I do this, I just think about how 931 00:51:58,820 --> 00:52:00,250 the hash function is changing. 932 00:52:00,250 --> 00:52:02,540 Everything is just modulo m. 933 00:52:02,540 --> 00:52:05,370 So if I have some hash value here, r, 934 00:52:05,370 --> 00:52:10,000 I take r times a plus ord of c and I just 935 00:52:10,000 --> 00:52:13,210 do that computation modulo m, and I'll 936 00:52:13,210 --> 00:52:15,140 get the new hash value. 937 00:52:15,140 --> 00:52:18,630 Do the same thing down here, I'll get the new hash value. 938 00:52:18,630 --> 00:52:22,730 So what r stores is the current hash value. 939 00:52:22,730 --> 00:52:27,810 And it stores a to the power length of u or length 940 00:52:27,810 --> 00:52:30,200 of x, whatever you want to call it. 941 00:52:30,200 --> 00:52:33,606 I guess that would be a little better. 942 00:52:33,606 --> 00:52:35,480 And then it can do these in a constant number 943 00:52:35,480 --> 00:52:36,280 of operations. 944 00:52:36,280 --> 00:52:37,955 Just compute everything modulo m, 945 00:52:37,955 --> 00:52:40,124 one multiplication, one addition. 946 00:52:40,124 --> 00:52:41,790 You can do append and skip, and then you 947 00:52:41,790 --> 00:52:43,560 have the hash value instantly. 948 00:52:43,560 --> 00:52:44,820 It's just stored. 949 00:52:44,820 --> 00:52:47,330 And then you can make all this work.
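Concretely, those two update rules give a constant-time rolling hash under the division method. This is only a sketch, not the problem-set solution: the default modulus is a fixed prime for simplicity, whereas the analysis above wants a random prime (for example, one produced by the random_prime sketch earlier), and the modular inverse via pow(a, -1, m) assumes Python 3.8 or later.

class RollingHash:
    """Division-method rolling hash: the string x is treated as a number
    in base a, and we store that number modulo m.  append and skip each
    take a constant number of arithmetic operations."""

    def __init__(self, a=256, m=1000000007):
        self.a = a                       # base = alphabet size (256 for ASCII bytes)
        self.m = m                       # modulus; Karp-Rabin wants a random prime here
        self.inv_a = pow(a, -1, m)       # a^{-1} mod m; fine since m is prime and a < m
        self.r = 0                       # current hash value: (number for x) mod m
        self.magic = 1                   # a**len(x) mod m, maintained as x grows/shrinks

    def __call__(self):
        return self.r                    # the hash value is just stored, O(1)

    def append(self, c):
        # shift the number left one digit (multiply by a) and add the new character
        self.r = (self.r * self.a + ord(c)) % self.m
        self.magic = (self.magic * self.a) % self.m

    def skip(self, c):
        # annihilate the leading digit: subtract ord(c) * a**(len(x) - 1)
        self.magic = (self.magic * self.inv_a) % self.m   # now a**(len(x) - 1)
        self.r = (self.r - ord(c) * self.magic) % self.m

With this class plugged into the karp_rabin sketch above, append and skip are each a couple of multiplications and additions modulo m, so the whole matching loop runs in the expected linear time claimed. Tracking a**(len(x) - 1) directly would avoid the modular inverse, at the cost of slightly messier bookkeeping for the empty string.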