1 00:00:00,050 --> 00:00:01,770 The following content is provided 2 00:00:01,770 --> 00:00:04,010 under a Creative Commons license. 3 00:00:04,010 --> 00:00:06,860 Your support will help MIT OpenCourseWare continue 4 00:00:06,860 --> 00:00:10,720 to offer high quality educational resources for free. 5 00:00:10,720 --> 00:00:13,330 To make a donation or view additional materials 6 00:00:13,330 --> 00:00:17,207 from hundreds of MIT courses, visit MIT OpenCourseWare 7 00:00:17,207 --> 00:00:17,832 at ocw.mit.edu. 8 00:00:21,835 --> 00:00:22,710 PROFESSOR: All right. 9 00:00:22,710 --> 00:00:24,980 Let's get started. 10 00:00:24,980 --> 00:00:27,730 Today we start a brand new section of 006, 11 00:00:27,730 --> 00:00:29,620 which is hashing. 12 00:00:29,620 --> 00:00:30,430 Hashing is cool. 13 00:00:30,430 --> 00:00:34,230 It is probably the most used and common and important 14 00:00:34,230 --> 00:00:36,495 data structure in all of computer science. 15 00:00:36,495 --> 00:00:41,860 It's in, basically, every system you've ever used, I think. 16 00:00:41,860 --> 00:00:44,220 And in particular, it's in Python 17 00:00:44,220 --> 00:00:46,326 as part of what makes Python fun to program in. 18 00:00:46,326 --> 00:00:49,740 And basically, every modern programming language has it. 19 00:00:49,740 --> 00:00:53,610 So today is about how to make it actually happen. 20 00:00:53,610 --> 00:00:54,660 So what is it? 21 00:00:57,230 --> 00:00:58,690 It is usually called a dictionary. 22 00:01:01,960 --> 00:01:04,280 So this is an abstract data type, if you 23 00:01:04,280 --> 00:01:07,245 remember that term from a couple lectures ago. 24 00:01:13,080 --> 00:01:15,450 It's kind of an old term, not so common anymore, 25 00:01:15,450 --> 00:01:18,340 but it's useful to think about.
26 00:01:18,340 --> 00:01:22,900 So a dictionary is a data structure, 27 00:01:22,900 --> 00:01:26,900 or it's a thing, that can store items, 28 00:01:26,900 --> 00:01:29,395 and it can insert items, delete items and search for items. 29 00:01:35,180 --> 00:01:40,230 So in general, it's going to be a set of items, 30 00:01:40,230 --> 00:01:41,360 each item has a key. 31 00:01:47,720 --> 00:01:56,080 And you can insert an item, you can delete an item 32 00:01:56,080 --> 00:02:06,720 from the set, and you can search for a key, not an item. 33 00:02:06,720 --> 00:02:09,143 And the interesting part is the search. 34 00:02:09,143 --> 00:02:10,934 I think you know what insert and delete do. 35 00:02:20,360 --> 00:02:23,015 So there are two outcomes to this kind of search. 36 00:02:23,015 --> 00:02:25,270 This is what I call an exact search. 37 00:02:25,270 --> 00:02:29,440 Either you find an item with a given key, or there isn't one, 38 00:02:29,440 --> 00:02:32,535 and then you just say key error in Python. 39 00:02:41,950 --> 00:02:42,450 OK. 40 00:02:42,450 --> 00:02:44,149 This is a little different from what 41 00:02:44,149 --> 00:02:45,690 we could do with binary search trees. 42 00:02:45,690 --> 00:02:47,690 Binary search trees, if we didn't find a key, 43 00:02:47,690 --> 00:02:50,900 we could find the next larger and the next smaller 44 00:02:50,900 --> 00:02:52,661 successor and predecessor. 45 00:02:52,661 --> 00:02:54,660 With dictionaries you're not allowed to do that, 46 00:02:54,660 --> 00:02:56,150 or you're not able to do that. 47 00:02:56,150 --> 00:02:57,900 And you're just interested in the question 48 00:02:57,900 --> 00:02:58,930 does the key exist? 49 00:02:58,930 --> 00:03:02,090 And if so, give me the item with that key. 50 00:03:02,090 --> 00:03:04,990 So we're assuming here that the items have unique keys, 51 00:03:04,990 --> 00:03:07,362 no two items have the same key. 
52 00:03:07,362 --> 00:03:08,820 And one way to enforce that is when 53 00:03:08,820 --> 00:03:11,430 you insert an item with an existing key, 54 00:03:11,430 --> 00:03:13,140 it overwrites whatever key was there. 55 00:03:13,140 --> 00:03:15,070 That's the Python behavior. 56 00:03:15,070 --> 00:03:18,550 So we'll assume that. 57 00:03:18,550 --> 00:03:25,650 Overwrite any existing key. 58 00:03:31,730 --> 00:03:34,480 And so, it's well defined what search does. 59 00:03:34,480 --> 00:03:36,116 Either there's one item with that key, 60 00:03:36,116 --> 00:03:37,490 or there's no item with that key, 61 00:03:37,490 --> 00:03:41,260 and it tells you what the situation is. 62 00:03:41,260 --> 00:03:41,760 OK. 63 00:03:41,760 --> 00:03:47,710 So one way to solve dictionaries is 64 00:03:47,710 --> 00:03:51,150 to use a balanced binary search tree like AVL trees. 65 00:03:51,150 --> 00:03:54,710 And so you can do all of these operations on log n time. 66 00:04:01,720 --> 00:04:04,220 I mean, you can ignore the fact that AVL trees give you 67 00:04:04,220 --> 00:04:06,200 more information when you do a search, 68 00:04:06,200 --> 00:04:08,540 and still does exact search. 69 00:04:08,540 --> 00:04:12,690 So that's one solution, but it turns out you can do better. 70 00:04:12,690 --> 00:04:16,110 And while last class was about, well, in the comparison model 71 00:04:16,110 --> 00:04:20,120 the best way to sort is n log n and the best way to search 72 00:04:20,120 --> 00:04:21,570 is log n. 73 00:04:21,570 --> 00:04:23,870 Then we saw in the RAM model, where 74 00:04:23,870 --> 00:04:27,600 if you assume your items are integers we can sort faster, 75 00:04:27,600 --> 00:04:29,440 sometimes we can sort in linear time. 76 00:04:29,440 --> 00:04:33,710 Today's lecture is about how to search faster than log n time. 77 00:04:33,710 --> 00:04:37,680 And we're going to get down to constant time. 
78 00:04:37,680 --> 00:04:41,020 No-- basically, no assumptions except, maybe, 79 00:04:41,020 --> 00:04:43,110 that your keys are integers. 80 00:04:43,110 --> 00:04:45,449 We'll be able to get down to constant time 81 00:04:45,449 --> 00:04:46,365 with high probability. 82 00:04:48,890 --> 00:04:51,030 It's going to be a randomized data structure. 83 00:04:51,030 --> 00:04:53,490 It's one of the few instances of randomization in 006, 84 00:04:53,490 --> 00:04:56,260 but it'll be pretty simple to analyze, so don't worry. 85 00:04:56,260 --> 00:04:59,332 But we're going to use some probability today. 86 00:04:59,332 --> 00:05:00,415 Make it a little exciting. 87 00:05:03,290 --> 00:05:05,470 I think you know how dictionaries work in Python. 88 00:05:05,470 --> 00:05:11,810 In Python it's the dict data type. 89 00:05:11,810 --> 00:05:14,600 We've used it all over the place. 90 00:05:14,600 --> 00:05:17,760 The key things you can do are lookup a key 91 00:05:17,760 --> 00:05:24,100 and-- so this is the analog of search-- 92 00:05:24,100 --> 00:05:27,970 you can set a key to a value. 93 00:05:27,970 --> 00:05:30,960 This is the analog of an insert. 94 00:05:30,960 --> 00:05:33,170 It overwrites whatever was there. 95 00:05:33,170 --> 00:05:33,810 And what else? 96 00:05:33,810 --> 00:05:34,610 Delete. 97 00:05:34,610 --> 00:05:38,130 So you can delete a particular key. 98 00:05:42,114 --> 00:05:42,760 OK. 99 00:05:42,760 --> 00:05:44,593 We'll usually use this notation because it's 100 00:05:44,593 --> 00:05:46,340 more familiar and intuitive. 101 00:05:46,340 --> 00:05:48,690 But the big topic today is how do you actually 102 00:05:48,690 --> 00:05:53,070 implement these operations for a dictionary, D? 103 00:05:53,070 --> 00:05:56,360 The one specific thing about Python dictionaries 104 00:05:56,360 --> 00:06:01,465 is that an item is basically a pair 105 00:06:01,465 --> 00:06:05,380 of two things, a key and a value. 
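The three operations just listed can be sketched with Python's built-in dict. This is only an illustration: the variable names and sample values are made up for the example and do not come from the lecture.

```python
D = {}                      # empty dictionary

D["algorithms"] = 6006      # insert: set a key to a value
D["algorithms"] = 6046      # inserting an existing key overwrites the old value

value = D["algorithms"]     # search: look up a key (exact match only)

try:
    D["geometry"]           # searching for an absent key raises KeyError
    missing = False
except KeyError:
    missing = True

del D["algorithms"]         # delete: remove the item with this key

# Each item is a (key, value) pair; the key is the first component.
pairs = sorted({"a": 1, "b": 2}.items())
```

Note that the search is exact: there is no way to ask the dict for the next larger or next smaller key, which is the difference from a binary search tree mentioned earlier.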
106 00:06:05,380 --> 00:06:07,410 And so, in particular, when you call d.items 107 00:06:07,410 --> 00:06:11,280 you get a whole bunch of ordered pairs, a key and a value. 108 00:06:11,280 --> 00:06:13,220 And so the key is always-- the key of an item 109 00:06:13,220 --> 00:06:15,152 is always this first part. 110 00:06:15,152 --> 00:06:16,135 So it's well defined. 111 00:06:20,035 --> 00:06:20,535 OK. 112 00:06:23,070 --> 00:06:28,120 So that's Python dictionaries. 113 00:06:28,120 --> 00:06:32,530 So one obvious motivation for building dictionaries 114 00:06:32,530 --> 00:06:34,980 is you need them in Python. 115 00:06:34,980 --> 00:06:37,380 And in fact, people use them all the time. 116 00:06:37,380 --> 00:06:39,830 We used them in docdist. 117 00:06:39,830 --> 00:06:43,890 All of the fastest versions of the document distance problem 118 00:06:43,890 --> 00:06:48,080 used dictionaries for counting words, how many times each word 119 00:06:48,080 --> 00:06:51,470 occurs in a document, and for computing inner products, 120 00:06:51,470 --> 00:06:54,640 for finding common words between two documents. 121 00:06:54,640 --> 00:06:57,035 And it's just it's the best way to do things, 122 00:06:57,035 --> 00:07:00,467 it's the easiest way to do things , and the fastest. 123 00:07:00,467 --> 00:07:02,800 As a result, dictionaries are built into basically every 124 00:07:02,800 --> 00:07:06,980 modern programming language, Python, Perl, Ruby, JavaScript, 125 00:07:06,980 --> 00:07:08,110 Java, C++, C#. 126 00:07:08,110 --> 00:07:10,970 In modern versions, all have some version of dictionaries. 127 00:07:10,970 --> 00:07:13,790 And they all run in, basically, constant time 128 00:07:13,790 --> 00:07:16,615 using the stuff that's in this lecture and the next two 129 00:07:16,615 --> 00:07:17,115 lectures. 130 00:07:20,130 --> 00:07:21,300 Let's see. 131 00:07:21,300 --> 00:07:24,085 It's also, in, basically, every database. 
132 00:07:26,894 --> 00:07:29,310 There are essentially two kinds of databases in the world, 133 00:07:29,310 --> 00:07:30,684 there are those that use hashing, 134 00:07:30,684 --> 00:07:32,800 and there are those that use search trees. 135 00:07:32,800 --> 00:07:33,760 Sometimes you need one. 136 00:07:33,760 --> 00:07:35,095 Sometimes you need the other. 137 00:07:35,095 --> 00:07:37,470 There are a lot of situations in databases where you just 138 00:07:37,470 --> 00:07:39,082 need hashing. 139 00:07:39,082 --> 00:07:40,540 So if you've ever used Berkeley DB, 140 00:07:40,540 --> 00:07:44,450 there's a hash type of a database. 141 00:07:44,450 --> 00:07:48,460 So things like, when you go to Merriam-Webster, 142 00:07:48,460 --> 00:07:51,200 and you look up a word, how do you 143 00:07:51,200 --> 00:07:53,860 find the definition of that word? 144 00:07:53,860 --> 00:07:58,090 You use a hash table, you use a dictionary, I should say. 145 00:07:58,090 --> 00:08:02,100 How do you-- when you spell check your document, 146 00:08:02,100 --> 00:08:04,360 how do you tell whether a word is correctly spelled? 147 00:08:04,360 --> 00:08:05,794 You look it up in a dictionary. 148 00:08:05,794 --> 00:08:07,210 If it's not correctly spelled, how 149 00:08:07,210 --> 00:08:11,520 do you find the closest related, correct spelling? 150 00:08:11,520 --> 00:08:12,895 You try tweaking one of the letters, 151 00:08:12,895 --> 00:08:15,103 and look it up in a dictionary and see if it's there. 152 00:08:15,103 --> 00:08:17,600 You do that for all possible letters, or maybe two letters. 153 00:08:17,600 --> 00:08:21,899 That is a state of the art way to do spelling correction. 154 00:08:21,899 --> 00:08:23,440 Just keep looking up in a dictionary. 155 00:08:23,440 --> 00:08:25,446 Because dictionaries are so fast you 156 00:08:25,446 --> 00:08:27,945 can afford to do things like trial perturbations of letters. 157 00:08:30,820 --> 00:08:32,039 What else.
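The spelling-correction idea above (tweak one letter, look the result up, repeat for every position and letter) can be sketched as follows. This is a toy under stated assumptions: the function names, the lowercase-only alphabet, and the four-word vocabulary are all invented for the example, and only single-letter substitutions are tried.

```python
import string

def one_letter_tweaks(word):
    """Yield every word obtained by replacing exactly one letter."""
    for i in range(len(word)):
        for c in string.ascii_lowercase:
            if c != word[i]:
                yield word[:i] + c + word[i + 1:]

def suggest(word, vocabulary):
    """Return the word itself if it is known, else all known one-letter tweaks."""
    if word in vocabulary:                       # one constant-time lookup
        return [word]
    return sorted({w for w in one_letter_tweaks(word) if w in vocabulary})

vocabulary = {"hash", "hush", "dash", "table"}   # toy word list (a Python set)
suggestions = suggest("hesh", vocabulary)
```

For a word of length n this performs about 25n lookups, which is affordable only because each lookup is expected to take constant time.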
158 00:08:32,039 --> 00:08:34,770 In the old days, which means pre-Google, 159 00:08:34,770 --> 00:08:38,030 every search engine on the web would 160 00:08:38,030 --> 00:08:41,260 have a dictionary that says, for given word, 161 00:08:41,260 --> 00:08:44,120 give me all of the documents containing that word. 162 00:08:44,120 --> 00:08:48,760 Google doesn't do it that way, but that's another story. 163 00:08:48,760 --> 00:08:50,870 It's less fancy, actually. 164 00:08:50,870 --> 00:08:52,960 Or when you log into a system, you 165 00:08:52,960 --> 00:08:54,940 type your username and password. 166 00:08:54,940 --> 00:08:57,762 You look in a dictionary that stores a username 167 00:08:57,762 --> 00:08:59,220 and, associated with that username, 168 00:08:59,220 --> 00:09:00,957 all the information of that user. 169 00:09:00,957 --> 00:09:03,040 Every time you log into a web system, or whatever, 170 00:09:03,040 --> 00:09:05,440 it is going through a dictionary. 171 00:09:05,440 --> 00:09:07,520 So they're all over the place. 172 00:09:07,520 --> 00:09:09,330 One of the original applications is 173 00:09:09,330 --> 00:09:11,212 in writing programming languages. 174 00:09:11,212 --> 00:09:12,670 Some of the first computer programs 175 00:09:12,670 --> 00:09:15,076 were programming languages, so you could actually 176 00:09:15,076 --> 00:09:16,450 program them in a reasonable way. 177 00:09:21,860 --> 00:09:25,497 Whenever you type a variable name the computer doesn't 178 00:09:25,497 --> 00:09:27,080 really think about that variable name, 179 00:09:27,080 --> 00:09:29,500 it wants to think about an address in memory. 180 00:09:29,500 --> 00:09:31,820 And so you've got to translate that variable name 181 00:09:31,820 --> 00:09:36,340 into a real, physical address in the machine, or a position 182 00:09:36,340 --> 00:09:39,950 on the stack, or whatever it is in real life. 
183 00:09:39,950 --> 00:09:41,950 In the old days of Python, I guess 184 00:09:41,950 --> 00:09:45,550 this is pre-Python 2 or so, 2.1, I 185 00:09:45,550 --> 00:09:48,390 don't remember the exact transition it was. 186 00:09:48,390 --> 00:09:50,759 In the interpreter, there was the dictionary 187 00:09:50,759 --> 00:09:52,300 of all your global variables, there's 188 00:09:52,300 --> 00:09:54,420 a dictionary of all your local variables. 189 00:09:54,420 --> 00:09:58,686 And that was-- it was right there. 190 00:09:58,686 --> 00:10:00,310 I mean you could modify the dictionary, 191 00:10:00,310 --> 00:10:01,393 you could do crazy things. 192 00:10:01,393 --> 00:10:03,520 And all the variables were there. 193 00:10:03,520 --> 00:10:06,050 And so they'd match the key to the actual value 194 00:10:06,050 --> 00:10:07,520 stored in the variable. 195 00:10:07,520 --> 00:10:09,920 They don't do that anymore because it's a little slow, 196 00:10:09,920 --> 00:10:12,019 but-- and you could do better in practice. 197 00:10:12,019 --> 00:10:14,310 But at the very least, when you're compiling the thing, 198 00:10:14,310 --> 00:10:16,070 you need a dictionary. 199 00:10:16,070 --> 00:10:20,120 And then, later on, you can do more efficient lookups. 200 00:10:20,120 --> 00:10:20,660 Let's see. 201 00:10:23,580 --> 00:10:26,490 On the internet there are hash tables all over, 202 00:10:26,490 --> 00:10:28,730 like in your router. 203 00:10:28,730 --> 00:10:30,480 Router needs to know all the machines that 204 00:10:30,480 --> 00:10:31,313 are connected to it. 205 00:10:31,313 --> 00:10:33,875 Each machine has an IP address, so when you get a packet in, 206 00:10:33,875 --> 00:10:36,040 and it says, deliver to this IP address, you see, 207 00:10:36,040 --> 00:10:38,060 oh, is it in my dictionary of all the machines 208 00:10:38,060 --> 00:10:39,476 that are directly connected to me? 209 00:10:39,476 --> 00:10:40,949 If so, send it there. 
210 00:10:40,949 --> 00:10:42,990 If it's not then it has to find the right subnet. 211 00:10:42,990 --> 00:10:44,573 That's not quite a dictionary problem, 212 00:10:44,573 --> 00:10:45,820 a little more complicated. 213 00:10:45,820 --> 00:10:49,310 But for looking up local machines, it's a dictionary. 214 00:10:49,310 --> 00:10:51,930 Routers use dictionaries because they need to go really fast. 215 00:10:51,930 --> 00:10:55,450 They're getting a billion packets every second. 216 00:10:55,450 --> 00:10:59,190 Also, in the network stack of a machine, 217 00:10:59,190 --> 00:11:01,940 when you get a packet delivered 218 00:11:01,940 --> 00:11:04,980 to a particular port, you need to say, oh, which application, 219 00:11:04,980 --> 00:11:06,880 or which socket is connected to this port? 220 00:11:06,880 --> 00:11:08,630 All of these things are dictionaries. 221 00:11:08,630 --> 00:11:10,088 The point is they're in, basically, 222 00:11:10,088 --> 00:11:12,690 everything you've ever used, virtual memory, 223 00:11:12,690 --> 00:11:15,060 I mean, they're all over the place. 224 00:11:15,060 --> 00:11:16,935 There are also some more subtle applications, 225 00:11:16,935 --> 00:11:18,893 where it is not obvious that it's a dictionary, 226 00:11:18,893 --> 00:11:20,650 but still, we use this idea of hashing 227 00:11:20,650 --> 00:11:22,230 we're going to talk about today. 228 00:11:22,230 --> 00:11:25,700 Like searching in a string. 229 00:11:30,350 --> 00:11:34,810 So when you hit-- I don't know-- in your favorite editor, 230 00:11:34,810 --> 00:11:36,850 you do Control-F, or Control-S, or slash, 231 00:11:36,850 --> 00:11:39,000 or whatever your way of searching for something 232 00:11:39,000 --> 00:11:41,530 is, and you start typing. 233 00:11:41,530 --> 00:11:43,930 If your editor is clever, it will 234 00:11:43,930 --> 00:11:46,260 use hashing in order to search for that string. 235 00:11:46,260 --> 00:11:49,770 It's a faster way to do it.
236 00:11:49,770 --> 00:11:54,410 If you use grep, for example, in Unix it does it in a fancy way. 237 00:11:54,410 --> 00:11:56,240 Every time you do a Google search 238 00:11:56,240 --> 00:11:58,150 it's essentially using this. 239 00:11:58,150 --> 00:11:59,310 It's solving this problem. 240 00:11:59,310 --> 00:12:01,330 I don't know what algorithm, but we could guess. 241 00:12:01,330 --> 00:12:04,090 Using the algorithms we'll cover in next lecture. 242 00:12:04,090 --> 00:12:06,650 It wouldn't surprise me. 243 00:12:06,650 --> 00:12:08,660 Also, if you have a couple strings 244 00:12:08,660 --> 00:12:14,540 and you want to know what they have in common, how similar 245 00:12:14,540 --> 00:12:15,384 they are? 246 00:12:15,384 --> 00:12:16,800 Example, you have two DNA strings. 247 00:12:16,800 --> 00:12:20,480 You want to see how similar they are, you use hashing. 248 00:12:20,480 --> 00:12:23,830 And you're going to do that in the next problem set, PS4, 249 00:12:23,830 --> 00:12:27,000 which goes out on Thursday. 250 00:12:27,000 --> 00:12:31,990 Also, for things like file and directory synchronization. 251 00:12:38,870 --> 00:12:42,770 So on Unix, if you rsync or unison, or, I guess, 252 00:12:42,770 --> 00:12:46,740 modern day-- these days, Dropbox, MIT 253 00:12:46,740 --> 00:12:49,580 startup-- Whenever you're synchronizing files between two 254 00:12:49,580 --> 00:12:51,010 locations, you use hashing to tell 255 00:12:51,010 --> 00:12:53,260 whether a file has changed, or whether a directory has 256 00:12:53,260 --> 00:12:53,920 changed. 257 00:12:53,920 --> 00:12:56,940 That's a big idea. 258 00:12:56,940 --> 00:12:59,940 Fairly modern idea. 259 00:12:59,940 --> 00:13:02,210 And also in cryptography-- this will 260 00:13:02,210 --> 00:13:07,480 be a topic of next Tuesday's lecture. 
261 00:13:07,480 --> 00:13:09,520 If you're transferring a file and you 262 00:13:09,520 --> 00:13:12,070 want to check that you actually transferred that file, 263 00:13:12,070 --> 00:13:15,672 and there wasn't some person in the middle corrupting your file 264 00:13:15,672 --> 00:13:18,005 and making it look like it was what you wanted it to be, 265 00:13:18,005 --> 00:13:21,420 you use something called cryptographic hash functions, 266 00:13:21,420 --> 00:13:24,420 which [INAUDIBLE] will talk about on Tuesday. 267 00:13:24,420 --> 00:13:27,230 So tons of motivation for dictionaries. 268 00:13:27,230 --> 00:13:32,840 Let's actually do it, see how they are done. 269 00:13:35,630 --> 00:13:40,990 We're going to start with sort of a very simple straw man, 270 00:13:40,990 --> 00:13:44,089 and then we're going to improve it until, by the end of today, 271 00:13:44,089 --> 00:13:46,130 we have a really good way to solve the dictionary 272 00:13:46,130 --> 00:13:48,645 problem in constant time per operation. 273 00:13:54,190 --> 00:13:56,980 So the really simple approach is called a direct access table. 274 00:14:00,230 --> 00:14:05,450 So it's just a big table, an array. 275 00:14:05,450 --> 00:14:14,340 You have-- the index into the array is the key. 276 00:14:14,340 --> 00:14:27,530 So, store items in an array, indexed by key. 277 00:14:31,990 --> 00:14:34,200 And in fact, Python kind of makes you think about this 278 00:14:34,200 --> 00:14:36,520 because the Python notation for accessing dictionaries 279 00:14:36,520 --> 00:14:40,120 is identical to the notation for accessing arrays. 280 00:14:40,120 --> 00:14:41,810 But with arrays, the keys are restricted 281 00:14:41,810 --> 00:14:45,157 to be non-negative integers, 0 through n minus 1. 282 00:14:45,157 --> 00:14:46,740 So why not just implement it that way? 283 00:14:46,740 --> 00:14:49,140 If your keys happen to be integers 284 00:14:49,140 --> 00:14:52,410 I could just store all my items in a giant array.
285 00:14:52,410 --> 00:14:56,385 So if I just want to store an item here with key 2, 286 00:14:56,385 --> 00:15:00,130 call that, maybe, item 2, I just put that there. 287 00:15:00,130 --> 00:15:03,010 If I want to store something with key 4 288 00:15:03,010 --> 00:15:04,560 I'll just put it there. 289 00:15:04,560 --> 00:15:07,950 Everything else is going to be null, or none, or whatever. 290 00:15:07,950 --> 00:15:09,290 So lots of blank entries. 291 00:15:09,290 --> 00:15:13,380 Whatever keys I don't use I'll just put a null value there. 292 00:15:13,380 --> 00:15:16,000 Every key that I want to put into the dictionary 293 00:15:16,000 --> 00:15:19,620 I'll just store it at the corresponding position. 294 00:15:19,620 --> 00:15:20,720 What's bad about this? 295 00:15:25,060 --> 00:15:25,560 Yeah. 296 00:15:25,560 --> 00:15:28,379 AUDIENCE: It's hard to associate something with just an integer. 297 00:15:28,379 --> 00:15:30,670 PROFESSOR: Hard to associate something with an integer. 298 00:15:30,670 --> 00:15:31,170 Good. 299 00:15:31,170 --> 00:15:33,360 That's one problem. 300 00:15:33,360 --> 00:15:36,100 There's actually two big problems with this structure. 301 00:15:36,100 --> 00:15:37,580 I want both of them. 302 00:15:37,580 --> 00:15:48,040 So bad-- badness number one is keys may not be integers. 303 00:16:00,021 --> 00:16:00,520 Good. 304 00:16:03,070 --> 00:16:04,754 Another problem. 305 00:16:04,754 --> 00:16:05,254 Yeah. 306 00:16:05,254 --> 00:16:06,750 AUDIENCE: Possibility of collision. 307 00:16:06,750 --> 00:16:08,249 PROFESSOR: Possibility of collision. 308 00:16:08,249 --> 00:16:09,540 So here there's no collisions. 309 00:16:09,540 --> 00:16:11,040 We'll get to collisions in a moment, 310 00:16:11,040 --> 00:16:13,020 but a collision is when two items 311 00:16:13,020 --> 00:16:16,330 go to the same slot in this table. 312 00:16:16,330 --> 00:16:19,180 And we defined the problem so there weren't collisions. 
313 00:16:19,180 --> 00:16:21,390 We said whenever we insert an item with the same key you 314 00:16:21,390 --> 00:16:22,710 overwrite whatever is there. 315 00:16:22,710 --> 00:16:23,630 So collisions are OK. 316 00:16:23,630 --> 00:16:26,040 They will be a problem in a moment, so save your answer. 317 00:16:26,040 --> 00:16:26,540 Yeah? 318 00:16:26,540 --> 00:16:27,415 AUDIENCE: [INAUDIBLE] 319 00:16:29,511 --> 00:16:30,510 PROFESSOR: Running time? 320 00:16:30,510 --> 00:16:32,070 AUDIENCE: [INAUDIBLE] 321 00:16:32,070 --> 00:16:33,200 PROFESSOR: For deletion? 322 00:16:33,200 --> 00:16:35,033 Actually, running time is going to be great. 323 00:16:35,033 --> 00:16:37,860 If I want to insert-- I mean, I do these operations 324 00:16:37,860 --> 00:16:39,900 but on an array instead of a dictionary. 325 00:16:39,900 --> 00:16:42,430 So if I want to insert I just put something there. 326 00:16:42,430 --> 00:16:44,480 If I want to delete I just set it to null. 327 00:16:44,480 --> 00:16:46,950 If I want to search I just go there and see is it null? 328 00:16:46,950 --> 00:16:47,601 Yeah? 329 00:16:47,601 --> 00:16:49,100 AUDIENCE: It's a gigantic memory hog. 330 00:16:49,100 --> 00:16:50,600 PROFESSOR: It's a gigantic memory hog. 331 00:16:50,600 --> 00:16:51,835 I like that phrasing. 332 00:16:57,750 --> 00:16:58,920 Not always of course. 333 00:16:58,920 --> 00:17:03,100 If it happens that your keys are-- the set of possible keys 334 00:17:03,100 --> 00:17:06,470 is not too giant then life is good. 335 00:17:06,470 --> 00:17:08,593 Let's see if I cannot kill somebody today. 336 00:17:08,593 --> 00:17:09,859 Oh yes. 337 00:17:09,859 --> 00:17:11,650 Very good. 338 00:17:11,650 --> 00:17:13,490 But if you have a lot of keys, you 339 00:17:13,490 --> 00:17:17,849 need one slot in your array per key. 340 00:17:17,849 --> 00:17:19,290 That could be a lot. 341 00:17:19,290 --> 00:17:23,920 Maybe your keys are 64-bit integers.
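The direct access table described above can be sketched in a few lines. This is a minimal illustration, assuming keys are integers in range(universe_size); the class name, that parameter, and the bounds check are all additions for this example rather than anything from the lecture.

```python
class DirectAccessTable:
    """Sketch of a direct access table: the key itself is the array index."""

    def __init__(self, universe_size):
        # One slot per possible key, whether or not the key is ever used.
        self.slots = [None] * universe_size

    def _check(self, key):
        # Reject out-of-range keys (Python would otherwise wrap negatives).
        if not 0 <= key < len(self.slots):
            raise KeyError(key)

    def insert(self, key, item):
        self._check(key)
        self.slots[key] = item        # overwrites any existing item

    def delete(self, key):
        self._check(key)
        self.slots[key] = None

    def search(self, key):
        self._check(key)
        item = self.slots[key]
        if item is None:              # empty slot: no item with this key
            raise KeyError(key)
        return item

table = DirectAccessTable(8)
table.insert(2, "item 2")
table.insert(4, "item 4")
unused_slots = table.slots.count(None)
```

Every operation is a single array access, so it runs in constant time. The price is the memory: the table holds two items here but still allocates all eight slots.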
342 00:17:23,920 --> 00:17:28,089 Then you need 2^64 slots just to store one measly dictionary. 343 00:17:28,089 --> 00:17:30,210 That's huge. 344 00:17:30,210 --> 00:17:33,000 I guess there's also the running time to initialize that. 345 00:17:33,000 --> 00:17:35,530 But at the very least, you have a huge space hog. 346 00:17:35,530 --> 00:17:37,410 This is bad. 347 00:17:37,410 --> 00:17:40,820 So we're going to fix both of these problems one at a time. 348 00:17:40,820 --> 00:17:43,560 First problem we're going to talk about 349 00:17:43,560 --> 00:17:45,991 is what if your keys aren't integers? 350 00:17:45,991 --> 00:17:47,490 Because if your keys aren't integers 351 00:17:47,490 --> 00:17:48,430 you can't use this at all. 352 00:17:48,430 --> 00:17:50,179 So let's at least get something that works. 353 00:17:58,620 --> 00:18:00,410 And this is a notion called prehashing. 354 00:18:03,157 --> 00:18:05,240 I guess different people call it different things. 355 00:18:05,240 --> 00:18:07,800 Unfortunately Python calls it hash. 356 00:18:07,800 --> 00:18:11,710 It's not hashing, it's prehashing. 357 00:18:11,710 --> 00:18:13,960 Emphasize the "pre" here. 358 00:18:13,960 --> 00:18:19,250 So a prehash function maps whatever keys 359 00:18:19,250 --> 00:18:23,106 you have to non-negative integers. 360 00:18:28,314 --> 00:18:30,230 At this point we're not worrying about how big 361 00:18:30,230 --> 00:18:31,021 those integers are. 362 00:18:31,021 --> 00:18:32,270 They could be giant. 363 00:18:32,270 --> 00:18:34,920 We're not going to fix the second problem til later. 364 00:18:34,920 --> 00:18:37,810 First problem is if I have some key, maybe it's a string, 365 00:18:37,810 --> 00:18:42,682 it's whatever, it's an object, how do I map it to some integer 366 00:18:42,682 --> 00:18:44,390 so I could, at least in principle, put it 367 00:18:44,390 --> 00:18:48,052 in a direct access table.
368 00:18:48,052 --> 00:18:50,010 There's a theoretical answer to how to do this, 369 00:18:50,010 --> 00:18:52,560 and then there's the practical answer. how to do this. 370 00:18:52,560 --> 00:18:55,710 I'll start with the mathematical. 371 00:18:55,710 --> 00:19:04,725 In theory, I like this, keys are finite and discrete. 372 00:19:08,011 --> 00:19:08,510 OK. 373 00:19:08,510 --> 00:19:10,580 We know that anything on the computer 374 00:19:10,580 --> 00:19:13,590 could, ultimately, be written down as a string of bits. 375 00:19:13,590 --> 00:19:16,405 So a string of bits represents an integer. 376 00:19:16,405 --> 00:19:17,540 So we're done. 377 00:19:24,160 --> 00:19:27,840 So in theory, this is easy. 378 00:19:27,840 --> 00:19:30,211 And we're going to assume in this class, 379 00:19:30,211 --> 00:19:31,710 because it's sort of a theory class, 380 00:19:31,710 --> 00:19:33,202 that this is what's happening. 381 00:19:33,202 --> 00:19:34,660 At least for analysis, we're always 382 00:19:34,660 --> 00:19:37,040 going to analyze things as if this is what's happening. 383 00:19:37,040 --> 00:19:39,070 Now in reality, people don't always do this. 384 00:19:39,070 --> 00:19:44,060 In particular-- I'll go somewhere else. 385 00:19:44,060 --> 00:20:05,817 In Python it's not quite so simple, 386 00:20:05,817 --> 00:20:07,650 but at least you get to see what's going on. 387 00:20:07,650 --> 00:20:10,940 There's a function called hash, which should be called prehash, 388 00:20:10,940 --> 00:20:13,990 and it, given an object, it produces 389 00:20:13,990 --> 00:20:16,580 a non-- I'm not sure, actually, if it's non-negative. 390 00:20:16,580 --> 00:20:19,720 It's not a big deal if it has a minus sign because then you 391 00:20:19,720 --> 00:20:21,770 could just use this and get rid of the sign. 392 00:20:21,770 --> 00:20:24,590 But it maps every object to an integer, 393 00:20:24,590 --> 00:20:27,217 or every hashable object, technically. 
394 00:20:27,217 --> 00:20:28,800 But pretty much anything can be mapped 395 00:20:28,800 --> 00:20:31,350 to an integer, one way or another. 396 00:20:31,350 --> 00:20:33,350 And so for example, if you give it an integer 397 00:20:33,350 --> 00:20:35,040 it just returns the integer. 398 00:20:35,040 --> 00:20:36,220 So that's pretty easy. 399 00:20:36,220 --> 00:20:39,300 If you give it a string it does something. 400 00:20:39,300 --> 00:20:40,730 I don't know exactly what it does, 401 00:20:40,730 --> 00:20:41,813 but there are some issues. 402 00:20:41,813 --> 00:20:51,668 For example, hash of this string, '\0B', 403 00:20:51,668 --> 00:21:02,617 is equal to the hash of '\0\0C': both are 64. 404 00:21:02,617 --> 00:21:04,450 It's a little tricky to find these examples, 405 00:21:04,450 --> 00:21:06,140 but they're out there. 406 00:21:06,140 --> 00:21:08,390 And I guess, this is probably the lowest one 407 00:21:08,390 --> 00:21:10,640 in a certain measure. 408 00:21:10,640 --> 00:21:12,462 So it's a concern. 409 00:21:12,462 --> 00:21:14,670 In practice you have to be careful about these things 410 00:21:14,670 --> 00:21:17,540 because what you'd like-- in an ideal world, 411 00:21:17,540 --> 00:21:25,980 and in the theoretical world-- this prehash function of x, 412 00:21:25,980 --> 00:21:27,820 if it equals the prehash function of y, 413 00:21:27,820 --> 00:21:31,380 this should only happen when x=y, 414 00:21:31,380 --> 00:21:32,630 when they're the same thing. 415 00:21:35,450 --> 00:21:40,100 In the equals-equals sense, I guess, would be the technical version. 416 00:21:40,100 --> 00:21:42,830 Sadly, in Python this is not quite true. 417 00:21:42,830 --> 00:21:43,960 But mostly true. 418 00:21:48,030 --> 00:21:50,420 Let's see.
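Both views of prehashing discussed above can be sketched briefly. The function bit_prehash is a hypothetical helper written for this example: it reads a string's bytes as one large integer, which is the "any key is a string of bits" argument. The leading 0x01 byte is an addition of this sketch, so that keys differing only in leading zero bytes still map to different integers. Using abs to drop a sign from Python's hash is likewise one possible choice, not something the lecture prescribes.

```python
def bit_prehash(key: str) -> int:
    """Read the key's bytes as one big non-negative integer."""
    data = key.encode("utf-8")
    # Prefix with 0x01 so leading zero bytes in the key are not lost.
    return int.from_bytes(b"\x01" + data, "big")

small = bit_prehash("a")        # bytes 0x01 0x61

# Python's built-in hash plays the role of the prehash function.
same_int = hash(5)              # CPython returns small integers unchanged
text_hash = hash("hashing")     # some integer; it may be negative
non_negative = abs(text_hash)   # drop the sign if a non-negative value is needed
```

The integers produced by bit_prehash grow with the length of the key, which is fine at this stage: the size of the integers is the second problem, handled later by hashing proper. In recent Python versions the hash of a string also varies from one run to the next unless PYTHONHASHSEED is fixed, so specific string hash values should not be relied on.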
419 00:21:50,420 --> 00:21:53,460 If you define a custom object, you may know this, 420 00:21:53,460 --> 00:21:58,020 there is an __hash__ method you can implement, 421 00:21:58,020 --> 00:22:01,740 which tells Python what to do when you call hash 422 00:22:01,740 --> 00:22:02,480 of your object. 423 00:22:02,480 --> 00:22:05,380 If you don't, it uses the default 424 00:22:05,380 --> 00:22:08,060 of id, which is the physical location 425 00:22:08,060 --> 00:22:09,159 of your object in memory. 426 00:22:09,159 --> 00:22:11,450 So as long as your object isn't moving around in memory 427 00:22:11,450 --> 00:22:13,130 this is a pretty good hash function 428 00:22:13,130 --> 00:22:17,850 because no two items occupy the same space in memory. 429 00:22:17,850 --> 00:22:21,430 So that's just implementation side of things. 430 00:22:21,430 --> 00:22:28,010 Other implementation side of things is in Python, 431 00:22:28,010 --> 00:22:31,070 well, there's this distinction between objects and keys, 432 00:22:31,070 --> 00:22:32,020 I guess you would say. 433 00:22:32,020 --> 00:22:33,980 You really don't want this prehash function 434 00:22:33,980 --> 00:22:36,370 to change value. 435 00:22:36,370 --> 00:22:38,710 In, say, a direct access table, if you store-- 436 00:22:38,710 --> 00:22:41,260 you take an item, you compute the prehash function 437 00:22:41,260 --> 00:22:45,390 of the key in there, and you throw it in, and it says, 438 00:22:45,390 --> 00:22:47,615 oh, prehash value is four. 439 00:22:47,615 --> 00:22:48,990 Then you put it in position four. 440 00:22:48,990 --> 00:22:52,280 If that value changes, then when you go to search for that key, 441 00:22:52,280 --> 00:22:54,780 and you call prehash of that thing, and if it gives you five, 442 00:22:54,780 --> 00:22:57,570 you look in position five, and you say, oh, it's not there. 443 00:22:57,570 --> 00:23:00,070 So prehash really should not change.
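The failure just described, where a prehash of four later becomes five, can be reproduced with a tiny direct access table. Everything here is hypothetical and built only for illustration: the Key class deliberately derives its __hash__ from a field that can be changed, and the eight-slot list and the insert and search helpers stand in for the table.

```python
class Key:
    """Hypothetical key whose prehash depends on a mutable field."""

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return self.value             # prehash is just the current value

    def __eq__(self, other):
        return isinstance(other, Key) and self.value == other.value

table = [None] * 8                    # tiny direct access table

def insert(key, item):
    table[hash(key)] = item           # store at position prehash(key)

def search(key):
    return table[hash(key)]           # look at position prehash(key)

k = Key(4)
insert(k, "my item")
before = search(k)                    # finds the item in position four

k.value = 5                           # prehash silently changes from 4 to 5
after = search(k)                     # looks in position five: nothing there
```

The item is still sitting in position four, but nothing that starts from the key can find it any more. A safer design is to compute __hash__ only from fields that never change after construction.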
444 00:23:00,070 --> 00:23:03,140 If you ever implement this function don't mess with it. 445 00:23:03,140 --> 00:23:05,260 I mean, make sure it's defined in such a way 446 00:23:05,260 --> 00:23:06,970 that it doesn't change over time. 447 00:23:06,970 --> 00:23:10,622 Otherwise, you won't be able to find your items in the table. 448 00:23:10,622 --> 00:23:12,080 Python can't protect you from that. 449 00:23:15,320 --> 00:23:17,800 This is why, for example, if you have a list, 450 00:23:17,800 --> 00:23:20,530 which is a mutable object, you cannot put it into a hash table 451 00:23:20,530 --> 00:23:25,370 as a key value because it would change over time. 452 00:23:25,370 --> 00:23:29,740 Potentially, you'd append to the list, or whatever. 453 00:23:29,740 --> 00:23:31,680 All right. 454 00:23:31,680 --> 00:23:33,920 So hopefully you're reasonably happy with this. 455 00:23:33,920 --> 00:23:34,990 You could also think of it as we're 456 00:23:34,990 --> 00:23:36,948 going to assume keys are non-negative integers. 457 00:23:36,948 --> 00:23:38,755 But in practice, anything you have you 458 00:23:38,755 --> 00:23:42,770 can map to an integer, one way or another. 459 00:23:42,770 --> 00:23:44,380 The bigger problem in a certain sense, 460 00:23:44,380 --> 00:23:48,780 or the more interesting problem is reducing space. 461 00:23:48,780 --> 00:23:49,860 So how do we do that? 462 00:23:58,420 --> 00:23:59,740 This would be hashing. 463 00:24:03,880 --> 00:24:06,840 This is sort of the magic part of today's lecture. 464 00:24:06,840 --> 00:24:09,200 In case you're wondering, hashing 465 00:24:09,200 --> 00:24:12,010 has nothing to do with hashish. 466 00:24:12,010 --> 00:24:17,610 Hashish is an Arabic root word unrelated to the Germanic, 467 00:24:17,610 --> 00:24:20,220 which is hachet, I believe. 468 00:24:20,220 --> 00:24:20,900 Yeah. 469 00:24:20,900 --> 00:24:23,340 Or hacheh-- I guess, something like that.
470 00:24:23,340 --> 00:24:24,530 I'm not very good at German. 471 00:24:24,530 --> 00:24:25,910 Which means hatchet. 472 00:24:25,910 --> 00:24:26,410 OK 473 00:24:26,410 --> 00:24:28,400 It's like you take your key, and you cut it up 474 00:24:28,400 --> 00:24:31,060 into little pieces, and you mix them around and cut and dice, 475 00:24:31,060 --> 00:24:32,570 and it's like cooking. 476 00:24:32,570 --> 00:24:33,511 OK. 477 00:24:33,511 --> 00:24:34,010 What? 478 00:24:34,010 --> 00:24:34,900 AUDIENCE: Hash browns. 479 00:24:34,900 --> 00:24:36,400 PROFESSOR: Hash browns, for example. 480 00:24:36,400 --> 00:24:38,281 Yeah, same root. 481 00:24:38,281 --> 00:24:38,780 OK. 482 00:24:38,780 --> 00:24:41,611 It's like the only two English words with that kind of hash. 483 00:24:41,611 --> 00:24:42,110 OK. 484 00:24:42,110 --> 00:24:45,130 In our case, it's a verb, to hash. 485 00:24:45,130 --> 00:24:47,960 It means to cut into pieces and mix around. 486 00:24:47,960 --> 00:24:48,460 OK. 487 00:24:48,460 --> 00:24:51,130 That won't really be clear until towards the end of today's 488 00:24:51,130 --> 00:24:52,600 lecture, but we will eventually get 489 00:24:52,600 --> 00:24:55,140 to the etymology of hashing. 490 00:24:55,140 --> 00:24:58,060 Or, we've got the etymology, but why it's, actually, 491 00:24:58,060 --> 00:24:59,820 why we use that term. 492 00:24:59,820 --> 00:25:00,370 All right. 493 00:25:00,370 --> 00:25:10,860 So the big idea is we take all possible keys 494 00:25:10,860 --> 00:25:13,975 and we want to reduce them down to some small, small set 495 00:25:13,975 --> 00:25:14,475 of integers. 496 00:25:43,700 --> 00:25:45,810 Let me draw a picture of that. 497 00:25:55,640 --> 00:26:01,765 So we have this giant space of all possible keys. 498 00:26:01,765 --> 00:26:03,060 We'll call this key space. 499 00:26:06,080 --> 00:26:08,230 It's like outer space, basically. 500 00:26:08,230 --> 00:26:10,790 It's giant. 
501 00:26:10,790 --> 00:26:12,670 And if we stored a direct access table, 502 00:26:12,670 --> 00:26:13,730 this would also be giant. 503 00:26:13,730 --> 00:26:16,530 And we don't want to do that. 504 00:26:16,530 --> 00:26:21,500 We'd like to somehow map using a hash function h down 505 00:26:21,500 --> 00:26:22,800 to some smaller set. 506 00:26:22,800 --> 00:26:25,310 How do I want to draw this? 507 00:26:25,310 --> 00:26:25,980 Like an array. 508 00:26:30,820 --> 00:26:36,960 So we're going to have possible values 0 up to m minus 1. 509 00:26:36,960 --> 00:26:38,340 m is a new thing. 510 00:26:38,340 --> 00:26:40,400 It's going to be the size of our hash table. 511 00:26:40,400 --> 00:26:41,630 Let's call the hash table. 512 00:26:45,107 --> 00:26:48,200 I think we'll call it t also. 513 00:26:48,200 --> 00:26:51,230 And we'd somehow like to map-- 514 00:26:51,230 --> 00:26:51,730 All right. 515 00:26:51,730 --> 00:26:54,800 So there's a giant space of all possible keys, 516 00:26:54,800 --> 00:26:57,900 but then there's a subset of keys that are actually 517 00:26:57,900 --> 00:27:03,310 stored in this set, in this dictionary. 518 00:27:03,310 --> 00:27:05,160 At any moment in time there's some set 519 00:27:05,160 --> 00:27:07,730 of keys that are present. 520 00:27:07,730 --> 00:27:10,290 That set changes, but at any moment 521 00:27:10,290 --> 00:27:12,780 there's some keys that are actually there. 522 00:27:12,780 --> 00:27:17,180 k1, k2, k3, k4. 523 00:27:17,180 --> 00:27:20,890 I'd like to map them to positions in this table. 524 00:27:20,890 --> 00:27:26,590 So maybe I store k2-- or actually, item 2 would go here. 525 00:27:26,590 --> 00:27:34,000 In particular, this is when h of k2, if it equals zero, 526 00:27:34,000 --> 00:27:36,000 then you'd put item 2 there. 527 00:27:36,000 --> 00:27:39,780 Item 3, let's say, it's at position-- wow, 528 00:27:39,780 --> 00:27:42,240 3 would be a bit of a coincidence, but what the hell. 
529 00:27:42,240 --> 00:27:46,630 Maybe h of k3 equals 3. 530 00:27:46,630 --> 00:27:48,030 Then you'd put item 3 here. 531 00:27:50,540 --> 00:27:51,040 OK. 532 00:27:51,040 --> 00:27:51,750 You get the idea. 533 00:27:51,750 --> 00:27:54,180 So these four items each have a special position 534 00:27:54,180 --> 00:27:55,530 in their table. 535 00:27:55,530 --> 00:28:02,880 And the idea is we would like m to be around n. 536 00:28:07,550 --> 00:28:19,280 n is the number of keys in the dictionary right now. 537 00:28:19,280 --> 00:28:21,635 So if we could achieve that, the size of the table 538 00:28:21,635 --> 00:28:23,760 was proportional to the number of keys being stored 539 00:28:23,760 --> 00:28:26,970 in the dictionary, that would be good news because then 540 00:28:26,970 --> 00:28:30,110 the space is not gigantic and hoggish. 541 00:28:30,110 --> 00:28:33,657 It would just be linear, which is optimal. 542 00:28:33,657 --> 00:28:35,490 So if we want to store n things, maybe we'll 543 00:28:35,490 --> 00:28:38,630 use 2n space, or 3n space, but not much more. 544 00:28:41,740 --> 00:28:45,140 How the heck are we going to define such a function h? 545 00:28:45,140 --> 00:28:47,560 Well, that's the rest of the lecture. 546 00:28:47,560 --> 00:28:49,224 But even before we define a function h, 547 00:28:49,224 --> 00:28:50,640 do you see any problems with this? 548 00:28:55,580 --> 00:28:56,146 Yeah. 549 00:28:56,146 --> 00:28:57,062 AUDIENCE: [INAUDIBLE]. 550 00:29:02,764 --> 00:29:03,430 PROFESSOR: Yeah. 551 00:29:03,430 --> 00:29:05,560 This space over here, this is pigeonhole principle. 552 00:29:05,560 --> 00:29:07,570 The number of slots for your pigeons over here 553 00:29:07,570 --> 00:29:10,240 is way smaller than the number of possible pigeons. 554 00:29:10,240 --> 00:29:13,110 So there are going to be two keys that 555 00:29:13,110 --> 00:29:16,990 map to the same slot in the hash table.
556 00:29:16,990 --> 00:29:18,365 This is what we call a collision. 557 00:29:21,190 --> 00:29:24,500 Let's call this, I don't know, ki, kj. 558 00:29:28,047 --> 00:29:35,500 h of ki equals h of kj, but the keys are different. 559 00:29:35,500 --> 00:29:40,920 So ki does not equal kj, yet their hash functions 560 00:29:40,920 --> 00:29:42,840 are the same, hash values are the same. 561 00:29:42,840 --> 00:29:44,660 We call that a collision. 562 00:29:44,660 --> 00:29:48,990 And that's guaranteed to happen a lot, yet somehow, 563 00:29:48,990 --> 00:29:51,097 we can still make this work. 564 00:29:51,097 --> 00:29:51,805 That's the magic. 565 00:29:57,640 --> 00:29:59,350 And that is going to be chaining. 566 00:29:59,350 --> 00:30:02,080 We've done these guys. 567 00:30:02,080 --> 00:30:05,264 Next up is a technique for dealing with collisions. 568 00:30:05,264 --> 00:30:07,430 There are two techniques for dealing with collisions 569 00:30:07,430 --> 00:30:09,762 we're going to talk about in 006. 570 00:30:09,762 --> 00:30:11,720 One is called chaining, and next Tuesday, we'll 571 00:30:11,720 --> 00:30:15,450 see another method called open addressing. 572 00:30:15,450 --> 00:30:17,170 But let's start with chaining. 573 00:30:21,220 --> 00:30:24,730 The idea with chaining is simple. 574 00:30:24,730 --> 00:30:28,400 If you have multiple items here all with the same-- that 575 00:30:28,400 --> 00:30:32,860 hash to the same position, just store them as a list. 576 00:30:32,860 --> 00:30:35,050 I'm going to draw it as a linked list. 577 00:31:02,850 --> 00:31:06,688 I think I need a big picture here. 578 00:31:27,710 --> 00:31:35,270 So we have our nice universe, various keys that we actually 579 00:31:35,270 --> 00:31:37,700 have present. 580 00:31:37,700 --> 00:31:42,740 So these are the keys in the dictionary, 581 00:31:42,740 --> 00:31:44,280 and this is all of key space. 582 00:31:53,170 --> 00:31:56,170 These guys map to slots in the table.
583 00:31:56,170 --> 00:31:58,490 Some of them might map to the same value. 584 00:31:58,490 --> 00:32:04,975 So let's say k1 and k2, suppose they collide. 585 00:32:04,975 --> 00:32:06,620 So they both go to this slot. 586 00:32:06,620 --> 00:32:11,230 What we're going to store here is a linked list 587 00:32:11,230 --> 00:32:16,750 that stores item 1, and stores a pointer 588 00:32:16,750 --> 00:32:21,450 to the next item, which is item 2. 589 00:32:21,450 --> 00:32:23,380 And that's the end of the list. 590 00:32:23,380 --> 00:32:27,160 Or you could-- however you want to draw a null. 591 00:32:27,160 --> 00:32:30,320 So however many items there are, we're 592 00:32:30,320 --> 00:32:33,700 going to have a linked list of that length in that slot. 593 00:32:33,700 --> 00:32:37,440 So in particular, if there's just one item, like say, 594 00:32:37,440 --> 00:32:42,430 this k3 here, maybe it just maps to this slot. 595 00:32:42,430 --> 00:32:44,350 And maybe that's all that maps to that slot. 596 00:32:44,350 --> 00:32:48,330 In that case, we just say, follow this item 3, 597 00:32:48,330 --> 00:32:50,370 and there's no other items. 598 00:32:50,370 --> 00:32:52,680 Some slots are going to be completely empty. 599 00:32:52,680 --> 00:32:56,440 There's nothing there so you just store a null pointer. 600 00:32:56,440 --> 00:32:58,150 That is hashing with chaining. 601 00:32:58,150 --> 00:33:02,350 It's pretty simple, very simple really. 602 00:33:02,350 --> 00:33:05,550 The only question is why would you expect it to be any good? 603 00:33:05,550 --> 00:33:08,960 Because, in the worst case, if you fix your hash function 604 00:33:08,960 --> 00:33:11,920 here, h, there's going to be a whole bunch of keys 605 00:33:11,920 --> 00:33:13,370 that all map to the same slot. 606 00:33:13,370 --> 00:33:16,330 And so in the worst case, those are the keys that you insert, 607 00:33:16,330 --> 00:33:17,747 and they all go here.
608 00:33:17,747 --> 00:33:19,580 And then you have this fancy data structure. 609 00:33:19,580 --> 00:33:23,100 And in the end, all you have is a linked list of all n items. 610 00:33:23,100 --> 00:33:30,950 So the worst case is theta n. 611 00:33:30,950 --> 00:33:34,520 And this is going to be true for any hashing scheme, actually. 612 00:33:34,520 --> 00:33:36,710 In the worst case, hashing sucks. 613 00:33:36,710 --> 00:33:39,400 Yet in practice, it works really, really well. 614 00:33:39,400 --> 00:33:41,960 And the reason is randomization, essentially, 615 00:33:41,960 --> 00:33:45,620 that this hash function, unless you're really unlucky, 616 00:33:45,620 --> 00:33:48,270 the hash function will nicely distribute your items, 617 00:33:48,270 --> 00:33:52,700 and most of these lists will have constant length. 618 00:33:52,700 --> 00:34:00,720 We're going to prove that under an assumption. 619 00:34:00,720 --> 00:34:02,380 We'll have to warm up a little bit. 620 00:34:07,000 --> 00:34:09,739 But I'm also going to cop out a little, as you'll see. 621 00:34:22,960 --> 00:34:27,250 So in 006 we're going to make an assumption called Simple 622 00:34:27,250 --> 00:34:29,080 Uniform Hashing. 623 00:34:29,080 --> 00:34:31,255 OK. 624 00:34:31,255 --> 00:34:35,850 And this is an assumption, it's an unrealistic assumption. 625 00:34:35,850 --> 00:34:40,330 I would go so far as to say it's false, a false assumption. 626 00:34:40,330 --> 00:34:42,236 But it's really convenient for analysis, 627 00:34:42,236 --> 00:34:43,610 and it's going to make it obvious 628 00:34:43,610 --> 00:34:45,580 why chaining is a good idea. 629 00:34:45,580 --> 00:34:48,139 Sadly, the assumption isn't quite true, 630 00:34:48,139 --> 00:34:49,969 but it gives you a flavor.
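Hashing with chaining, as drawn on the board, could be sketched roughly like this. The class name, the method names, and the use of `hash(key) % m` as a stand-in hash function are all assumptions for illustration, and Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    """Illustrative dictionary using hashing with chaining."""

    def __init__(self, m=8):
        self.m = m                              # number of slots
        self.slots = [[] for _ in range(m)]     # one chain per slot

    def _h(self, key):
        # Placeholder hash function mapping keys to 0 .. m-1.
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                        # keys are unique: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:   # walk the chain
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
        raise KeyError(key)


table = ChainedHashTable(m=4)
for k in (1, 5, 9):                 # all three collide: k % 4 == 1
    table.insert(k, f"item{k}")
print(table.slots)
```

With m = 4, the keys 1, 5, and 9 all land in slot 1, so that slot holds a chain of length three while the others stay empty. This is the bad case in miniature: search on that slot has to walk the whole chain.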
631 00:34:49,969 --> 00:34:52,080 If you want to see why hashing is actually good, 632 00:34:52,080 --> 00:34:53,955 I'm going to hint at it at the end of lecture 633 00:34:53,955 --> 00:34:55,611 but you really should take 6.046. Yeah. 634 00:34:55,611 --> 00:34:56,902 AUDIENCE: [INAUDIBLE] question. 635 00:34:56,902 --> 00:34:59,182 Is the hashing function [INAUDIBLE]? 636 00:34:59,182 --> 00:35:01,348 Like, how do we know the array is still [INAUDIBLE]? 637 00:35:01,348 --> 00:35:01,931 PROFESSOR: OK. 638 00:35:01,931 --> 00:35:07,620 The hashing function-- I guess I didn't specify up here. 639 00:35:07,620 --> 00:35:14,160 The hashing function maps your universe to 0, 1, 640 00:35:14,160 --> 00:35:17,520 up to m minus 1. That's the definition. 641 00:35:17,520 --> 00:35:23,090 So it's guaranteed to reduce the space of keys to just m slots. 642 00:35:23,090 --> 00:35:25,467 So your hashing function needs to know what m is. 643 00:35:25,467 --> 00:35:27,800 In reality there's not going to be one hashing function, 644 00:35:27,800 --> 00:35:30,669 there's going to be 1 for each m, or at least one for each m. 645 00:35:30,669 --> 00:35:32,460 And so, depending on how big your table is, 646 00:35:32,460 --> 00:35:34,337 you use the corresponding hash function. 647 00:35:34,337 --> 00:35:35,170 Yeah, good question. 648 00:35:35,170 --> 00:35:36,545 So the hash function is what does 649 00:35:36,545 --> 00:35:39,110 the work of reducing your key space down 650 00:35:39,110 --> 00:35:40,730 to a small set of slots. 651 00:35:40,730 --> 00:35:44,180 So that's what's going to give us low space. 652 00:35:44,180 --> 00:35:44,850 OK. 653 00:35:44,850 --> 00:35:47,006 But now, how do we get low time? 654 00:35:47,006 --> 00:35:49,255 Let me just state this assumption and get to business. 655 00:36:33,300 --> 00:36:35,510 Simple uniform hashing is, essentially, 656 00:36:35,510 --> 00:36:38,050 two probabilistic assumptions.
657 00:36:38,050 --> 00:36:41,360 The first one is uniformity. 658 00:36:41,360 --> 00:36:44,070 If you take some key in your space 659 00:36:44,070 --> 00:36:46,170 that you want to store, the hash function 660 00:36:46,170 --> 00:36:49,230 maps it to a uniform random choice. 661 00:36:49,230 --> 00:36:51,540 This, of course, is what you want to happen. 662 00:36:51,540 --> 00:36:58,271 Each of these slots here is equally likely to be hashed to. 663 00:36:58,271 --> 00:36:58,770 OK. 664 00:36:58,770 --> 00:37:00,020 That's a good start. 665 00:37:00,020 --> 00:37:03,550 But to do proper analysis, not only do we need uniformity, 666 00:37:03,550 --> 00:37:05,745 we also need independence. 667 00:37:05,745 --> 00:37:07,870 So not only is this true for each key individually, 668 00:37:07,870 --> 00:37:10,210 but it's true for all the keys together. 669 00:37:10,210 --> 00:37:13,500 So if key one maps to a uniform random place, 670 00:37:13,500 --> 00:37:16,840 no matter where it goes, key two also 671 00:37:16,840 --> 00:37:18,270 maps to a uniform random place. 672 00:37:18,270 --> 00:37:19,644 And no matter where those two go, 673 00:37:19,644 --> 00:37:22,040 key three maps to a uniform random place. 674 00:37:22,040 --> 00:37:23,640 This really can't be true. 675 00:37:23,640 --> 00:37:27,830 But if it's true, we can prove that this takes constant time. 676 00:37:27,830 --> 00:37:29,500 So let me do that. 677 00:37:41,180 --> 00:37:45,660 So under this assumption, we can analyze 678 00:37:45,660 --> 00:37:51,400 hashing-- hashing with chaining is what this method is called. 679 00:37:51,400 --> 00:37:56,400 So let's do it. 680 00:37:56,400 --> 00:37:59,319 I want to know-- I got to cheat, sorry. 681 00:37:59,319 --> 00:38:00,610 I got to remember the notation. 682 00:38:03,690 --> 00:38:05,460 I don't have any good notation here. 683 00:38:05,460 --> 00:38:08,100 All right. 684 00:38:08,100 --> 00:38:12,965 What I'd like to know is the expected length of a chain.
685 00:38:18,460 --> 00:38:18,960 OK. 686 00:38:18,960 --> 00:38:25,160 Now this is if I have n keys that are stored in the table, 687 00:38:25,160 --> 00:38:29,220 and m slots in the table, then what 688 00:38:29,220 --> 00:38:32,030 is the expected length of a chain? 689 00:38:32,030 --> 00:38:33,131 Any suggestions. 690 00:38:33,131 --> 00:38:33,630 Yeah. 691 00:38:33,630 --> 00:38:35,870 AUDIENCE: 1 over m to the n. 692 00:38:35,870 --> 00:38:37,480 PROFESSOR: 1 over m to the n? 693 00:38:37,480 --> 00:38:41,700 That's going to be a probability of something. 694 00:38:41,700 --> 00:38:42,200 Not quite. 695 00:38:42,200 --> 00:38:43,370 AUDIENCE: [INAUDIBLE] 696 00:38:43,370 --> 00:38:44,786 PROFESSOR: That's between 0 and 1. 697 00:38:44,786 --> 00:38:47,100 It's probably at least one, or something. 698 00:38:47,100 --> 00:38:47,855 Yeah. 699 00:38:47,855 --> 00:38:49,190 AUDIENCE: m over n. 700 00:38:49,190 --> 00:38:51,136 PROFESSOR: n over m, yeah. 701 00:38:54,630 --> 00:38:56,070 It's really easy. 702 00:38:56,070 --> 00:39:00,010 The chance of a key going to a particular slot is 1 over m. 703 00:39:00,010 --> 00:39:03,020 They're all independent, so it's 1 over m, plus 1 over m, 704 00:39:03,020 --> 00:39:05,160 plus 1 over m, n times. 705 00:39:05,160 --> 00:39:07,100 So it's n over m. 706 00:39:07,100 --> 00:39:10,730 This is really easy when you have independence. 707 00:39:10,730 --> 00:39:13,210 Sadly, in the real world, you don't have independence. 708 00:39:13,210 --> 00:39:15,806 We're going to call this thing alpha, 709 00:39:15,806 --> 00:39:21,560 and it's also known as the load factor of the table. 710 00:39:21,560 --> 00:39:24,960 So if it's one, n equals m. 711 00:39:24,960 --> 00:39:27,650 And so the length of a chain is one. 712 00:39:27,650 --> 00:39:31,350 If it's 10, then you have 10 times as many elements 713 00:39:31,350 --> 00:39:32,130 as you have slots. 
714 00:39:32,130 --> 00:39:34,560 But still, the expected length of a chain is 10. 715 00:39:34,560 --> 00:39:35,660 That's a constant. 716 00:39:35,660 --> 00:39:36,470 It's OK. 717 00:39:36,470 --> 00:39:38,664 If it's a half, that's OK. 718 00:39:38,664 --> 00:39:41,080 It means that you have a bigger table than you have items. 719 00:39:41,080 --> 00:39:45,210 As long as it's a constant, as long as we have-- I 720 00:39:45,210 --> 00:39:49,817 erased it by now-- as long as m is theta n, 721 00:39:49,817 --> 00:39:51,025 this is going to be constant. 722 00:39:55,730 --> 00:39:57,710 And so we need to maintain this property. 723 00:39:57,710 --> 00:40:00,290 But as long as you set your table size to the right value, 724 00:40:00,290 --> 00:40:04,730 to be roughly n, this will be constant. 725 00:40:04,730 --> 00:40:12,900 And so the running time of an operation, insert, delete, 726 00:40:12,900 --> 00:40:17,480 and search-- Well, search is really 727 00:40:17,480 --> 00:40:20,430 the hardest because when you want to search for a key, 728 00:40:20,430 --> 00:40:24,700 you map it into your table, then you walk the linked list 729 00:40:24,700 --> 00:40:26,692 and look for the key that you're searching for. 730 00:40:26,692 --> 00:40:28,400 Now is this the key you're searching for? 731 00:40:28,400 --> 00:40:30,350 No, it's not the key you're searching for. 732 00:40:30,350 --> 00:40:31,957 Is this the key you're searching for? 733 00:40:31,957 --> 00:40:33,790 Those are not the keys you're searching for. 734 00:40:33,790 --> 00:40:34,510 You keep going. 735 00:40:34,510 --> 00:40:36,650 Either you find your key or you don't. 736 00:40:36,650 --> 00:40:40,267 But in the worst case, you have to walk the entire list. 737 00:40:40,267 --> 00:40:42,350 Sorry for the bad Star Trek reference-- Star Wars. 738 00:40:42,350 --> 00:40:45,110 God. 739 00:40:45,110 --> 00:40:45,930 I'm not awake. 740 00:40:45,930 --> 00:40:48,370 All right.
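The n over m claim can be sanity-checked with a small simulation. This is only a sketch under the simple uniform hashing assumption: each key picks a slot uniformly and independently using Python's `random` module, with no real hash function involved. The function name, the parameter values, and the seed are arbitrary choices for illustration.

```python
import random


def average_chain_length(n, m, trials, seed=0):
    """Estimate the expected length of one fixed chain (slot 0)."""
    rng = random.Random(seed)
    total = 0
    for _trial in range(trials):
        # Each of the n keys independently picks one of m slots;
        # count how many of them land in slot 0.
        total += sum(1 for _key in range(n) if rng.randrange(m) == 0)
    return total / trials


n, m = 1000, 100
alpha = n / m                       # load factor
estimate = average_chain_length(n, m, trials=300)
print(alpha, estimate)
```

The estimate should come out close to the load factor alpha = 10, matching the "1 over m, added up n times" argument.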
741 00:40:48,370 --> 00:40:50,820 In general, the running time, in the worst case, 742 00:40:50,820 --> 00:40:53,785 is 1 plus the length of your chain. 743 00:40:56,330 --> 00:40:56,830 OK. 744 00:40:56,830 --> 00:40:59,340 So it's going to be 1 plus alpha. 745 00:40:59,340 --> 00:41:00,970 Why do I write one? 746 00:41:00,970 --> 00:41:04,930 Well, because alpha can be much smaller than 1, in general. 747 00:41:04,930 --> 00:41:06,420 And you always have to pay the cost 748 00:41:06,420 --> 00:41:07,810 of computing the hash function. 749 00:41:07,810 --> 00:41:10,770 We're going to assume that takes constant time. 750 00:41:10,770 --> 00:41:13,200 And then you have to follow the first pointer. 751 00:41:13,200 --> 00:41:17,590 So you always pay constant time, but then you also pay alpha. 752 00:41:17,590 --> 00:41:20,470 That's your expected length. 753 00:41:20,470 --> 00:41:20,970 OK. 754 00:41:20,970 --> 00:41:21,930 That's the analysis. 755 00:41:21,930 --> 00:41:23,080 It's super simple. 756 00:41:23,080 --> 00:41:25,860 If you assume Simple Uniform Hashing, 757 00:41:25,860 --> 00:41:30,550 it's clear, as long as your load factor is constant, m is theta n, 758 00:41:30,550 --> 00:41:33,490 you get constant running time for all your operations. 759 00:41:33,490 --> 00:41:34,530 Life is good. 760 00:41:34,530 --> 00:41:37,010 This is the intuition of why hashing works. 761 00:41:37,010 --> 00:41:39,140 It's not really why hashing works. 762 00:41:39,140 --> 00:41:43,176 But it's about as far as we're going to get in 006. 763 00:41:43,176 --> 00:41:44,800 I'm going to tell you a little bit more 764 00:41:44,800 --> 00:41:49,380 about why hashing is actually good in practice and in theory. 765 00:42:06,820 --> 00:42:10,020 What are we up to? 766 00:42:10,020 --> 00:42:12,740 Last topic is hash functions. 767 00:42:12,740 --> 00:42:16,380 The one remaining thing is how do I construct h?
768 00:42:16,380 --> 00:42:19,800 How do I actually map from this giant universe of keys 769 00:42:19,800 --> 00:42:24,961 to this small set of slots in the table, there's m of them? 770 00:42:29,260 --> 00:42:34,140 I'm going to give you three hash functions, two of which are, 771 00:42:34,140 --> 00:42:37,210 let's say, common practice, and the third of which is actually 772 00:42:37,210 --> 00:42:38,710 theoretically good. 773 00:42:38,710 --> 00:42:40,930 So the first two are not good theoretically. 774 00:42:40,930 --> 00:42:43,060 You can prove that they're bad, but at least they 775 00:42:43,060 --> 00:42:45,190 give you some flavor, and they're 776 00:42:45,190 --> 00:42:51,979 still common in practice because a lot of the time they're OK, 777 00:42:51,979 --> 00:42:53,770 but you can't really prove much about them. 778 00:42:56,490 --> 00:42:56,990 OK. 779 00:42:56,990 --> 00:43:03,000 So first method, sort of the obvious one, 780 00:43:03,000 --> 00:43:04,940 called the division method. 781 00:43:04,940 --> 00:43:06,820 And if you have a key, this could 782 00:43:06,820 --> 00:43:09,950 be a giant key, huge universe of keys, 783 00:43:09,950 --> 00:43:14,065 you just take that key, modulo m, 784 00:43:14,065 --> 00:43:16,190 that gives you a number between zero and m minus 1. 785 00:43:16,190 --> 00:43:17,110 Done. 786 00:43:17,110 --> 00:43:19,542 It's so easy. 787 00:43:19,542 --> 00:43:21,000 I'm not going to tell you in detail 788 00:43:21,000 --> 00:43:22,660 why this is a bad method. 789 00:43:22,660 --> 00:43:24,060 Maybe you can think about it. 790 00:43:24,060 --> 00:43:29,890 It's especially bad if m has some common factors with k. 791 00:43:29,890 --> 00:43:32,980 Like, let's say k is even always, 792 00:43:32,980 --> 00:43:34,974 and m is even also because you say, 793 00:43:34,974 --> 00:43:36,890 oh, I'd like a table the size of power of two. 794 00:43:36,890 --> 00:43:37,969 That seems natural. 
795 00:43:37,969 --> 00:43:39,760 Then that will be really bad because you'll 796 00:43:39,760 --> 00:43:41,650 use only half the table. 797 00:43:41,650 --> 00:43:44,060 There are lots of situations where this is bad. 798 00:43:44,060 --> 00:43:46,640 In practice, it's pretty good. 799 00:43:46,640 --> 00:43:49,436 If m is prime, you always choose a prime table size, 800 00:43:49,436 --> 00:43:51,060 so you don't have those common factors. 801 00:43:51,060 --> 00:43:54,610 And it's not very close to a power of 2 or power of 10 802 00:43:54,610 --> 00:43:57,920 because real world powers of 2's and 10's are common. 803 00:43:57,920 --> 00:43:59,740 But it's very hackish, OK? 804 00:43:59,740 --> 00:44:02,990 It works a lot of the time but not always. 805 00:44:02,990 --> 00:44:07,570 A cooler method-- I think it's cooler-- still, 806 00:44:07,570 --> 00:44:14,290 you can't prove much about it-- Division didn't 807 00:44:14,290 --> 00:44:17,290 seem to work so great, so how about multiplication? 808 00:44:17,290 --> 00:44:18,140 What does that mean? 809 00:44:18,140 --> 00:44:20,420 Multiply by m, that wouldn't be very good. 810 00:44:20,420 --> 00:44:24,790 Now, it's a bit different. 811 00:44:24,790 --> 00:44:30,780 We're going to take the key, multiply it by an integer, a, 812 00:44:30,780 --> 00:44:35,000 and then we're going to do this crazy, crazy stuff. 813 00:44:35,000 --> 00:44:41,920 Take it mod 2 to the w and then shift it right, w minus r. 814 00:44:41,920 --> 00:44:42,450 OK. 815 00:44:42,450 --> 00:44:43,890 What is w? 816 00:44:43,890 --> 00:44:48,380 We're assuming that we're in a w-bit machine. 817 00:44:48,380 --> 00:44:51,780 Remember way back in models of computation? 818 00:44:51,780 --> 00:44:54,720 Your machine has a word size, it's w bits. 819 00:44:54,720 --> 00:44:56,450 So let's suppose it's w bits. 820 00:44:56,450 --> 00:44:59,530 So we have our key, k. 821 00:44:59,530 --> 00:45:00,050 Here it is. 
822 00:45:00,050 --> 00:45:01,160 It's w bits long. 823 00:45:03,930 --> 00:45:07,340 We take some number, a-- think of a as being 824 00:45:07,340 --> 00:45:12,070 a random integer among all possible w bit integers. 825 00:45:12,070 --> 00:45:17,140 So it's got some zeros, it's got some ones. 826 00:45:17,140 --> 00:45:18,950 And I multiply these. 827 00:45:18,950 --> 00:45:20,630 What does multiplication mean in binary? 828 00:45:20,630 --> 00:45:25,560 Well, I take one of these copies of k for each one that's here. 829 00:45:25,560 --> 00:45:27,560 So I'm going to take one copy here 830 00:45:27,560 --> 00:45:29,320 because there's a one there. 831 00:45:29,320 --> 00:45:32,560 I'm going to take one copy here because there's a one there. 832 00:45:32,560 --> 00:45:35,510 And I'm going to take one copy here 833 00:45:35,510 --> 00:45:37,860 because there's a one there. 834 00:45:37,860 --> 00:45:40,990 And on average, half of them will be ones. 835 00:45:40,990 --> 00:45:46,150 So I have various copies of k, and then I just add them up. 836 00:45:46,150 --> 00:45:47,420 And you know, stuff happens. 837 00:45:47,420 --> 00:45:50,080 I get some gobbledygook here. 838 00:45:50,080 --> 00:45:50,580 OK. 839 00:45:50,580 --> 00:45:51,270 How big is it? 840 00:45:51,270 --> 00:45:53,710 In general, it's two words long. 841 00:45:53,710 --> 00:45:57,090 When I multiply two words I get two words. 842 00:45:57,090 --> 00:45:59,190 It could be twice as long, in general. 843 00:45:59,190 --> 00:46:03,480 And what this business is doing is saying take the right word, 844 00:46:03,480 --> 00:46:08,590 this right half here-- let the right word in, I guess, 845 00:46:08,590 --> 00:46:12,520 if you see vampire movies-- and then shift 846 00:46:12,520 --> 00:46:16,704 right-- this is a shift right operation-- by w minus r. 847 00:46:16,704 --> 00:46:17,870 I didn't even say what r is. 848 00:46:17,870 --> 00:46:21,130 But basically, what I want is these bits. 
849 00:46:21,130 --> 00:46:24,780 I want r bits here-- this is w bits. 850 00:46:24,780 --> 00:46:29,258 I want the leftmost r bits of the rightmost w bits 851 00:46:29,258 --> 00:46:32,510 because I shift right here and get rid of all these guys. 852 00:46:32,510 --> 00:46:36,644 r-- I should say, m, is two to the r. 853 00:46:36,644 --> 00:46:38,060 So I'm going to assume here I have 854 00:46:38,060 --> 00:46:42,370 a table of size a power of 2, and then this number will 855 00:46:42,370 --> 00:46:47,440 be a number between 0 and m minus 1. 856 00:46:47,440 --> 00:46:47,940 OK. 857 00:46:47,940 --> 00:46:50,260 Why does this work? 858 00:46:50,260 --> 00:46:52,265 It's intuitive. 859 00:46:52,265 --> 00:46:54,390 In practice it works quite well because what you're 860 00:46:54,390 --> 00:46:57,090 doing is taking a whole bunch of sort of randomly 861 00:46:57,090 --> 00:47:00,200 shifted copies of k, adding them up-- you get carries, 862 00:47:00,200 --> 00:47:02,690 things get mixed up-- This is hashing. 863 00:47:02,690 --> 00:47:04,830 This is-- you're taking k, sort of cutting it up 864 00:47:04,830 --> 00:47:08,040 while you're shifting it around, adding things and they collide, 865 00:47:08,040 --> 00:47:09,660 and weird stuff happens. 866 00:47:09,660 --> 00:47:11,670 You sort of randomize stuff. 867 00:47:11,670 --> 00:47:13,440 Out here, you don't get much randomization 868 00:47:13,440 --> 00:47:15,420 because most-- like the last bit could just 869 00:47:15,420 --> 00:47:16,920 be this one bit of k. 870 00:47:16,920 --> 00:47:19,730 But in the middle, everybody's kind of colliding together. 871 00:47:19,730 --> 00:47:21,190 And so intuitively, you're mixing 872 00:47:21,190 --> 00:47:22,650 lots of things in the center. 873 00:47:22,650 --> 00:47:25,310 You take those r bits, roughly, in the center. 874 00:47:25,310 --> 00:47:27,550 That will be nicely mixed up. 875 00:47:27,550 --> 00:47:29,280 And most of the time this works well. 
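The division and multiplication methods described so far could be written out as follows. This is a sketch only: the function names are made up, w = 64 is an assumed word size, and the multiplier `A` is an arbitrary odd 64-bit constant picked for illustration, not a value given in the lecture.

```python
def division_hash(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


def multiplication_hash(k, a, w, r):
    """Multiplication method: h(k) = ((a * k) mod 2^w) >> (w - r).

    Keeps the low w bits of the product, then takes the leftmost r
    of those bits, giving a slot in 0 .. 2^r - 1.
    """
    return ((a * k) % (1 << w)) >> (w - r)


# Division method with even keys and an even table size m = 8:
# only the even-numbered slots are ever used, half the table.
even_key_slots = {division_hash(k, 8) for k in range(0, 100, 2)}
print(sorted(even_key_slots))

# Multiplication method with w = 64 and r = 3, so m = 2^3 = 8 slots.
W, R = 64, 3
A = 0x9E3779B97F4A7C15  # arbitrary odd 64-bit multiplier (assumption)
mult_slots = [multiplication_hash(k, A, W, R) for k in range(0, 100, 2)]
print(mult_slots[:10])
```

The first print shows the "half the table" problem from the division method directly: the even keys only ever reach slots 0, 2, 4, and 6. Every multiplication-method result is guaranteed to lie between 0 and m minus 1, though nothing here proves the slots are well spread.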
876 00:47:29,280 --> 00:47:33,950 In practice it works well-- I have some things written here. 877 00:47:33,950 --> 00:47:37,380 a better be odd, otherwise you're throwing away stuff. 878 00:47:37,380 --> 00:47:39,980 And it should not be very close to a power of 2. 879 00:47:39,980 --> 00:47:44,840 But it should be in between 2 to the r minus 1 and 2 to the r. 880 00:47:44,840 --> 00:47:47,080 Cool. 881 00:47:47,080 --> 00:47:47,580 One more. 882 00:47:52,750 --> 00:47:55,230 Again, theoretically, this can be bad. 883 00:47:55,230 --> 00:47:57,930 And I leave it as an exercise to find situations, find 884 00:47:57,930 --> 00:48:00,440 key values where this does not do a good job. 885 00:48:03,790 --> 00:48:07,540 The cool method is called universal hashing. 886 00:48:11,120 --> 00:48:14,495 This is something that's a bit beyond the scope of 006. 887 00:48:14,495 --> 00:48:17,440 If you want to understand it better you should take 046. 888 00:48:17,440 --> 00:48:21,437 But I'll give you the flavor and the method, one of the methods. 889 00:48:21,437 --> 00:48:23,020 There's actually many ways to do this. 890 00:48:33,690 --> 00:48:34,999 We see a mod m on the outside. 891 00:48:34,999 --> 00:48:37,540 That's just division method just to make the number between 0 892 00:48:37,540 --> 00:48:39,760 and m minus 1. 893 00:48:39,760 --> 00:48:41,245 Here's our key. 894 00:48:41,245 --> 00:48:42,870 And then there's these numbers a and b. 895 00:48:42,870 --> 00:48:49,220 These are going to be random numbers between 0 896 00:48:49,220 --> 00:48:51,350 and p minus 1. 897 00:48:51,350 --> 00:48:52,490 What's p? 898 00:48:52,490 --> 00:48:58,660 Prime number bigger than the size of the universe. 899 00:48:58,660 --> 00:49:00,430 So it's a big prime number. 900 00:49:00,430 --> 00:49:03,870 I think we know how to find prime numbers. 901 00:49:03,870 --> 00:49:05,770 We don't know in this class, but people 902 00:49:05,770 --> 00:49:07,740 know how to find the prime numbers.
903 00:49:07,740 --> 00:49:09,977 So there's a subroutine here, find a big prime number 904 00:49:09,977 --> 00:49:11,060 bigger than your universe. 905 00:49:11,060 --> 00:49:12,268 It's not too hard to do that. 906 00:49:12,268 --> 00:49:15,369 We can do it in polynomial time. 907 00:49:15,369 --> 00:49:16,160 That's just setup. 908 00:49:16,160 --> 00:49:19,220 You do that once for a given size table. 909 00:49:19,220 --> 00:49:23,916 And then you choose two random numbers, a and b. 910 00:49:23,916 --> 00:49:25,790 And then this is the hash function, a times k 911 00:49:25,790 --> 00:49:28,980 plus b, mod p mod m. 912 00:49:28,980 --> 00:49:29,480 OK. 913 00:49:29,480 --> 00:49:32,590 What does this do? 914 00:49:32,590 --> 00:49:35,810 It turns out-- here's the interesting part. 915 00:49:35,810 --> 00:49:45,260 For worst case keys, k1 and k2, that are distinct, 916 00:49:45,260 --> 00:49:56,650 the probability of h of k1 equaling h of k2 is 1 over m. 917 00:49:56,650 --> 00:49:59,820 So probability of two keys that are different colliding 918 00:49:59,820 --> 00:50:03,072 is 1 over m, for the worst case keys. 919 00:50:03,072 --> 00:50:04,280 What the heck does that mean? 920 00:50:04,280 --> 00:50:05,770 What's the probability over? 921 00:50:05,770 --> 00:50:08,390 Any suggestions? 922 00:50:08,390 --> 00:50:11,450 What's random here? 923 00:50:11,450 --> 00:50:12,200 AUDIENCE: a and b. 924 00:50:12,200 --> 00:50:13,090 PROFESSOR: a and b. 925 00:50:13,090 --> 00:50:15,250 This is the probability over a and b. 926 00:50:15,250 --> 00:50:18,350 This is the probability over the choice of your hash function. 927 00:50:18,350 --> 00:50:22,030 So it's the worst case inputs, worst case insertions, 928 00:50:22,030 --> 00:50:24,730 but random hash function. 929 00:50:24,730 --> 00:50:26,730 As long as you choose your random hash function, 930 00:50:26,730 --> 00:50:28,550 the probability of collision is 1 over m.
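As a rough sketch of the hash function just described, ((a times k plus b) mod p) mod m, the Python below draws a and b at random from 0 to p minus 1 and then estimates how often one fixed pair of distinct keys collides across many random choices of the function. The names `make_universal_hash` and `collision_rate` are hypothetical, and the prime `P = 2**31 - 1` is an assumption that only suits integer keys smaller than that value.

```python
import random

# Assumption: 2^31 - 1 is prime and larger than every key we will hash.
P = 2 ** 31 - 1


def make_universal_hash(m, p=P, rng=random):
    """Pick random a, b in [0, p) and return h(k) = ((a*k + b) mod p) mod m."""
    a = rng.randrange(p)
    b = rng.randrange(p)

    def h(k):
        return ((a * k + b) % p) % m

    return h


def collision_rate(k1, k2, m, trials=20000, seed=0):
    """Fraction of randomly chosen hash functions under which k1 and k2 collide."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    hits = 0
    for _ in range(trials):
        h = make_universal_hash(m, rng=rng)
        hits += h(k1) == h(k2)
    return hits / trials


# The keys are fixed in advance; only the hash function is random.
print(collision_rate(12345, 67890, m=10))
```

The printed estimate should come out near 1 over m. The randomness is over a and b, not over the keys.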
931 00:50:28,550 --> 00:50:31,130 This is the ideal situation. 932 00:50:31,130 --> 00:50:34,140 And so you can prove, just like we analyzed here-- 933 00:50:34,140 --> 00:50:35,140 It's a little more work. 934 00:50:35,140 --> 00:50:35,910 It's in the notes. 935 00:50:35,910 --> 00:50:37,560 You use linearity of expectation. 936 00:50:37,560 --> 00:50:39,700 And you can prove, still, that the expected length 937 00:50:39,700 --> 00:50:42,620 of a chain-- the expected number of collisions that a key has 938 00:50:42,620 --> 00:50:48,720 with another key is the load factor, in the worst case, 939 00:50:48,720 --> 00:50:51,502 but in expectation for a given hash function. 940 00:50:51,502 --> 00:50:53,210 So still, the expected length of a chain, 941 00:50:53,210 --> 00:50:55,400 and therefore, the expected running time 942 00:50:55,400 --> 00:50:58,334 of hashing with chaining, using this hash function, 943 00:50:58,334 --> 00:51:00,750 or this collection of hash functions, or a randomly chosen 944 00:51:00,750 --> 00:51:03,450 one, is constant for constant load factor. 945 00:51:03,450 --> 00:51:05,689 And that's why hashing really works in theory. 946 00:51:05,689 --> 00:51:07,730 We're not going to go into details of this again. 947 00:51:07,730 --> 00:51:09,660 Take 6.046 if you want to know. 948 00:51:09,660 --> 00:51:12,470 But this should make you feel more comfortable. 949 00:51:12,470 --> 00:51:15,490 And we'll see other ways to do hashing next class.
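The linearity-of-expectation step mentioned above can be written out roughly as follows. This is a sketch rather than the proof from the notes: it takes the stated collision probability of 1 over m as given, and it assumes n keys k_1, ..., k_n are stored, with load factor alpha = n/m. The indicator I_j is notation introduced here for illustration.

```latex
% Requires amsmath and amssymb.
% Fix a key k. I_j indicates that k collides with the stored key k_j.
\begin{align*}
I_j &=
  \begin{cases}
    1 & \text{if } h(k) = h(k_j) \\
    0 & \text{otherwise}
  \end{cases} \\
\mathbb{E}\Big[\sum_{j \,:\, k_j \neq k} I_j\Big]
  &= \sum_{j \,:\, k_j \neq k} \mathbb{E}[I_j]
   = \sum_{j \,:\, k_j \neq k} \Pr_{a,b}\big[h(k) = h(k_j)\big] \\
  &\le n \cdot \frac{1}{m} = \alpha
\end{align*}
```

The first equality is linearity of expectation, and it needs no independence between the indicators. The expected number of keys colliding with k is therefore at most the load factor, whichever keys were inserted.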