1
00:00:00,050 --> 00:00:01,770
The following
content is provided

2
00:00:01,770 --> 00:00:04,010
under a Creative
Commons license.

3
00:00:04,010 --> 00:00:06,860
Your support will help MIT
OpenCourseWare continue

4
00:00:06,860 --> 00:00:10,720
to offer high quality
educational resources for free.

5
00:00:10,720 --> 00:00:13,330
To make a donation or
view additional materials

6
00:00:13,330 --> 00:00:17,207
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,207 --> 00:00:17,832
at ocw.mit.edu.

8
00:00:21,835 --> 00:00:22,710
PROFESSOR: All right.

9
00:00:22,710 --> 00:00:24,980
Let's get started.

10
00:00:24,980 --> 00:00:27,730
Today we start a brand
new section of 006,

11
00:00:27,730 --> 00:00:29,620
which is hashing.

12
00:00:29,620 --> 00:00:30,430
Hashing is cool.

13
00:00:30,430 --> 00:00:34,230
It is probably the most used
and common and important

14
00:00:34,230 --> 00:00:36,495
data structure in all
of computer science.

15
00:00:36,495 --> 00:00:41,860
It's in, basically, every system
you've ever used, I think.

16
00:00:41,860 --> 00:00:44,220
And in particular,
it's in Python

17
00:00:44,220 --> 00:00:46,326
as part of what makes
Python fun to program in.

18
00:00:46,326 --> 00:00:49,740
And basically, every modern
programming language has it.

19
00:00:49,740 --> 00:00:53,610
So today is about how to
make it actually happen.

20
00:00:53,610 --> 00:00:54,660
So what is it?

21
00:00:57,230 --> 00:00:58,690
It is usually
called a dictionary.

22
00:01:01,960 --> 00:01:04,280
So this is an
abstract data type, if you

23
00:01:04,280 --> 00:01:07,245
remember that term from
a couple lectures ago.

24
00:01:13,080 --> 00:01:15,450
It's kind of an old term,
not so common anymore,

25
00:01:15,450 --> 00:01:18,340
but it's useful to think about.

26
00:01:18,340 --> 00:01:22,900
So a dictionary is
a data structure,

27
00:01:22,900 --> 00:01:26,900
or it's a thing,
that can store items,

28
00:01:26,900 --> 00:01:29,395
and it can insert items, delete
items and search for items.

29
00:01:35,180 --> 00:01:40,230
So in general, it's going
to be a set of items,

30
00:01:40,230 --> 00:01:41,360
each item has a key.

31
00:01:47,720 --> 00:01:56,080
And you can insert an item,
you can delete an item

32
00:01:56,080 --> 00:02:06,720
from the set, and you can
search for a key, not an item.

33
00:02:06,720 --> 00:02:09,143
And the interesting
part is the search.

34
00:02:09,143 --> 00:02:10,934
I think you know what
insert and delete do.

35
00:02:20,360 --> 00:02:23,015
So there are two outcomes
to this kind of search.

36
00:02:23,015 --> 00:02:25,270
This is what I call
an exact search.

37
00:02:25,270 --> 00:02:29,440
Either you find an item with a
given key, or there isn't one,

38
00:02:29,440 --> 00:02:32,535
and then you just say
key error in Python.

39
00:02:41,950 --> 00:02:42,450
OK.

40
00:02:42,450 --> 00:02:44,149
This is a little
different from what

41
00:02:44,149 --> 00:02:45,690
we could do with
binary search trees.

42
00:02:45,690 --> 00:02:47,690
Binary search trees, if
we didn't find a key,

43
00:02:47,690 --> 00:02:50,900
we could find the next
larger and the next smaller

44
00:02:50,900 --> 00:02:52,661
successor and predecessor.

45
00:02:52,661 --> 00:02:54,660
With dictionaries you're
not allowed to do that,

46
00:02:54,660 --> 00:02:56,150
or you're not able to do that.

47
00:02:56,150 --> 00:02:57,900
And you're just
interested in the question

48
00:02:57,900 --> 00:02:58,930
does the key exist?

49
00:02:58,930 --> 00:03:02,090
And if so, give me the
item with that key.

50
00:03:02,090 --> 00:03:04,990
So we're assuming here that
the items have unique keys,

51
00:03:04,990 --> 00:03:07,362
no two items have the same key.

52
00:03:07,362 --> 00:03:08,820
And one way to
enforce that is when

53
00:03:08,820 --> 00:03:11,430
you insert an item
with an existing key,

54
00:03:11,430 --> 00:03:13,140
it overwrites whatever
key was there.

55
00:03:13,140 --> 00:03:15,070
That's the Python behavior.

56
00:03:15,070 --> 00:03:18,550
So we'll assume that.

57
00:03:18,550 --> 00:03:25,650
Overwrite any existing key.

58
00:03:31,730 --> 00:03:34,480
And so, it's well
defined what search does.

59
00:03:34,480 --> 00:03:36,116
Either there's one
item with that key,

60
00:03:36,116 --> 00:03:37,490
or there's no item
with that key,

61
00:03:37,490 --> 00:03:41,260
and it tells you what
the situation is.

62
00:03:41,260 --> 00:03:41,760
OK.

63
00:03:41,760 --> 00:03:47,710
So one way to solve
dictionaries is

64
00:03:47,710 --> 00:03:51,150
to use a balanced binary
search tree like AVL trees.

65
00:03:51,150 --> 00:03:54,710
And so you can do all of these
operations in log n time.

66
00:04:01,720 --> 00:04:04,220
I mean, you can ignore the
fact that AVL trees give you

67
00:04:04,220 --> 00:04:06,200
more information
when you do a search,

68
00:04:06,200 --> 00:04:08,540
and it still does exact search.

69
00:04:08,540 --> 00:04:12,690
So that's one solution, but it
turns out you can do better.

70
00:04:12,690 --> 00:04:16,110
And while last class was about,
well, in the comparison model

71
00:04:16,110 --> 00:04:20,120
the best way to sort is n log
n and the best way to search

72
00:04:20,120 --> 00:04:21,570
is log n.

73
00:04:21,570 --> 00:04:23,870
Then we saw in the
RAM model, where

74
00:04:23,870 --> 00:04:27,600
if you assume your items are
integers we can sort faster,

75
00:04:27,600 --> 00:04:29,440
sometimes we can
sort in linear time.

76
00:04:29,440 --> 00:04:33,710
Today's lecture is about how to
search faster than log n time.

77
00:04:33,710 --> 00:04:37,680
And we're going to get
down to constant time.

78
00:04:37,680 --> 00:04:41,020
No-- basically, no
assumptions except, maybe,

79
00:04:41,020 --> 00:04:43,110
that your keys are integers.

80
00:04:43,110 --> 00:04:45,449
We'll be able to get
down to constant time

81
00:04:45,449 --> 00:04:46,365
with high probability.

82
00:04:48,890 --> 00:04:51,030
It's going to be a
randomized data structure.

83
00:04:51,030 --> 00:04:53,490
It's one of the few instances
of randomization in 006,

84
00:04:53,490 --> 00:04:56,260
but it'll be pretty simple
to analyze, so don't worry.

85
00:04:56,260 --> 00:04:59,332
But we're going to use
some probability today.

86
00:04:59,332 --> 00:05:00,415
Make it a little exciting.

87
00:05:03,290 --> 00:05:05,470
I think you know how
dictionaries work in Python.

88
00:05:05,470 --> 00:05:11,810
In Python it's the
dict data type.

89
00:05:11,810 --> 00:05:14,600
We've used it all
over the place.

90
00:05:14,600 --> 00:05:17,760
The key things you can
do are look up a key

91
00:05:17,760 --> 00:05:24,100
and-- so this is the
analog of search--

92
00:05:24,100 --> 00:05:27,970
you can set a key to a value.

93
00:05:27,970 --> 00:05:30,960
This is the analog of an insert.

94
00:05:30,960 --> 00:05:33,170
It overwrites
whatever was there.

95
00:05:33,170 --> 00:05:33,810
And what else?

96
00:05:33,810 --> 00:05:34,610
Delete.

97
00:05:34,610 --> 00:05:38,130
So you can delete
a particular key.

98
00:05:42,114 --> 00:05:42,760
OK.

99
00:05:42,760 --> 00:05:44,593
We'll usually use this
notation because it's

100
00:05:44,593 --> 00:05:46,340
more familiar and intuitive.
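
A rough sketch of the notation just described: look up a key, set a key to a value, delete a key. The dictionary D and the key "apple" are made up for illustration; only the three operations come from the lecture.

```python
# Hypothetical example data; the operations mirror search, insert and delete.
D = {}
D["apple"] = 3        # insert: set a key to a value
D["apple"] = 5        # inserting an existing key overwrites what was there
found = D["apple"]    # search: look up a key

del D["apple"]        # delete a particular key

try:
    D["apple"]        # exact search for a key that is no longer present
    missing = False
except KeyError:
    missing = True    # Python reports the failed search as a KeyError
```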

101
00:05:46,340 --> 00:05:48,690
But the big topic today
is how do you actually

102
00:05:48,690 --> 00:05:53,070
implement these operations
for a dictionary, D?

103
00:05:53,070 --> 00:05:56,360
The one specific thing
about Python dictionaries

104
00:05:56,360 --> 00:06:01,465
is that an item is
basically a pair

105
00:06:01,465 --> 00:06:05,380
of two things, a
key and a value.

106
00:06:05,380 --> 00:06:07,410
And so, in particular,
when you call d.items

107
00:06:07,410 --> 00:06:11,280
you get a whole bunch of ordered
pairs, a key and a value.

108
00:06:11,280 --> 00:06:13,220
And so the key is always--
the key of an item

109
00:06:13,220 --> 00:06:15,152
is always this first part.

110
00:06:15,152 --> 00:06:16,135
So it's well defined.

111
00:06:20,035 --> 00:06:20,535
OK.

112
00:06:23,070 --> 00:06:28,120
So that's Python dictionaries.

113
00:06:28,120 --> 00:06:32,530
So one obvious motivation
for building dictionaries

114
00:06:32,530 --> 00:06:34,980
is you need them in Python.

115
00:06:34,980 --> 00:06:37,380
And in fact, people
use them all the time.

116
00:06:37,380 --> 00:06:39,830
We used them in docdist.

117
00:06:39,830 --> 00:06:43,890
All of the fastest versions of
the document distance problem

118
00:06:43,890 --> 00:06:48,080
used dictionaries for counting
words, how many times each word

119
00:06:48,080 --> 00:06:51,470
occurs in a document, and
for computing inner products,

120
00:06:51,470 --> 00:06:54,640
for finding common words
between two documents.

121
00:06:54,640 --> 00:06:57,035
And it's just the
best way to do things,

122
00:06:57,035 --> 00:07:00,467
it's the easiest way to do
things, and the fastest.

123
00:07:00,467 --> 00:07:02,800
As a result, dictionaries are
built into basically every

124
00:07:02,800 --> 00:07:06,980
modern programming language,
Python, Perl, Ruby, JavaScript,

125
00:07:06,980 --> 00:07:08,110
Java, C++, C#.

126
00:07:08,110 --> 00:07:10,970
In modern versions, all have
some version of dictionaries.

127
00:07:10,970 --> 00:07:13,790
And they all run in,
basically, constant time

128
00:07:13,790 --> 00:07:16,615
using the stuff that's in
this lecture and the next two

129
00:07:16,615 --> 00:07:17,115
lectures.

130
00:07:20,130 --> 00:07:21,300
Let's see.

131
00:07:21,300 --> 00:07:24,085
It's also, in, basically,
every database.

132
00:07:26,894 --> 00:07:29,310
There are essentially two kinds
of databases in the world,

133
00:07:29,310 --> 00:07:30,684
there are those
that use hashing,

134
00:07:30,684 --> 00:07:32,800
and there are those
that use search trees.

135
00:07:32,800 --> 00:07:33,760
Sometimes you need one.

136
00:07:33,760 --> 00:07:35,095
Sometimes you need the other.

137
00:07:35,095 --> 00:07:37,470
There are a lot of situations
in databases where you just

138
00:07:37,470 --> 00:07:39,082
need hashing.

139
00:07:39,082 --> 00:07:40,540
So if you've ever
used Berkeley DB,

140
00:07:40,540 --> 00:07:44,450
there's a hash
type of a database.

141
00:07:44,450 --> 00:07:48,460
So for things like, when
you go to Merriam-Webster,

142
00:07:48,460 --> 00:07:51,200
and you look up a
word, how do you

143
00:07:51,200 --> 00:07:53,860
find the definition
of that word?

144
00:07:53,860 --> 00:07:58,090
You use a hash table, you use
a dictionary, I should say.

145
00:07:58,090 --> 00:08:02,100
How do you-- when you
spell check your document,

146
00:08:02,100 --> 00:08:04,360
how do you tell whether a
word is correctly spelled?

147
00:08:04,360 --> 00:08:05,794
You look it up in a dictionary.

148
00:08:05,794 --> 00:08:07,210
If it's not correctly
spelled, how

149
00:08:07,210 --> 00:08:11,520
do you find the closest
related, correct spelling?

150
00:08:11,520 --> 00:08:12,895
You try tweaking
one of the letters,

151
00:08:12,895 --> 00:08:15,103
and look it up in a dictionary
and see if it's there.

152
00:08:15,103 --> 00:08:17,600
You do that for all possible
letters, or maybe two letters.

153
00:08:17,600 --> 00:08:21,899
That is a state of the art
way to do spelling correction.

154
00:08:21,899 --> 00:08:23,440
Just keep looking
up in a dictionary.

155
00:08:23,440 --> 00:08:25,446
Because dictionaries
are so fast you

156
00:08:25,446 --> 00:08:27,945
can afford to do things like
trial perturbations of letters.
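
The one-letter tweaking idea might look roughly like this. It is a minimal sketch, not how any particular spell checker works: the function name one_letter_corrections and the tiny word list are assumptions, and only lowercase ASCII substitutions are tried.

```python
import string

def one_letter_corrections(word, dictionary):
    """Return the dictionary words reachable by changing one letter of word."""
    candidates = set()
    for i in range(len(word)):
        for letter in string.ascii_lowercase:
            tweaked = word[:i] + letter + word[i + 1:]
            # Each trial costs one dictionary lookup.
            if tweaked != word and tweaked in dictionary:
                candidates.add(tweaked)
    return candidates

# Hypothetical word list, stored as dictionary keys.
words = {"hash": None, "cash": None, "hush": None, "tree": None}
suggestions = one_letter_corrections("hesh", words)
```

For a word of length k this does 26 * k lookups, which is only affordable because each lookup is fast.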

157
00:08:30,820 --> 00:08:32,039
What else.

158
00:08:32,039 --> 00:08:34,770
In the old days, which
means pre-Google,

159
00:08:34,770 --> 00:08:38,030
every search engine
on the web would

160
00:08:38,030 --> 00:08:41,260
have a dictionary that
says, for a given word,

161
00:08:41,260 --> 00:08:44,120
give me all of the documents
containing that word.

162
00:08:44,120 --> 00:08:48,760
Google doesn't do it that
way, but that's another story.

163
00:08:48,760 --> 00:08:50,870
It's less fancy, actually.

164
00:08:50,870 --> 00:08:52,960
Or when you log
into a system, you

165
00:08:52,960 --> 00:08:54,940
type your username and password.

166
00:08:54,940 --> 00:08:57,762
You look in a dictionary
that stores a username

167
00:08:57,762 --> 00:08:59,220
and, associated
with that username,

168
00:08:59,220 --> 00:09:00,957
all the information
of that user.

169
00:09:00,957 --> 00:09:03,040
Every time you log into a
web system, or whatever,

170
00:09:03,040 --> 00:09:05,440
it is going through
a dictionary.

171
00:09:05,440 --> 00:09:07,520
So they're all over the place.

172
00:09:07,520 --> 00:09:09,330
One of the original
applications is

173
00:09:09,330 --> 00:09:11,212
in writing
programming languages.

174
00:09:11,212 --> 00:09:12,670
Some of the first
computer programs

175
00:09:12,670 --> 00:09:15,076
were programming languages,
so you could actually

176
00:09:15,076 --> 00:09:16,450
program them in
a reasonable way.

177
00:09:21,860 --> 00:09:25,497
Whenever you type a variable
name the computer doesn't

178
00:09:25,497 --> 00:09:27,080
really think about
that variable name,

179
00:09:27,080 --> 00:09:29,500
it wants to think about
an address in memory.

180
00:09:29,500 --> 00:09:31,820
And so you've got to
translate that variable name

181
00:09:31,820 --> 00:09:36,340
into a real, physical address
in the machine, or a position

182
00:09:36,340 --> 00:09:39,950
on the stack, or whatever
it is in real life.

183
00:09:39,950 --> 00:09:41,950
In the old days
of Python, I guess

184
00:09:41,950 --> 00:09:45,550
this is pre-Python
2 or so, 2.1, I

185
00:09:45,550 --> 00:09:48,390
don't remember the
exact transition it was.

186
00:09:48,390 --> 00:09:50,759
In the interpreter,
there was the dictionary

187
00:09:50,759 --> 00:09:52,300
of all your global
variables, there's

188
00:09:52,300 --> 00:09:54,420
a dictionary of all
your local variables.

189
00:09:54,420 --> 00:09:58,686
And that was-- it
was right there.

190
00:09:58,686 --> 00:10:00,310
I mean you could
modify the dictionary,

191
00:10:00,310 --> 00:10:01,393
you could do crazy things.

192
00:10:01,393 --> 00:10:03,520
And all the
variables were there.

193
00:10:03,520 --> 00:10:06,050
And so they'd match the
key to the actual value

194
00:10:06,050 --> 00:10:07,520
stored in the variable.

195
00:10:07,520 --> 00:10:09,920
They don't do that anymore
because it's a little slow,

196
00:10:09,920 --> 00:10:12,019
but-- and you could
do better in practice.

197
00:10:12,019 --> 00:10:14,310
But at the very least, when
you're compiling the thing,

198
00:10:14,310 --> 00:10:16,070
you need a dictionary.

199
00:10:16,070 --> 00:10:20,120
And then, later on, you can
do more efficient lookups.

200
00:10:20,120 --> 00:10:20,660
Let's see.

201
00:10:23,580 --> 00:10:26,490
On the internet there
are hash tables all over,

202
00:10:26,490 --> 00:10:28,730
like in your router.

203
00:10:28,730 --> 00:10:30,480
Router needs to know
all the machines that

204
00:10:30,480 --> 00:10:31,313
are connected to it.

205
00:10:31,313 --> 00:10:33,875
Each machine has an IP address,
so when you get a packet in,

206
00:10:33,875 --> 00:10:36,040
and it says, deliver to
this IP address, you see,

207
00:10:36,040 --> 00:10:38,060
oh, is it in my dictionary
of all the machines

208
00:10:38,060 --> 00:10:39,476
that are directly
connected to me?

209
00:10:39,476 --> 00:10:40,949
If so, send it there.

210
00:10:40,949 --> 00:10:42,990
If it's not then it has
to find the right subnet.

211
00:10:42,990 --> 00:10:44,573
That's not quite a
dictionary problem,

212
00:10:44,573 --> 00:10:45,820
a little more complicated.

213
00:10:45,820 --> 00:10:49,310
But for looking up local
machines, it's a dictionary.

214
00:10:49,310 --> 00:10:51,930
Routers use dictionaries because
they need to go really fast.

215
00:10:51,930 --> 00:10:55,450
They're getting a billion
packets every second.

216
00:10:55,450 --> 00:10:59,190
Also, in the network
stack of a machine,

217
00:10:59,190 --> 00:11:01,940
when you come in you
get a packet delivered

218
00:11:01,940 --> 00:11:04,980
to a particular port, you need
to say, oh, which application,

219
00:11:04,980 --> 00:11:06,880
or which socket is
connected to this port?

220
00:11:06,880 --> 00:11:08,630
All of these things
are dictionaries.

221
00:11:08,630 --> 00:11:10,088
The point is they're
in, basically,

222
00:11:10,088 --> 00:11:12,690
everything you've ever
used, virtual memory,

223
00:11:12,690 --> 00:11:15,060
I mean, they're
all over the place.

224
00:11:15,060 --> 00:11:16,935
There are also some more
subtle applications,

225
00:11:16,935 --> 00:11:18,893
where it is not obvious
that it's a dictionary,

226
00:11:18,893 --> 00:11:20,650
but still, we use
this idea of hashing

227
00:11:20,650 --> 00:11:22,230
we're going to talk about today.

228
00:11:22,230 --> 00:11:25,700
Like searching in a string.

229
00:11:30,350 --> 00:11:34,810
So when you hit-- I don't
know-- in your favorite editor,

230
00:11:34,810 --> 00:11:36,850
you do Control-F, or
Control-S, or slash,

231
00:11:36,850 --> 00:11:39,000
or whatever your way of
searching for something

232
00:11:39,000 --> 00:11:41,530
is, and you start typing.

233
00:11:41,530 --> 00:11:43,930
If your editor is
clever, it will

234
00:11:43,930 --> 00:11:46,260
use hashing in order to
search for that string.

235
00:11:46,260 --> 00:11:49,770
It's a faster way to do it.

236
00:11:49,770 --> 00:11:54,410
If you use grep, for example, in
Unix it does it in a fancy way.

237
00:11:54,410 --> 00:11:56,240
Every time you do
a Google search

238
00:11:56,240 --> 00:11:58,150
it's essentially using this.

239
00:11:58,150 --> 00:11:59,310
It's solving this problem.

240
00:11:59,310 --> 00:12:01,330
I don't know what algorithm,
but we could guess.

241
00:12:01,330 --> 00:12:04,090
Using the algorithms we'll
cover in the next lecture.

242
00:12:04,090 --> 00:12:06,650
It wouldn't surprise me.

243
00:12:06,650 --> 00:12:08,660
Also, if you have
a couple strings

244
00:12:08,660 --> 00:12:14,540
and you want to know what they
have in common, how similar

245
00:12:14,540 --> 00:12:15,384
they are?

246
00:12:15,384 --> 00:12:16,800
For example, you have
two DNA strings.

247
00:12:16,800 --> 00:12:20,480
You want to see how similar
they are, you use hashing.

248
00:12:20,480 --> 00:12:23,830
And you're going to do that
in the next problem set, PS4,

249
00:12:23,830 --> 00:12:27,000
which goes out on Thursday.

250
00:12:27,000 --> 00:12:31,990
Also, for things like file
and directory synchronization.

251
00:12:38,870 --> 00:12:42,770
So on Unix, if you rsync
or unison, or, I guess,

252
00:12:42,770 --> 00:12:46,740
modern day-- these
days, Dropbox, MIT

253
00:12:46,740 --> 00:12:49,580
startup-- Whenever you're
synchronizing files between two

254
00:12:49,580 --> 00:12:51,010
locations, you use
hashing to tell

255
00:12:51,010 --> 00:12:53,260
whether a file has changed,
or whether a directory has

256
00:12:53,260 --> 00:12:53,920
changed.

257
00:12:53,920 --> 00:12:56,940
That's a big idea.

258
00:12:56,940 --> 00:12:59,940
Fairly modern idea.

259
00:12:59,940 --> 00:13:02,210
And also in
cryptography-- this will

260
00:13:02,210 --> 00:13:07,480
be a topic of next
Tuesday's lecture.

261
00:13:07,480 --> 00:13:09,520
If you're transferring
a file and you

262
00:13:09,520 --> 00:13:12,070
want to check that you
actually transferred that file,

263
00:13:12,070 --> 00:13:15,672
and there wasn't some person in
the middle corrupting your file

264
00:13:15,672 --> 00:13:18,005
and making it look like it
was what you wanted it to be,

265
00:13:18,005 --> 00:13:21,420
you use something called
cryptographic hash functions,

266
00:13:21,420 --> 00:13:24,420
which [INAUDIBLE] will
talk about on Tuesday.

267
00:13:24,420 --> 00:13:27,230
So tons of motivation
for dictionaries.

268
00:13:27,230 --> 00:13:32,840
Let's actually do it,
see how they are done.

269
00:13:35,630 --> 00:13:40,990
We're going to start with sort
of a very simple straw man,

270
00:13:40,990 --> 00:13:44,089
and then we're going to improve
it until, by the end of today,

271
00:13:44,089 --> 00:13:46,130
we have a really good way
to solve the dictionary

272
00:13:46,130 --> 00:13:48,645
problem in constant
time per operation.

273
00:13:54,190 --> 00:13:56,980
So the really simple approach
is called a direct access table.

274
00:14:00,230 --> 00:14:05,450
So it's just a big
table, an array.

275
00:14:05,450 --> 00:14:14,340
You have-- the index into
the array is the key.

276
00:14:14,340 --> 00:14:27,530
So, store items in an
array, indexed by key.

277
00:14:31,990 --> 00:14:34,200
And in fact, Python kind of
makes you think about this

278
00:14:34,200 --> 00:14:36,520
because the Python notation
for accessing dictionaries

279
00:14:36,520 --> 00:14:40,120
is identical to the notation
for accessing arrays.

280
00:14:40,120 --> 00:14:41,810
But with arrays, the
keys are restricted

281
00:14:41,810 --> 00:14:45,157
to be non-negative integers,
0 through n minus 1.

282
00:14:45,157 --> 00:14:46,740
So why not just
implement it that way?

283
00:14:46,740 --> 00:14:49,140
If your keys happen
to be integers

284
00:14:49,140 --> 00:14:52,410
I could just store all my
items in a giant array.

285
00:14:52,410 --> 00:14:56,385
So if I just want to store
an item here with key 2,

286
00:14:56,385 --> 00:15:00,130
call that, maybe, item
2, I just put that there.

287
00:15:00,130 --> 00:15:03,010
If I want to store
something with key 4

288
00:15:03,010 --> 00:15:04,560
I'll just put it there.

289
00:15:04,560 --> 00:15:07,950
Everything else is going to
be null, or none, or whatever.

290
00:15:07,950 --> 00:15:09,290
So lots of blank entries.

291
00:15:09,290 --> 00:15:13,380
Whatever keys I don't use I'll
just put a null value there.

292
00:15:13,380 --> 00:15:16,000
Every key that I want to
put into the dictionary

293
00:15:16,000 --> 00:15:19,620
I'll just store it at the
corresponding position.
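
A direct access table as described could be sketched like this. The class name, the universe_size parameter and the sample items are assumptions for illustration; keys are assumed to be integers from 0 through universe_size - 1.

```python
class DirectAccessTable:
    """Sketch: items stored in an array, indexed directly by integer key."""

    def __init__(self, universe_size):
        # One slot per possible key; unused slots hold None.
        self.slots = [None] * universe_size

    def insert(self, key, value):
        self.slots[key] = (key, value)   # overwrites any existing item

    def delete(self, key):
        self.slots[key] = None

    def search(self, key):
        item = self.slots[key]
        if item is None:
            raise KeyError(key)          # no item with that key
        return item

table = DirectAccessTable(8)
table.insert(2, "item 2")
table.insert(4, "item 4")
```

Each operation touches a single slot, but the array needs one slot for every possible key, whether or not it is used.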

294
00:15:19,620 --> 00:15:20,720
What's bad about this?

295
00:15:25,060 --> 00:15:25,560
Yeah.

296
00:15:25,560 --> 00:15:28,379
AUDIENCE: It's hard to associate
something with just an integer.

297
00:15:28,379 --> 00:15:30,670
PROFESSOR: Hard to associate
something with an integer.

298
00:15:30,670 --> 00:15:31,170
Good.

299
00:15:31,170 --> 00:15:33,360
That's one problem.

300
00:15:33,360 --> 00:15:36,100
There's actually two big
problems with this structure.

301
00:15:36,100 --> 00:15:37,580
I want both of them.

302
00:15:37,580 --> 00:15:48,040
So bad-- badness number one
is keys may not be integers.

303
00:16:00,021 --> 00:16:00,520
Good.

304
00:16:03,070 --> 00:16:04,754
Another problem.

305
00:16:04,754 --> 00:16:05,254
Yeah.

306
00:16:05,254 --> 00:16:06,750
AUDIENCE: Possibility
of collision.

307
00:16:06,750 --> 00:16:08,249
PROFESSOR: Possibility
of collision.

308
00:16:08,249 --> 00:16:09,540
So here there's no collisions.

309
00:16:09,540 --> 00:16:11,040
We'll get to
collisions in a moment,

310
00:16:11,040 --> 00:16:13,020
but a collision
is when two items

311
00:16:13,020 --> 00:16:16,330
go to the same
slot in this table.

312
00:16:16,330 --> 00:16:19,180
And we defined the problem
so there weren't collisions.

313
00:16:19,180 --> 00:16:21,390
We said whenever we insert
an item with the same key you

314
00:16:21,390 --> 00:16:22,710
overwrite whatever is there.

315
00:16:22,710 --> 00:16:23,630
So collisions are OK.

316
00:16:23,630 --> 00:16:26,040
They will be a problem in a
moment, so save your answer.

317
00:16:26,040 --> 00:16:26,540
Yeah?

318
00:16:26,540 --> 00:16:27,415
AUDIENCE: [INAUDIBLE]

319
00:16:29,511 --> 00:16:30,510
PROFESSOR: Running time?

320
00:16:30,510 --> 00:16:32,070
AUDIENCE: [INAUDIBLE]

321
00:16:32,070 --> 00:16:33,200
PROFESSOR: For deletion?

322
00:16:33,200 --> 00:16:35,033
Actually, running time
is going to be great.

323
00:16:35,033 --> 00:16:37,860
If I want to insert-- I
mean, I do these operations

324
00:16:37,860 --> 00:16:39,900
but on an array instead
of a dictionary.

325
00:16:39,900 --> 00:16:42,430
So if I want to insert I
just put something there.

326
00:16:42,430 --> 00:16:44,480
If I want to delete I
just set it to null.

327
00:16:44,480 --> 00:16:46,950
If I want to search I just
go there and see is it null?

328
00:16:46,950 --> 00:16:47,601
Yeah?

329
00:16:47,601 --> 00:16:49,100
AUDIENCE: It's a
gigantic memory hog.

330
00:16:49,100 --> 00:16:50,600
PROFESSOR: It's
a gigantic memory hog.

331
00:16:50,600 --> 00:16:51,835
I like that phrasing.

332
00:16:57,750 --> 00:16:58,920
Not always of course.

333
00:16:58,920 --> 00:17:03,100
If it happens that your keys
are-- the set of possible keys

334
00:17:03,100 --> 00:17:06,470
is not too giant
then life is good.

335
00:17:06,470 --> 00:17:08,593
Let's see if I cannot
kill somebody today.

336
00:17:08,593 --> 00:17:09,859
Oh yes.

337
00:17:09,859 --> 00:17:11,650
Very good.

338
00:17:11,650 --> 00:17:13,490
But if you have a
lot of keys, you

339
00:17:13,490 --> 00:17:17,849
need one slot in
your array per key.

340
00:17:17,849 --> 00:17:19,290
That could be a lot.

341
00:17:19,290 --> 00:17:23,920
Maybe your keys are
64-bit integers.

342
00:17:23,920 --> 00:17:28,089
Then you need 2^64 slots just
to store one measly dictionary.

343
00:17:28,089 --> 00:17:30,210
That's huge.

344
00:17:30,210 --> 00:17:33,000
I guess there's also the
running time of initializing that.

345
00:17:33,000 --> 00:17:35,530
But at the very least,
you have a huge space hog.

346
00:17:35,530 --> 00:17:37,410
This is bad.

347
00:17:37,410 --> 00:17:40,820
So we're going to fix both of
these problems one at a time.

348
00:17:40,820 --> 00:17:43,560
First problem we're
going to talk about

349
00:17:43,560 --> 00:17:45,991
is what if your keys
aren't integers?

350
00:17:45,991 --> 00:17:47,490
Because if your
keys aren't integers

351
00:17:47,490 --> 00:17:48,430
you can't use this at all.

352
00:17:48,430 --> 00:17:50,179
So let's at least get
something that works.

353
00:17:58,620 --> 00:18:00,410
And this is a notion
called prehashing.

354
00:18:03,157 --> 00:18:05,240
I guess different people
call it different things.

355
00:18:05,240 --> 00:18:07,800
Unfortunately Python
calls it hash.

356
00:18:07,800 --> 00:18:11,710
It's not hashing,
it's prehashing.

357
00:18:11,710 --> 00:18:13,960
Emphasize the "pre" here.

358
00:18:13,960 --> 00:18:19,250
So prehash function
maps whatever keys

359
00:18:19,250 --> 00:18:23,106
you have to
non-negative integers.

360
00:18:28,314 --> 00:18:30,230
At this point we're not
worrying about how big

361
00:18:30,230 --> 00:18:31,021
those integers are.

362
00:18:31,021 --> 00:18:32,270
They could be giant.

363
00:18:32,270 --> 00:18:34,920
We're not going to fix the
second problem til later.

364
00:18:34,920 --> 00:18:37,810
First problem is if I have
some key, maybe it's a string,

365
00:18:37,810 --> 00:18:42,682
it's whatever, it's an object,
how do I map it to some integer

366
00:18:42,682 --> 00:18:44,390
so I could, at least
in principle, put it

367
00:18:44,390 --> 00:18:48,052
in a direct access table.

368
00:18:48,052 --> 00:18:50,010
There's a theoretical
answer to how to do this,

369
00:18:50,010 --> 00:18:52,560
and then there's the practical
answer to how to do this.

370
00:18:52,560 --> 00:18:55,710
I'll start with
the mathematical.

371
00:18:55,710 --> 00:19:04,725
In theory, I like this, keys
are finite and discrete.

372
00:19:08,011 --> 00:19:08,510
OK.

373
00:19:08,510 --> 00:19:10,580
We know that anything
on the computer

374
00:19:10,580 --> 00:19:13,590
could, ultimately, be written
down as a string of bits.

375
00:19:13,590 --> 00:19:16,405
So a string of bits
represents an integer.

376
00:19:16,405 --> 00:19:17,540
So we're done.

377
00:19:24,160 --> 00:19:27,840
So in theory, this is easy.

378
00:19:27,840 --> 00:19:30,211
And we're going to
assume in this class,

379
00:19:30,211 --> 00:19:31,710
because it's sort
of a theory class,

380
00:19:31,710 --> 00:19:33,202
that this is what's happening.

381
00:19:33,202 --> 00:19:34,660
At least for
analysis, we're always

382
00:19:34,660 --> 00:19:37,040
going to analyze things as
if this is what's happening.

383
00:19:37,040 --> 00:19:39,070
Now in reality, people
don't always do this.

384
00:19:39,070 --> 00:19:44,060
In particular-- I'll
go somewhere else.

385
00:19:44,060 --> 00:20:05,817
In Python it's not
quite so simple,

386
00:20:05,817 --> 00:20:07,650
but at least you get
to see what's going on.

387
00:20:07,650 --> 00:20:10,940
There's a function called hash,
which should be called prehash,

388
00:20:10,940 --> 00:20:13,990
and it, given an
object, it produces

389
00:20:13,990 --> 00:20:16,580
a non-- I'm not sure,
actually, if it's non-negative.

390
00:20:16,580 --> 00:20:19,720
It's not a big deal if it has
a minus sign because then you

391
00:20:19,720 --> 00:20:21,770
could just use this and
get rid of the sign.

392
00:20:21,770 --> 00:20:24,590
But it maps every
object to an integer,

393
00:20:24,590 --> 00:20:27,217
or every hashable
object, technically.

394
00:20:27,217 --> 00:20:28,800
But pretty much
anything can be mapped

395
00:20:28,800 --> 00:20:31,350
to an integer, one
way or another.

396
00:20:31,350 --> 00:20:33,350
And so for example, if
you give it an integer

397
00:20:33,350 --> 00:20:35,040
it just returns the integer.

398
00:20:35,040 --> 00:20:36,220
So that's pretty easy.

399
00:20:36,220 --> 00:20:39,300
If you give it a string
it does something.

400
00:20:39,300 --> 00:20:40,730
I don't know exactly
what it does,

401
00:20:40,730 --> 00:20:41,813
but there are some issues.

402
00:20:41,813 --> 00:20:51,668
For example, hash of
this string, backslash 0B

403
00:20:51,668 --> 00:21:02,617
is equal to the hash of
backslash 0 backslash 0C, namely 64.

404
00:21:02,617 --> 00:21:04,450
It's a little tricky
to find these examples,

405
00:21:04,450 --> 00:21:06,140
but they're out there.

406
00:21:06,140 --> 00:21:08,390
And I guess, this is
probably the lowest one

407
00:21:08,390 --> 00:21:10,640
in a certain measure.

408
00:21:10,640 --> 00:21:12,462
So it's a concern.

409
00:21:12,462 --> 00:21:14,670
In practice you have to be
careful about these things

410
00:21:14,670 --> 00:21:17,540
because what you'd
like-- in an ideal world,

411
00:21:17,540 --> 00:21:25,980
and in the theoretical world--
this prehash function of x,

412
00:21:25,980 --> 00:21:27,820
if it equals the
prehash function of y,

413
00:21:27,820 --> 00:21:31,380
this should only
happen when x=y,

414
00:21:31,380 --> 00:21:32,630
when they're the same thing.

415
00:21:35,450 --> 00:21:40,100
And in the equals-equals sense, I guess,
would be the technical version.

416
00:21:40,100 --> 00:21:42,830
Sadly, in Python this
is not quite true.

417
00:21:42,830 --> 00:21:43,960
But mostly true.

418
00:21:48,030 --> 00:21:50,420
Let's see.

419
00:21:50,420 --> 00:21:53,460
If you define a custom
object, you may know this,

420
00:21:53,460 --> 00:21:58,020
there is an __hash__
method you can implement,

421
00:21:58,020 --> 00:22:01,740
which tells Python what
to do when you call hash

422
00:22:01,740 --> 00:22:02,480
of your object.

423
00:22:02,480 --> 00:22:05,380
If you don't, it
uses the default

424
00:22:05,380 --> 00:22:08,060
of id, which is the
physical location

425
00:22:08,060 --> 00:22:09,159
of your object in memory.

426
00:22:09,159 --> 00:22:11,450
So as long as your object
isn't moving around in memory

427
00:22:11,450 --> 00:22:13,130
this is a pretty
good hash function

428
00:22:13,130 --> 00:22:17,850
because no two items occupy
the same space in memory.

429
00:22:17,850 --> 00:22:21,430
So that's just implementation
side of things.
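
A small sketch of these points, assuming CPython: hash of a small integer is that integer, an object with no __hash__ gets an identity-based default, and a class can define __hash__ (together with __eq__) itself. The class names Plain and Named are hypothetical.

```python
class Plain:
    """No __hash__ or __eq__: Python falls back to an identity-based default."""
    def __init__(self, name):
        self.name = name

class Named:
    """Defines __hash__ and __eq__ from a field that is never changed."""
    def __init__(self, name):
        self._name = name

    def __eq__(self, other):
        return isinstance(other, Named) and self._name == other._name

    def __hash__(self):
        return hash(self._name)

small_int_hash = hash(42)                 # small ints prehash to themselves

plain_table = {Plain("x"): 1}
plain_found = Plain("x") in plain_table   # a different object, so not found

named_table = {Named("x"): 1}
named_found = Named("x") in named_table   # equal content, equal prehash
```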

430
00:22:21,430 --> 00:22:28,010
Other implementation side
of things is in Python,

431
00:22:28,010 --> 00:22:31,070
well, there's this distinction
between objects and keys,

432
00:22:31,070 --> 00:22:32,020
I guess you would say.

433
00:22:32,020 --> 00:22:33,980
You really don't want
this prehash function

434
00:22:33,980 --> 00:22:36,370
to change value.

435
00:22:36,370 --> 00:22:38,710
In, say, a direct access
table, if you store--

436
00:22:38,710 --> 00:22:41,260
you take an item, you
compute the prehash function

437
00:22:41,260 --> 00:22:45,390
of the key in there, and you
throw it in, and it says,

438
00:22:45,390 --> 00:22:47,615
oh, prehash value is four.

439
00:22:47,615 --> 00:22:48,990
Then you put it
in position four.

440
00:22:48,990 --> 00:22:52,280
If that value changes, then when
you go to search for that key,

441
00:22:52,280 --> 00:22:54,780
and you call prehash of that
thing, and if it gives you five,

442
00:22:54,780 --> 00:22:57,570
you look in position five, and
you say, oh, it's not there.

443
00:22:57,570 --> 00:23:00,070
So prehash really
should not change.

444
00:23:00,070 --> 00:23:03,140
If you ever implement this
function don't mess with it.

445
00:23:03,140 --> 00:23:05,260
I mean, make sure it's
defined in such a way

446
00:23:05,260 --> 00:23:06,970
that it doesn't
change over time.

447
00:23:06,970 --> 00:23:10,622
Otherwise, you won't be able to
find your items in the table.

448
00:23:10,622 --> 00:23:12,080
Python can't protect
you from that.

449
00:23:15,320 --> 00:23:17,800
This is why, for example,
if you have a list,

450
00:23:17,800 --> 00:23:20,530
which is a mutable object, you
cannot put it into a hash table

451
00:23:20,530 --> 00:23:25,370
as a key value because it
would change over time.

452
00:23:25,370 --> 00:23:29,740
Potentially, you'd append
to the list, or whatever.
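
A minimal sketch of the point above, assuming a hypothetical Point class that is not from the lecture: the hash is computed only from fields that are never mutated, so it cannot change over time, while a list is rejected as a key outright.

```python
class Point:
    """Hypothetical key type whose hash depends only on fields that never change."""

    def __init__(self, x, y):
        self._x = x
        self._y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self._x, self._y) == (other._x, other._y)

    def __hash__(self):
        # Hash the same fields that __eq__ compares, and never mutate them.
        return hash((self._x, self._y))


table = {Point(1, 2): "item"}
found = table[Point(1, 2)]  # equal keys hash equally, so the lookup succeeds

try:
    {[1, 2]: "item"}  # lists are mutable, so Python refuses to hash them
    list_is_hashable = True
except TypeError:
    list_is_hashable = False
```

If Point allowed _x or _y to be reassigned after insertion, the lookup could land in the wrong slot, which is the failure described above.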

453
00:23:29,740 --> 00:23:31,680
All right.

454
00:23:31,680 --> 00:23:33,920
So hopefully you're
reasonably happy with this.

455
00:23:33,920 --> 00:23:34,990
You could also
think of it as we're

456
00:23:34,990 --> 00:23:36,948
going to assume keys are
non-negative integers.

457
00:23:36,948 --> 00:23:38,755
But in practice,
anything you have you

458
00:23:38,755 --> 00:23:42,770
can map to an integer,
one way or another.

459
00:23:42,770 --> 00:23:44,380
The bigger problem
in a certain sense,

460
00:23:44,380 --> 00:23:48,780
or the more interesting
problem is reducing space.

461
00:23:48,780 --> 00:23:49,860
So how do we do that?

462
00:23:58,420 --> 00:23:59,740
This would be hashing.

463
00:24:03,880 --> 00:24:06,840
This is sort of the magic
part of today's lecture.

464
00:24:06,840 --> 00:24:09,200
In case you're
wondering, hashing

465
00:24:09,200 --> 00:24:12,010
has nothing to do with hashish.

466
00:24:12,010 --> 00:24:17,610
Hashish is an Arabic root word
unrelated to the Germanic,

467
00:24:17,610 --> 00:24:20,220
which is hachet, I believe.

468
00:24:20,220 --> 00:24:20,900
Yeah.

469
00:24:20,900 --> 00:24:23,340
Or hacheh-- I guess,
something like that.

470
00:24:23,340 --> 00:24:24,530
I'm not very good at German.

471
00:24:24,530 --> 00:24:25,910
Which means hatchet.

472
00:24:25,910 --> 00:24:26,410
OK.

473
00:24:26,410 --> 00:24:28,400
It's like you take your
key, and you cut it up

474
00:24:28,400 --> 00:24:31,060
into little pieces, and you mix
them around and cut and dice,

475
00:24:31,060 --> 00:24:32,570
and it's like cooking.

476
00:24:32,570 --> 00:24:33,511
OK.

477
00:24:33,511 --> 00:24:34,010
What?

478
00:24:34,010 --> 00:24:34,900
AUDIENCE: Hash browns.

479
00:24:34,900 --> 00:24:36,400
PROFESSOR: Hash
browns, for example.

480
00:24:36,400 --> 00:24:38,281
Yeah, same root.

481
00:24:38,281 --> 00:24:38,780
OK.

482
00:24:38,780 --> 00:24:41,611
It's like the only two English
words with that kind of hash.

483
00:24:41,611 --> 00:24:42,110
OK.

484
00:24:42,110 --> 00:24:45,130
In our case, it's
a verb, to hash.

485
00:24:45,130 --> 00:24:47,960
It means to cut into
pieces and mix around.

486
00:24:47,960 --> 00:24:48,460
OK.

487
00:24:48,460 --> 00:24:51,130
That won't really be clear
until towards the end of today's

488
00:24:51,130 --> 00:24:52,600
lecture, but we
will eventually get

489
00:24:52,600 --> 00:24:55,140
to the etymology of hashing.

490
00:24:55,140 --> 00:24:58,060
Or, we've got the etymology,
but why it's, actually,

491
00:24:58,060 --> 00:24:59,820
why we use that term.

492
00:24:59,820 --> 00:25:00,370
All right.

493
00:25:00,370 --> 00:25:10,860
So the big idea is we
take all possible keys

494
00:25:10,860 --> 00:25:13,975
and we want to reduce them
down to some small, small set

495
00:25:13,975 --> 00:25:14,475
of integers.

496
00:25:43,700 --> 00:25:45,810
Let me draw a picture of that.

497
00:25:55,640 --> 00:26:01,765
So we have this giant
space of all possible keys.

498
00:26:01,765 --> 00:26:03,060
We'll call this key space.

499
00:26:06,080 --> 00:26:08,230
It's like outer
space, basically.

500
00:26:08,230 --> 00:26:10,790
It's giant.

501
00:26:10,790 --> 00:26:12,670
And if we stored a
direct access table,

502
00:26:12,670 --> 00:26:13,730
this would also be giant.

503
00:26:13,730 --> 00:26:16,530
And we don't want to do that.

504
00:26:16,530 --> 00:26:21,500
We'd like to somehow map
using a hash function h down

505
00:26:21,500 --> 00:26:22,800
to some smaller set.

506
00:26:22,800 --> 00:26:25,310
How do I want to draw this?

507
00:26:25,310 --> 00:26:25,980
Like an array.

508
00:26:30,820 --> 00:26:36,960
So we're going to have possible
values 0 up to m minus 1.

509
00:26:36,960 --> 00:26:38,340
m is a new thing.

510
00:26:38,340 --> 00:26:40,400
It's going to be the
size of our hash table.

511
00:26:40,400 --> 00:26:41,630
Let's call the hash table.

512
00:26:45,107 --> 00:26:48,200
I think we'll call it t also.

513
00:26:48,200 --> 00:26:51,230
And we'd somehow like to map--

514
00:26:51,230 --> 00:26:51,730
All right.

515
00:26:51,730 --> 00:26:54,800
So there's a giant space
of all possible keys,

516
00:26:54,800 --> 00:26:57,900
but then there's a subset
of keys that are actually

517
00:26:57,900 --> 00:27:03,310
stored in this set,
in this dictionary.

518
00:27:03,310 --> 00:27:05,160
At any moment in
time there's some set

519
00:27:05,160 --> 00:27:07,730
of keys that are present.

520
00:27:07,730 --> 00:27:10,290
That set changes,
but at any moment

521
00:27:10,290 --> 00:27:12,780
there's some keys that
are actually there.

522
00:27:12,780 --> 00:27:17,180
k1, k2, k3, k4.

523
00:27:17,180 --> 00:27:20,890
I'd like to map them to
positions in this table.

524
00:27:20,890 --> 00:27:26,590
So maybe I store k2-- or
actually, item 2 would go here.

525
00:27:26,590 --> 00:27:34,000
In particular, this is when
h of k2, if it equals zero,

526
00:27:34,000 --> 00:27:36,000
then you'd put item 2 there.

527
00:27:36,000 --> 00:27:39,780
Item 3, let's say,
it's at position-- wow,

528
00:27:39,780 --> 00:27:42,240
3 would be a bit of a
coincidence, but what the hell.

529
00:27:42,240 --> 00:27:46,630
Maybe h of k3 equals 3.

530
00:27:46,630 --> 00:27:48,030
Then you'd put item 3 here.

531
00:27:50,540 --> 00:27:51,040
OK.

532
00:27:51,040 --> 00:27:51,750
You get the idea.

533
00:27:51,750 --> 00:27:54,180
So these four items each
have a special position

534
00:27:54,180 --> 00:27:55,530
in the table.

535
00:27:55,530 --> 00:28:02,880
And the idea is we would
like m to be around n.

536
00:28:07,550 --> 00:28:19,280
n is the number of keys in
the dictionary right now.

537
00:28:19,280 --> 00:28:21,635
So if we could achieve
that, the size of the table

538
00:28:21,635 --> 00:28:23,760
was proportional to the
number of keys being stored

539
00:28:23,760 --> 00:28:26,970
in the dictionary, that would
be good news because then

540
00:28:26,970 --> 00:28:30,110
the space is not
gigantic and hoggish.

541
00:28:30,110 --> 00:28:33,657
It would just be linear,
which is optimal.

542
00:28:33,657 --> 00:28:35,490
So if we want to store
n things, maybe we'll

543
00:28:35,490 --> 00:28:38,630
use 2n space, or 3n
space, but not much more.

544
00:28:41,740 --> 00:28:45,140
How the heck are we going
to define such a function h?

545
00:28:45,140 --> 00:28:47,560
Well, that's the
rest of the lecture.

546
00:28:47,560 --> 00:28:49,224
But even before we
define a function h,

547
00:28:49,224 --> 00:28:50,640
do you see any
problems with this?

548
00:28:55,580 --> 00:28:56,146
Yeah.

549
00:28:56,146 --> 00:28:57,062
AUDIENCE: [INAUDIBLE].

550
00:29:02,764 --> 00:29:03,430
PROFESSOR: Yeah.

551
00:29:03,430 --> 00:29:05,560
This space over here, this
is pigeonhole principle.

552
00:29:05,560 --> 00:29:07,570
The number of slots for
your pigeons over here

553
00:29:07,570 --> 00:29:10,240
is way smaller than the
number of possible pigeons.

554
00:29:10,240 --> 00:29:13,110
So there are going
to be two keys that

555
00:29:13,110 --> 00:29:16,990
map to the same slot
in the hash table.

556
00:29:16,990 --> 00:29:18,365
This is what we
call a collision.

557
00:29:21,190 --> 00:29:24,500
Let's call this, I
don't know, ki, kj.

558
00:29:28,047 --> 00:29:35,500
h of ki equals h of kj,
but the keys are different.

559
00:29:35,500 --> 00:29:40,920
So ki does not equal kj,
yet their hash functions

560
00:29:40,920 --> 00:29:42,840
are the same, hash
values are the same.

561
00:29:42,840 --> 00:29:44,660
We call that a collision.
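
A small illustration of the pigeonhole argument, using a toy hash function (k mod m) and a key set that are both assumed here purely for the example: with five keys and only four slots, two distinct keys must share a slot.

```python
def h(k, m):
    # Toy hash function, assumed for illustration only.
    return k % m


m = 4
keys = [3, 10, 17, 22, 35]        # five distinct keys, four slots
slots = [h(k, m) for k in keys]   # slot chosen for each key

# Fewer distinct slots than keys means at least one collision.
has_collision = len(set(slots)) < len(keys)
```

Here 10 and 22 both land in slot 2, and 3 and 35 both land in slot 3, even though the keys themselves differ.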

562
00:29:44,660 --> 00:29:48,990
And that's guaranteed to
happen a lot, yet somehow,

563
00:29:48,990 --> 00:29:51,097
we can still make this work.

564
00:29:51,097 --> 00:29:51,805
That's the magic.

565
00:29:57,640 --> 00:29:59,350
And that is going
to be chaining.

566
00:29:59,350 --> 00:30:02,080
We've done these guys.

567
00:30:02,080 --> 00:30:05,264
Next up is a technique for
dealing with collisions.

568
00:30:05,264 --> 00:30:07,430
There are two techniques
for dealing with collisions

569
00:30:07,430 --> 00:30:09,762
we're going to
talk about in 006.

570
00:30:09,762 --> 00:30:11,720
One is called chaining,
and next Tuesday, we'll

571
00:30:11,720 --> 00:30:15,450
see another method
called open addressing.

572
00:30:15,450 --> 00:30:17,170
But let's start with chaining.

573
00:30:21,220 --> 00:30:24,730
The idea with chaining is simple.

574
00:30:24,730 --> 00:30:28,400
If you have multiple items
here all with the same-- that

575
00:30:28,400 --> 00:30:32,860
hash to the same position,
just store them as a list.

576
00:30:32,860 --> 00:30:35,050
I'm going to draw
it as a linked list.

577
00:31:02,850 --> 00:31:06,688
I think I need a
big picture here.

578
00:31:27,710 --> 00:31:35,270
So we have our nice universe,
various keys that we actually

579
00:31:35,270 --> 00:31:37,700
have present.

580
00:31:37,700 --> 00:31:42,740
So these are the keys
in the dictionary,

581
00:31:42,740 --> 00:31:44,280
and this is all of key space.

582
00:31:53,170 --> 00:31:56,170
These guys map to
slots in the table.

583
00:31:56,170 --> 00:31:58,490
Some of them might
map to the same value.

584
00:31:58,490 --> 00:32:04,975
So let's say k1 and k2,
suppose they collide.

585
00:32:04,975 --> 00:32:06,620
So they both go to this slot.

586
00:32:06,620 --> 00:32:11,230
What we're going to store
here is a linked list

587
00:32:11,230 --> 00:32:16,750
that stores item 1,
and stores a pointer

588
00:32:16,750 --> 00:32:21,450
to the next item,
which is item 2.

589
00:32:21,450 --> 00:32:23,380
And that's the end of the list.

590
00:32:23,380 --> 00:32:27,160
Or you could-- however
you want to draw a null.

591
00:32:27,160 --> 00:32:30,320
So however many items
there are, we're

592
00:32:30,320 --> 00:32:33,700
going to have a linked list
of that length in that slot.

593
00:32:33,700 --> 00:32:37,440
So in particular, if there's
just one item, like say,

594
00:32:37,440 --> 00:32:42,430
this k3 here, maybe it
just maps to this slot.

595
00:32:42,430 --> 00:32:44,350
And maybe that's all
that maps to that slot.

596
00:32:44,350 --> 00:32:48,330
In that case, we just
say, follow this item 3,

597
00:32:48,330 --> 00:32:50,370
and there's no other items.

598
00:32:50,370 --> 00:32:52,680
Some slots are going
to be completely empty.

599
00:32:52,680 --> 00:32:56,440
There's nothing there, so you
just store a null pointer.

600
00:32:56,440 --> 00:32:58,150
That is hashing with chaining.

601
00:32:58,150 --> 00:33:02,350
It's pretty simple,
very simple really.
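
The picture above can be sketched roughly as follows. This is only one possible rendering: the class name ChainedHashTable is made up, each chain is a Python list rather than a true linked list, and the hash function is assumed to be Python's built-in hash reduced mod m.

```python
class ChainedHashTable:
    """Sketch of hashing with chaining; each slot holds a Python list as its chain."""

    def __init__(self, m=8):
        self.m = m
        # An empty list plays the role of the null pointer in an empty slot.
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        # Assumed hash function: built-in hash reduced to the range 0..m-1.
        return hash(key) % self.m

    def insert(self, key, value):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)  # key already present: overwrite
                return
        chain.append((key, value))

    def search(self, key):
        # Walk the chain in the key's slot, comparing keys one by one.
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return
        raise KeyError(key)


t = ChainedHashTable(m=4)
t.insert(1, "item 1")
t.insert(5, "item 2")   # 1 and 5 collide: both land in slot 1 when m = 4
t.insert(3, "item 3")   # alone in slot 3
```

Slot 1 ends up holding a chain of two items, slot 3 a chain of one, and the remaining slots stay empty.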

602
00:33:02,350 --> 00:33:05,550
The only question is why would
you expect it to be any good?

603
00:33:05,550 --> 00:33:08,960
Because, in the worst case,
if you fix your hash function

604
00:33:08,960 --> 00:33:11,920
here, h, there's going to
be a whole bunch of keys

605
00:33:11,920 --> 00:33:13,370
that all map to the same slot.

606
00:33:13,370 --> 00:33:16,330
And so in the worst case, those
are the keys that you insert,

607
00:33:16,330 --> 00:33:17,747
and they all go here.

608
00:33:17,747 --> 00:33:19,580
And then you have this
fancy data structure.

609
00:33:19,580 --> 00:33:23,100
And in the end, all you have is
a linked list of all n items.

610
00:33:23,100 --> 00:33:30,950
So the worst case is theta n.

611
00:33:30,950 --> 00:33:34,520
And this is going to be true for
any hashing scheme, actually.

612
00:33:34,520 --> 00:33:36,710
In the worst case,
hashing sucks.

613
00:33:36,710 --> 00:33:39,400
Yet in practice, it works
really, really well.

614
00:33:39,400 --> 00:33:41,960
And the reason is
randomization, essentially,

615
00:33:41,960 --> 00:33:45,620
that this hash function,
unless you're really unlucky,

616
00:33:45,620 --> 00:33:48,270
the hash function will
nicely distribute your items,

617
00:33:48,270 --> 00:33:52,700
and most of these lists
will have constant length.

618
00:33:52,700 --> 00:34:00,720
We're going to prove
that under an assumption.

619
00:34:00,720 --> 00:34:02,380
We'll have to warm
up a little bit.

620
00:34:07,000 --> 00:34:09,739
But I'm also going to cop
out a little, as you'll see.

621
00:34:22,960 --> 00:34:27,250
So in 006 we're going to make
an assumption called Simple

622
00:34:27,250 --> 00:34:29,080
Uniform Hashing.

623
00:34:29,080 --> 00:34:31,255
OK.

624
00:34:31,255 --> 00:34:35,850
And this is an assumption,
it's an unrealistic assumption.

625
00:34:35,850 --> 00:34:40,330
I would go so far as to say
it's false, a false assumption.

626
00:34:40,330 --> 00:34:42,236
But it's really
convenient for analysis,

627
00:34:42,236 --> 00:34:43,610
and it's going to
make it obvious

628
00:34:43,610 --> 00:34:45,580
why chaining is a good idea.

629
00:34:45,580 --> 00:34:48,139
Sadly, the assumption
isn't quite true,

630
00:34:48,139 --> 00:34:49,969
but it gives you a flavor.

631
00:34:49,969 --> 00:34:52,080
If you want to see why
hashing is actually good,

632
00:34:52,080 --> 00:34:53,955
I'm going to hint at it
at the end of lecture

633
00:34:53,955 --> 00:34:55,611
but really you should
take 6.046. Yeah.

634
00:34:55,611 --> 00:34:56,902
AUDIENCE: [INAUDIBLE] question.

635
00:34:56,902 --> 00:34:59,182
Is the hashing
function [INAUDIBLE]?

636
00:34:59,182 --> 00:35:01,348
Like, how do we know the
array is still [INAUDIBLE]?

637
00:35:01,348 --> 00:35:01,931
PROFESSOR: OK.

638
00:35:01,931 --> 00:35:07,620
The hashing function-- I guess
I didn't specify up here.

639
00:35:07,620 --> 00:35:14,160
The hashing function maps
your universe to 0, 1,

640
00:35:14,160 --> 00:35:17,520
up to m minus 1.
That's the definition.

641
00:35:17,520 --> 00:35:23,090
So it's guaranteed to reduce the
space of keys to just m slots.

642
00:35:23,090 --> 00:35:25,467
So your hashing function
needs to know what m is.

643
00:35:25,467 --> 00:35:27,800
In reality there's not going
to be one hashing function,

644
00:35:27,800 --> 00:35:30,669
there's going to be one for each
m, or at least one for each m.

645
00:35:30,669 --> 00:35:32,460
And so, depending on
how big your table is,

646
00:35:32,460 --> 00:35:34,337
you use the corresponding
hash function.

647
00:35:34,337 --> 00:35:35,170
Yeah, good question.

648
00:35:35,170 --> 00:35:36,545
So the hash function
is what does

649
00:35:36,545 --> 00:35:39,110
the work of reducing
your key space down

650
00:35:39,110 --> 00:35:40,730
to a small set of slots.

651
00:35:40,730 --> 00:35:44,180
So that's what's going
to give us low space.

652
00:35:44,180 --> 00:35:44,850
OK.

653
00:35:44,850 --> 00:35:47,006
But now, how do we get low time?

654
00:35:47,006 --> 00:35:49,255
Let me just state this
assumption and get to business.

655
00:36:33,300 --> 00:36:35,510
Simple uniform hashing
is, essentially,

656
00:36:35,510 --> 00:36:38,050
two probabilistic assumptions.

657
00:36:38,050 --> 00:36:41,360
The first one is uniformity.

658
00:36:41,360 --> 00:36:44,070
If you take some
key in your space

659
00:36:44,070 --> 00:36:46,170
that you want to store,
the hash function

660
00:36:46,170 --> 00:36:49,230
maps it to a uniform
random choice.

661
00:36:49,230 --> 00:36:51,540
This, of course, is
what you want to happen.

662
00:36:51,540 --> 00:36:58,271
Each of these slots here is
equally likely to be hashed to.

663
00:36:58,271 --> 00:36:58,770
OK.

664
00:36:58,770 --> 00:37:00,020
That's a good start.

665
00:37:00,020 --> 00:37:03,550
But to do proper analysis,
not only do we need uniformity,

666
00:37:03,550 --> 00:37:05,745
we also need independence.

667
00:37:05,745 --> 00:37:07,870
So not only is this true
for each key individually,

668
00:37:07,870 --> 00:37:10,210
but it's true for all
the keys together.

669
00:37:10,210 --> 00:37:13,500
So if key one maps to
a uniform random place,

670
00:37:13,500 --> 00:37:16,840
no matter where it
goes, key two also

671
00:37:16,840 --> 00:37:18,270
maps to a uniform
random place.

672
00:37:18,270 --> 00:37:19,644
And no matter
where those two go,

673
00:37:19,644 --> 00:37:22,040
key three maps to a
uniform random place.

674
00:37:22,040 --> 00:37:23,640
This really can't be true.

675
00:37:23,640 --> 00:37:27,830
But if it's true, we can prove
that this takes constant time.

676
00:37:27,830 --> 00:37:29,500
So let me do that.

677
00:37:41,180 --> 00:37:45,660
So under this assumption,
we can analyze

678
00:37:45,660 --> 00:37:51,400
hashing-- hashing with chaining
is what this method is called.

679
00:37:51,400 --> 00:37:56,400
So let's do it.

680
00:37:56,400 --> 00:37:59,319
I want to know-- I
got to cheat, sorry.

681
00:37:59,319 --> 00:38:00,610
I got to remember the notation.

682
00:38:03,690 --> 00:38:05,460
I don't have any
good notation here.

683
00:38:05,460 --> 00:38:08,100
All right.

684
00:38:08,100 --> 00:38:12,965
What I'd like to know is the
expected length of a chain.

685
00:38:18,460 --> 00:38:18,960
OK.

686
00:38:18,960 --> 00:38:25,160
Now this is if I have n keys
that are stored in the table,

687
00:38:25,160 --> 00:38:29,220
and m slots in the
table, then what

688
00:38:29,220 --> 00:38:32,030
is the expected
length of a chain?

689
00:38:32,030 --> 00:38:33,131
Any suggestions?

690
00:38:33,131 --> 00:38:33,630
Yeah.

691
00:38:33,630 --> 00:38:35,870
AUDIENCE: 1 over m to the n.

692
00:38:35,870 --> 00:38:37,480
PROFESSOR: 1 over m to the n?

693
00:38:37,480 --> 00:38:41,700
That's going to be a
probability of something.

694
00:38:41,700 --> 00:38:42,200
Not quite.

695
00:38:42,200 --> 00:38:43,370
AUDIENCE: [INAUDIBLE]

696
00:38:43,370 --> 00:38:44,786
PROFESSOR: That's
between 0 and 1.

697
00:38:44,786 --> 00:38:47,100
It's probably at least
one, or something.

698
00:38:47,100 --> 00:38:47,855
Yeah.

699
00:38:47,855 --> 00:38:49,190
AUDIENCE: m over n.

700
00:38:49,190 --> 00:38:51,136
PROFESSOR: n over m, yeah.

701
00:38:54,630 --> 00:38:56,070
It's really easy.

702
00:38:56,070 --> 00:39:00,010
The chance of a key going to
a particular slot is 1 over m.

703
00:39:00,010 --> 00:39:03,020
They're all independent, so
it's 1 over m, plus 1 over m,

704
00:39:03,020 --> 00:39:05,160
plus 1 over m, n times.

705
00:39:05,160 --> 00:39:07,100
So it's n over m.
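
Written out, the sum above looks roughly like this. The indicator variable X_i is notation introduced here, not from the lecture: it is 1 when key i hashes to one fixed slot and 0 otherwise.

```latex
\[
  \mathbb{E}[\text{chain length}]
  = \mathbb{E}\Big[\sum_{i=1}^{n} X_i\Big]
  = \sum_{i=1}^{n} \Pr[\text{key } i \text{ hashes to the slot}]
  = \sum_{i=1}^{n} \frac{1}{m}
  = \frac{n}{m}.
\]
```

For example, with n = 1000 keys and m = 500 slots, the expected chain length is 2.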

706
00:39:07,100 --> 00:39:10,730
This is really easy when
you have independence.

707
00:39:10,730 --> 00:39:13,210
Sadly, in the real world,
you don't have independence.

708
00:39:13,210 --> 00:39:15,806
We're going to call
this thing alpha,

709
00:39:15,806 --> 00:39:21,560
and it's also known as the
load factor of the table.

710
00:39:21,560 --> 00:39:24,960
So if it's one, n equals m.

711
00:39:24,960 --> 00:39:27,650
And so the length
of a chain is one.

712
00:39:27,650 --> 00:39:31,350
If it's 10, then you have
10 times as many elements

713
00:39:31,350 --> 00:39:32,130
as you have slots.

714
00:39:32,130 --> 00:39:34,560
But still, the expected
length of a chain is 10.

715
00:39:34,560 --> 00:39:35,660
That's a constant.

716
00:39:35,660 --> 00:39:36,470
It's OK.

717
00:39:36,470 --> 00:39:38,664
If it's a half, that's OK.

718
00:39:38,664 --> 00:39:41,080
It means that you have a bigger
table than you have items.

719
00:39:41,080 --> 00:39:45,210
As long as it's a constant,
as long as we have-- I

720
00:39:45,210 --> 00:39:49,817
erased it by now-- as
long as m is theta n,

721
00:39:49,817 --> 00:39:51,025
this is going to be constant.

722
00:39:55,730 --> 00:39:57,710
And so we need to
maintain this property.

723
00:39:57,710 --> 00:40:00,290
But as long as you set your
table size to the right value,

724
00:40:00,290 --> 00:40:04,730
to be roughly n, this
will be constant.

725
00:40:04,730 --> 00:40:12,900
And so the running time of
an operation, insert, delete,

726
00:40:12,900 --> 00:40:17,480
and search-- Well,
search is really

727
00:40:17,480 --> 00:40:20,430
the hardest because when you
want to search for a key,

728
00:40:20,430 --> 00:40:24,700
you map it into your table,
then you walk the linked list

729
00:40:24,700 --> 00:40:26,692
and look for the key that
you're searching for.

730
00:40:26,692 --> 00:40:28,400
Now is this the key
you're searching for?

731
00:40:28,400 --> 00:40:30,350
No, it's not the key
you're searching for.

732
00:40:30,350 --> 00:40:31,957
Is this the key
you're searching for?

733
00:40:31,957 --> 00:40:33,790
Those are not the keys
you're searching for.

734
00:40:33,790 --> 00:40:34,510
You keep going.

735
00:40:34,510 --> 00:40:36,650
Either you find your
key or you don't.

736
00:40:36,650 --> 00:40:40,267
But in the worst case, you
have to walk the entire list.

737
00:40:40,267 --> 00:40:42,350
Sorry for the bad Star
Trek reference-- Star Wars.

738
00:40:42,350 --> 00:40:45,110
God.

739
00:40:45,110 --> 00:40:45,930
I'm not awake.

740
00:40:45,930 --> 00:40:48,370
All right.

741
00:40:48,370 --> 00:40:50,820
In general, the running
time, in the worst case,

742
00:40:50,820 --> 00:40:53,785
is 1 plus the length
of your chain.

743
00:40:56,330 --> 00:40:56,830
OK.

744
00:40:56,830 --> 00:40:59,340
So it's going to
be 1 plus alpha.

745
00:40:59,340 --> 00:41:00,970
Why do I write one?

746
00:41:00,970 --> 00:41:04,930
Well, because alpha can be much
smaller than 1, in general.

747
00:41:04,930 --> 00:41:06,420
And you always have
to pay the cost

748
00:41:06,420 --> 00:41:07,810
of computing the hash function.

749
00:41:07,810 --> 00:41:10,770
We're going to assume
that takes constant time.

750
00:41:10,770 --> 00:41:13,200
And then you have to
follow the first pointer.

751
00:41:13,200 --> 00:41:17,590
So you always pay constant time,
but then you also pay alpha.

752
00:41:17,590 --> 00:41:20,470
That's your expected length.

753
00:41:20,470 --> 00:41:20,970
OK.

754
00:41:20,970 --> 00:41:21,930
That's the analysis.

755
00:41:21,930 --> 00:41:23,080
It's super simple.

756
00:41:23,080 --> 00:41:25,860
If you assume Simple
Uniform Hashing,

757
00:41:25,860 --> 00:41:30,550
it's clear, as long as your load
factor is constant, m is theta n,

758
00:41:30,550 --> 00:41:33,490
you get constant running
time for all your operations.

759
00:41:33,490 --> 00:41:34,530
Life is good.

760
00:41:34,530 --> 00:41:37,010
This is the intuition
of why hashing works.

761
00:41:37,010 --> 00:41:39,140
It's not really
why hashing works.

762
00:41:39,140 --> 00:41:43,176
But it's about as far as
we're going to get in 006.

763
00:41:43,176 --> 00:41:44,800
I'm going to tell
you a little bit more

764
00:41:44,800 --> 00:41:49,380
about why hashing is actually
good in practice and in theory.

765
00:42:06,820 --> 00:42:10,020
What are we up to?

766
00:42:10,020 --> 00:42:12,740
Last topic is hash functions.

767
00:42:12,740 --> 00:42:16,380
The one remaining thing
is how do I construct h?

768
00:42:16,380 --> 00:42:19,800
How do I actually map from
this giant universe of keys

769
00:42:19,800 --> 00:42:24,961
to this small set of slots in
the table, there's m of them?

770
00:42:29,260 --> 00:42:34,140
I'm going to give you three hash
functions, two of which are,

771
00:42:34,140 --> 00:42:37,210
let's say, common practice, and
the third of which is actually

772
00:42:37,210 --> 00:42:38,710
theoretically good.

773
00:42:38,710 --> 00:42:40,930
So the first two are
not good theoretically.

774
00:42:40,930 --> 00:42:43,060
You can prove that they're
bad, but at least they

775
00:42:43,060 --> 00:42:45,190
give you some
flavor, and they're

776
00:42:45,190 --> 00:42:51,979
still common in practice because
a lot of the time they're OK,

777
00:42:51,979 --> 00:42:53,770
but you can't really
prove much about them.

778
00:42:56,490 --> 00:42:56,990
OK.

779
00:42:56,990 --> 00:43:03,000
So first method, sort
of the obvious one,

780
00:43:03,000 --> 00:43:04,940
called the division method.

781
00:43:04,940 --> 00:43:06,820
And if you have
a key, this could

782
00:43:06,820 --> 00:43:09,950
be a giant key, huge
universe of keys,

783
00:43:09,950 --> 00:43:14,065
you just take that
key, modulo m,

784
00:43:14,065 --> 00:43:16,190
that gives you a number
between zero and m minus 1.

785
00:43:16,190 --> 00:43:17,110
Done.

786
00:43:17,110 --> 00:43:19,542
It's so easy.

787
00:43:19,542 --> 00:43:21,000
I'm not going to
tell you in detail

788
00:43:21,000 --> 00:43:22,660
why this is a bad method.

789
00:43:22,660 --> 00:43:24,060
Maybe you can think about it.

790
00:43:24,060 --> 00:43:29,890
It's especially bad if m has
some common factors with k.

791
00:43:29,890 --> 00:43:32,980
Like, let's say
k is even always,

792
00:43:32,980 --> 00:43:34,974
and m is even also
because you say,

793
00:43:34,974 --> 00:43:36,890
oh, I'd like a table the
size of power of two.

794
00:43:36,890 --> 00:43:37,969
That seems natural.

795
00:43:37,969 --> 00:43:39,760
Then that will be really
bad because you'll

796
00:43:39,760 --> 00:43:41,650
use only half the table.
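
A quick check of that claim, with an assumed key set (the even numbers below 1000) and assumed table sizes 8 and 7 chosen only for the example:

```python
def h_division(k, m):
    # Division method: reduce the key mod the table size.
    return k % m


even_keys = range(0, 1000, 2)      # assumed key set: every key is even

m_even = 8                         # a power of two, so it shares the factor 2
used_even = {h_division(k, m_even) for k in even_keys}

m_prime = 7                        # a prime table size, no common factor
used_prime = {h_division(k, m_prime) for k in even_keys}
```

With m = 8 only slots 0, 2, 4, and 6 are ever used, which is half the table; with m = 7 all seven slots get used.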

797
00:43:41,650 --> 00:43:44,060
There are lots of situations
where this is bad.

798
00:43:44,060 --> 00:43:46,640
In practice, it's pretty good.

799
00:43:46,640 --> 00:43:49,436
If m is prime, you always
choose a prime table size,

800
00:43:49,436 --> 00:43:51,060
so you don't have
those common factors.

801
00:43:51,060 --> 00:43:54,610
And it's not very close to
a power of 2 or power of 10

802
00:43:54,610 --> 00:43:57,920
because in the real world, powers
of 2 and 10 are common.

803
00:43:57,920 --> 00:43:59,740
But it's very hackish, OK?

804
00:43:59,740 --> 00:44:02,990
It works a lot of the
time but not always.

805
00:44:02,990 --> 00:44:07,570
A cooler method-- I think
it's cooler-- still,

806
00:44:07,570 --> 00:44:14,290
you can't prove much
about it-- Division didn't

807
00:44:14,290 --> 00:44:17,290
seem to work so great, so
how about multiplication?

808
00:44:17,290 --> 00:44:18,140
What does that mean?

809
00:44:18,140 --> 00:44:20,420
Multiply by m, that
wouldn't be very good.

810
00:44:20,420 --> 00:44:24,790
No, it's a bit different.

811
00:44:24,790 --> 00:44:30,780
We're going to take the key,
multiply it by an integer, a,

812
00:44:30,780 --> 00:44:35,000
and then we're going to do
this crazy, crazy stuff.

813
00:44:35,000 --> 00:44:41,920
Take it mod 2 to the w and
then shift it right by w minus r.

814
00:44:41,920 --> 00:44:42,450
OK.

815
00:44:42,450 --> 00:44:43,890
What is w?

816
00:44:43,890 --> 00:44:48,380
We're assuming that
we're in a w-bit machine.

817
00:44:48,380 --> 00:44:51,780
Remember way back in
models of computation?

818
00:44:51,780 --> 00:44:54,720
Your machine has a
word size, it's w bits.

819
00:44:54,720 --> 00:44:56,450
So let's suppose it's w bits.

820
00:44:56,450 --> 00:44:59,530
So we have our key, k.

821
00:44:59,530 --> 00:45:00,050
Here it is.

822
00:45:00,050 --> 00:45:01,160
It's w bits long.

823
00:45:03,930 --> 00:45:07,340
We take some number,
a-- think of a as being

824
00:45:07,340 --> 00:45:12,070
a random integer among all
possible w bit integers.

825
00:45:12,070 --> 00:45:17,140
So it's got some zeros,
it's got some ones.

826
00:45:17,140 --> 00:45:18,950
And I multiply these.

827
00:45:18,950 --> 00:45:20,630
What does multiplication
mean in binary?

828
00:45:20,630 --> 00:45:25,560
Well, I take one of these copies
of k for each one that's here.

829
00:45:25,560 --> 00:45:27,560
So I'm going to
take one copy here

830
00:45:27,560 --> 00:45:29,320
because there's a one there.

831
00:45:29,320 --> 00:45:32,560
I'm going to take one copy here
because there's a one there.

832
00:45:32,560 --> 00:45:35,510
And I'm going to
take one copy here

833
00:45:35,510 --> 00:45:37,860
because there's a one there.

834
00:45:37,860 --> 00:45:40,990
And on average, half
of them will be ones.

835
00:45:40,990 --> 00:45:46,150
So I have various copies of k,
and then I just add them up.

836
00:45:46,150 --> 00:45:47,420
And you know, stuff happens.

837
00:45:47,420 --> 00:45:50,080
I get some gobbledygook here.

838
00:45:50,080 --> 00:45:50,580
OK.

839
00:45:50,580 --> 00:45:51,270
How big is it?

840
00:45:51,270 --> 00:45:53,710
In general, it's two words long.

841
00:45:53,710 --> 00:45:57,090
When I multiply two
words I get two words.

842
00:45:57,090 --> 00:45:59,190
It could be twice
as long, in general.

843
00:45:59,190 --> 00:46:03,480
And what this business is doing
is saying take the right word,

844
00:46:03,480 --> 00:46:08,590
this right half here-- let
the right word in, I guess,

845
00:46:08,590 --> 00:46:12,520
if you see vampire
movies-- and then shift

846
00:46:12,520 --> 00:46:16,704
right-- this is a shift right
operation-- by w minus r.

847
00:46:16,704 --> 00:46:17,870
I didn't even say what r is.

848
00:46:17,870 --> 00:46:21,130
But basically, what
I want is these bits.

849
00:46:21,130 --> 00:46:24,780
I want r bits here--
this is w bits.

850
00:46:24,780 --> 00:46:29,258
I want the leftmost r bits
of the rightmost w bits

851
00:46:29,258 --> 00:46:32,510
because I shift right here
and get rid of all these guys.

852
00:46:32,510 --> 00:46:36,644
r-- I should say,
m, is two to the r.

853
00:46:36,644 --> 00:46:38,060
So I'm going to
assume here I have

854
00:46:38,060 --> 00:46:42,370
a table of size a power of
2, and then this number will

855
00:46:42,370 --> 00:46:47,440
be a number between
0 and m minus 1.
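
The formula just described can be sketched like this. The word size w = 32, the table size m = 2^4, and the particular multiplier a are all assumptions picked for illustration, not values from the lecture.

```python
def h_mult(k, a, w, r):
    """Multiplication method: ((a * k) mod 2^w) >> (w - r)."""
    low_word = (a * k) & ((1 << w) - 1)   # keep the rightmost w bits of the product
    return low_word >> (w - r)            # keep the leftmost r bits of that word


w = 32           # assumed word size
r = 4            # table size m = 2^r = 16
a = 2654435769   # assumed multiplier: an odd w-bit integer

slots = [h_mult(k, a, w, r) for k in range(1000)]
```

Every value in slots falls between 0 and m minus 1, because only r bits survive the shift.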

856
00:46:47,440 --> 00:46:47,940
OK.

857
00:46:47,940 --> 00:46:50,260
Why does this work?

858
00:46:50,260 --> 00:46:52,265
It's intuitive.

859
00:46:52,265 --> 00:46:54,390
In practice it works quite
well because what you're

860
00:46:54,390 --> 00:46:57,090
doing is taking a whole
bunch of sort of randomly

861
00:46:57,090 --> 00:47:00,200
shifted copies of k, adding
them up-- you get carries,

862
00:47:00,200 --> 00:47:02,690
things get mixed
up-- This is hashing.

863
00:47:02,690 --> 00:47:04,830
This is-- you're taking
k, sort of cutting it up

864
00:47:04,830 --> 00:47:08,040
while you're shifting it around,
adding things and they collide,

865
00:47:08,040 --> 00:47:09,660
and weird stuff happens.

866
00:47:09,660 --> 00:47:11,670
You sort of randomize stuff.

867
00:47:11,670 --> 00:47:13,440
Out here, you don't
get much randomization

868
00:47:13,440 --> 00:47:15,420
because most-- like
the last bit could just

869
00:47:15,420 --> 00:47:16,920
be this one bit of k.

870
00:47:16,920 --> 00:47:19,730
But in the middle, everybody's
kind of colliding together.

871
00:47:19,730 --> 00:47:21,190
And so intuitively,
you're mixing

872
00:47:21,190 --> 00:47:22,650
lots of things in the center.

873
00:47:22,650 --> 00:47:25,310
You take those r bits,
roughly, in the center.

874
00:47:25,310 --> 00:47:27,550
That will be nicely mixed up.

875
00:47:27,550 --> 00:47:29,280
And most of the time
this works well.

876
00:47:29,280 --> 00:47:33,950
In practice it works well-- I
have some things written here.

877
00:47:33,950 --> 00:47:37,380
a better be odd, otherwise
you're throwing away stuff.

878
00:47:37,380 --> 00:47:39,980
And it should not be very
close to a power of 2.

879
00:47:39,980 --> 00:47:44,840
But it should be in between 2
to the w minus 1 and 2 to the w.

880
00:47:44,840 --> 00:47:47,080
Cool.
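[Editor's sketch, not from the lecture: the multiplication method described above, written in Python. The function name mult_hash, the defaults w = 64 and r = 10, and the sample multiplier are all assumptions chosen for illustration; a is taken to be an odd w-bit integer.]

```python
def mult_hash(k, a, w=64, r=10):
    """Multiplication method: ((a * k) mod 2^w) >> (w - r).

    k is the key, a is an odd w-bit multiplier, and the table
    has m = 2^r slots, so the result lies in 0 .. m - 1.
    """
    low_word = (a * k) & ((1 << w) - 1)  # keep the right (low) w bits of the product
    return low_word >> (w - r)           # keep the leftmost r of those w bits


# Hypothetical multiplier: an arbitrary odd 64-bit constant, not one from the lecture.
A = 0x9E3779B97F4A7C15

if __name__ == "__main__":
    print(mult_hash(12345, A))             # some slot in 0 .. 1023
    print(mult_hash(105, 181, w=8, r=3))   # tiny 8-bit example, prints 1
```

[In the 8-bit example, 181 times 105 is 19005, the low 8 bits are 61, and shifting right by 5 leaves the top 3 bits, which is 1.]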

881
00:47:47,080 --> 00:47:47,580
One more.

882
00:47:52,750 --> 00:47:55,230
Again, theoretically,
this can be bad.

883
00:47:55,230 --> 00:47:57,930
And I leave it as an exercise
to find situations, find

884
00:47:57,930 --> 00:48:00,440
key values where this
does not do a good job.

885
00:48:03,790 --> 00:48:07,540
The cool method is
called universal hashing.

886
00:48:11,120 --> 00:48:14,495
This is something that's a
bit beyond the scope of 006.

887
00:48:14,495 --> 00:48:17,440
If you want to understand it
better you should take 046.

888
00:48:17,440 --> 00:48:21,437
But I'll give you the flavor and
the method, one of the methods.

889
00:48:21,437 --> 00:48:23,020
There's actually
many ways to do this.

890
00:48:33,690 --> 00:48:34,999
We see a mod m on the outside.

891
00:48:34,999 --> 00:48:37,540
That's just division method just
to make the number between 0

892
00:48:37,540 --> 00:48:39,760
and m minus 1.

893
00:48:39,760 --> 00:48:41,245
Here's our key.

894
00:48:41,245 --> 00:48:42,870
And then there's
these numbers a and b.

895
00:48:42,870 --> 00:48:49,220
These are going to be
random numbers between 0

896
00:48:49,220 --> 00:48:51,350
and p minus 1.

897
00:48:51,350 --> 00:48:52,490
What's p?

898
00:48:52,490 --> 00:48:58,660
Prime number bigger than
the size of the universe.

899
00:48:58,660 --> 00:49:00,430
So it's a big prime number.

900
00:49:00,430 --> 00:49:03,870
I think we know how
to find prime numbers.

901
00:49:03,870 --> 00:49:05,770
We don't know in this
class, but people

902
00:49:05,770 --> 00:49:07,740
know how to find
the prime numbers.

903
00:49:07,740 --> 00:49:09,977
So there's a subroutine
here, find a big prime number

904
00:49:09,977 --> 00:49:11,060
bigger than your universe.

905
00:49:11,060 --> 00:49:12,268
It's not too hard to do that.

906
00:49:12,268 --> 00:49:15,369
We can do it in polynomial time.

907
00:49:15,369 --> 00:49:16,160
That's just setup.

908
00:49:16,160 --> 00:49:19,220
You do that once for
a given size table.

909
00:49:19,220 --> 00:49:23,916
And then you choose two
random numbers, a and b.

910
00:49:23,916 --> 00:49:25,790
And then this is the
hash function, a times k

911
00:49:25,790 --> 00:49:28,980
plus b, mod p mod m.
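[Editor's sketch, not from the lecture: one way to write this hash function in Python. The name make_universal_hash and the choice p = 2^61 - 1 (a known Mersenne prime, assumed here to exceed the key universe) are illustrative assumptions. Following the lecture, a and b are drawn from 0 to p minus 1.]

```python
import random


def make_universal_hash(m, p, rng=random):
    """Pick random a, b once and return h(k) = ((a*k + b) mod p) mod m."""
    a = rng.randrange(p)  # random integer in 0 .. p - 1
    b = rng.randrange(p)  # random integer in 0 .. p - 1

    def h(k):
        return ((a * k + b) % p) % m

    return h


if __name__ == "__main__":
    P = 2**61 - 1                      # assumed prime larger than every key
    h = make_universal_hash(m=16, p=P)
    print([h(k) for k in range(10)])   # ten slots, each in 0 .. 15
```

[The random choice happens once, when the table is set up; after that h is an ordinary deterministic function of the key.]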

912
00:49:28,980 --> 00:49:29,480
OK.

913
00:49:29,480 --> 00:49:32,590
What does this do?

914
00:49:32,590 --> 00:49:35,810
It turns out-- here's
the interesting part.

915
00:49:35,810 --> 00:49:45,260
For worst case keys, k1
and k2, that are distinct,

916
00:49:45,260 --> 00:49:56,650
the probability of h of k1
equaling h of k2 is 1 over m.

917
00:49:56,650 --> 00:49:59,820
So probability of two keys
that are different colliding

918
00:49:59,820 --> 00:50:03,072
is 1 over m, for
the worst case keys.

919
00:50:03,072 --> 00:50:04,280
What the heck does that mean?

920
00:50:04,280 --> 00:50:05,770
What's the probability over?

921
00:50:05,770 --> 00:50:08,390
Any suggestions?

922
00:50:08,390 --> 00:50:11,450
What's random here?

923
00:50:11,450 --> 00:50:12,200
AUDIENCE: a and b.

924
00:50:12,200 --> 00:50:13,090
PROFESSOR: a and b.

925
00:50:13,090 --> 00:50:15,250
This is the probability
over a and b.

926
00:50:15,250 --> 00:50:18,350
This is the probability over the
choice of your hash function.

927
00:50:18,350 --> 00:50:22,030
So it's the worst case
inputs, worst case insertions,

928
00:50:22,030 --> 00:50:24,730
but random hash function.

929
00:50:24,730 --> 00:50:26,730
As long as you choose
your random hash function,

930
00:50:26,730 --> 00:50:28,550
the probability of
collision is 1 over m.
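[Editor's sketch, not from the lecture: a small experiment that fixes two distinct keys and re-draws a and b many times, to estimate the collision probability over the choice of hash function. The function name, the prime 2^31 - 1, the two keys, and the trial count are all assumptions for illustration.]

```python
import random


def collision_rate(k1, k2, m, p, trials=20000, seed=0):
    """Fraction of random (a, b) choices for which k1 and k2 collide."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = rng.randrange(p)
        b = rng.randrange(p)
        if ((a * k1 + b) % p) % m == ((a * k2 + b) % p) % m:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # The keys are fixed in advance; only the hash function is random.
    rate = collision_rate(3, 1000003, m=8, p=2**31 - 1)
    print(rate, "versus 1/m =", 1 / 8)
```

[The estimate should land near 1 over m no matter which two distinct keys are chosen, which is the point of the worst-case statement.]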

931
00:50:28,550 --> 00:50:31,130
This is the ideal situation.

932
00:50:31,130 --> 00:50:34,140
And so you can prove, just
like we analyzed here--

933
00:50:34,140 --> 00:50:35,140
It's a little more work.

934
00:50:35,140 --> 00:50:35,910
It's in the notes.

935
00:50:35,910 --> 00:50:37,560
You use linearity
of expectation.

936
00:50:37,560 --> 00:50:39,700
And you can prove, still,
that the expected length

937
00:50:39,700 --> 00:50:42,620
of a chain-- the expected number
of collisions that a key has

938
00:50:42,620 --> 00:50:48,720
with another key is the load
factor, in the worst case,

939
00:50:48,720 --> 00:50:51,502
but in expectation over
the choice of hash function.

940
00:50:51,502 --> 00:50:53,210
So still, the expected
length of a chain,

941
00:50:53,210 --> 00:50:55,400
and therefore, the
expected running time

942
00:50:55,400 --> 00:50:58,334
of hashing with chaining,
using this hash function,

943
00:50:58,334 --> 00:51:00,750
or this collection of hash
functions, or a randomly chosen

944
00:51:00,750 --> 00:51:03,450
one, is constant for
constant load factor.
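[Editor's sketch, not from the lecture: the linearity-of-expectation step written out. The symbols are assumptions: S is the set of n stored keys, k is a key in S, and alpha = n/m is the load factor. The collision probability of 1 over m is the one stated in the lecture.]

```latex
\mathbb{E}_{a,b}\bigl[\#\{k' \in S,\ k' \ne k : h(k') = h(k)\}\bigr]
  = \sum_{k' \in S,\ k' \ne k} \Pr_{a,b}\bigl[h(k') = h(k)\bigr]
  = \frac{n-1}{m}
  < \frac{n}{m} = \alpha
```

[So the expected chain length seen by any fixed key is at most 1 plus alpha, which is constant when the load factor is constant.]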

945
00:51:03,450 --> 00:51:05,689
And that's why hashing
really works in theory.

946
00:51:05,689 --> 00:51:07,730
We're not going to go into
details of this again.

947
00:51:07,730 --> 00:51:09,660
Take 6.046 if you want to know.

948
00:51:09,660 --> 00:51:12,470
But this should make you
feel more comfortable.

949
00:51:12,470 --> 00:51:15,490
And we'll see other ways
to do hashing next class.