1
00:00:00,050 --> 00:00:01,770
The following
content is provided

2
00:00:01,770 --> 00:00:04,000
under a Creative
Commons license.

3
00:00:04,000 --> 00:00:06,850
Your support will help MIT
OpenCourseWare continue

4
00:00:06,850 --> 00:00:10,710
to offer high quality
educational resources for free.

5
00:00:10,710 --> 00:00:13,320
To make a donation or
view additional materials

6
00:00:13,320 --> 00:00:17,187
from hundreds of MIT courses,
visit MIT OpenCourseWare

7
00:00:17,187 --> 00:00:17,812
at ocw.mit.edu.

8
00:00:22,300 --> 00:00:27,160
PROFESSOR: One more
exciting lecture on hashing.

9
00:00:27,160 --> 00:00:29,010
And a couple reminders.

10
00:00:29,010 --> 00:00:32,960
I don't want to start out
saying unpopular things,

11
00:00:32,960 --> 00:00:37,410
but we do have a quiz coming
up next week on Tuesday.

12
00:00:37,410 --> 00:00:41,430
There will not be a
lecture next Tuesday,

13
00:00:41,430 --> 00:00:43,110
but there will be a quiz.

14
00:00:43,110 --> 00:00:47,690
7:30 to 9:30 Tuesday evening.

15
00:00:47,690 --> 00:00:49,140
I will send an announcement.

16
00:00:49,140 --> 00:00:51,480
There's going to
be a couple rooms.

17
00:00:51,480 --> 00:00:52,930
Some of you will
be in this room.

18
00:00:52,930 --> 00:00:54,929
Some of you will have to
go to a different room,

19
00:00:54,929 --> 00:00:56,890
since this room
really can't hold

20
00:00:56,890 --> 00:01:00,070
180 students taking a quiz.

21
00:01:00,070 --> 00:01:01,400
All right?

22
00:01:01,400 --> 00:01:04,047
So hashing.

23
00:01:04,047 --> 00:01:05,630
I'm pretty excited
about this lecture,

24
00:01:05,630 --> 00:01:09,280
because I think as I was
talking with Victor just

25
00:01:09,280 --> 00:01:12,660
before this, if there's one
thing you want to remember

26
00:01:12,660 --> 00:01:16,700
about hashing and you want
to go implement a hash table,

27
00:01:16,700 --> 00:01:18,410
it's open addressing.

28
00:01:18,410 --> 00:01:20,680
It's the simplest way
that you can possibly

29
00:01:20,680 --> 00:01:22,720
implement a hash table.

30
00:01:22,720 --> 00:01:26,030
You can implement a hash
table using an array.

31
00:01:26,030 --> 00:01:30,000
We've obviously talked
about linked lists

32
00:01:30,000 --> 00:01:35,360
and chaining to implement hash
tables in previous lectures,

33
00:01:35,360 --> 00:01:38,930
but we're going to actually get
rid of pointers and linked lists,

34
00:01:38,930 --> 00:01:42,890
and implement a hash table using
a single array data structure,

35
00:01:42,890 --> 00:01:46,830
and that's the notion
of open addressing.

36
00:01:46,830 --> 00:01:49,360
Now in order to get
open addressing to work,

37
00:01:49,360 --> 00:01:50,660
there's no free lunch, right?

38
00:01:50,660 --> 00:01:52,560
So you have a simple
implementation.

39
00:01:52,560 --> 00:01:56,240
It turns out that in order to
make open addressing efficient,

40
00:01:56,240 --> 00:01:58,800
you have to be a little
more careful than if you're

41
00:01:58,800 --> 00:02:02,390
using the hash
tables with chaining.

42
00:02:02,390 --> 00:02:05,070
And we're going to have
to make an assumption

43
00:02:05,070 --> 00:02:06,706
about uniform hashing.

44
00:02:06,706 --> 00:02:08,289
I'll say a little
bit more about that.

45
00:02:08,289 --> 00:02:11,780
But it's a different assumption
from simple uniform hashing

46
00:02:11,780 --> 00:02:13,480
that Eric talked about.

47
00:02:13,480 --> 00:02:16,160
And we'll state this
uniform hashing assumption.

48
00:02:16,160 --> 00:02:20,990
And we look at what the
performance is of open

49
00:02:20,990 --> 00:02:23,390
addressing under
this assumption.

50
00:02:23,390 --> 00:02:26,330
And this assumption
is going to give us

51
00:02:26,330 --> 00:02:30,470
a sense of what good
hash functions are

52
00:02:30,470 --> 00:02:33,350
for open addressing applications
or for open addressing

53
00:02:33,350 --> 00:02:34,770
hash tables.

54
00:02:34,770 --> 00:02:39,330
And finally we'll talk
about cryptographic hashing.

55
00:02:39,330 --> 00:02:42,100
This is not really
6.006 material,

56
00:02:42,100 --> 00:02:44,290
but it's kind of cool material.

57
00:02:44,290 --> 00:02:47,890
It has a lot of applications
in computer security

58
00:02:47,890 --> 00:02:49,100
and cryptography.

59
00:02:49,100 --> 00:02:53,710
And so we'll describe the
notion of a cryptographic hash,

60
00:02:53,710 --> 00:02:57,970
and we'll talk about a couple
of real simple and pervasive

61
00:02:57,970 --> 00:03:00,560
applications like
password storage

62
00:03:00,560 --> 00:03:05,240
and file corruption detectors
that you can implement

63
00:03:05,240 --> 00:03:07,450
using cryptographic
hash functions, which

64
00:03:07,450 --> 00:03:10,440
are quite different from
the regular hash functions

65
00:03:10,440 --> 00:03:13,060
that we're using in hash tables.

66
00:03:13,060 --> 00:03:18,460
Be it chaining hash tables or
open addressing hash tables.

67
00:03:18,460 --> 00:03:19,690
All right?

68
00:03:19,690 --> 00:03:23,120
So let's get started and
talk about open addressing.

69
00:03:30,080 --> 00:03:33,909
This is another approach
to dealing with collisions.

70
00:03:33,909 --> 00:03:35,950
If you didn't have
collisions, obviously an array

71
00:03:35,950 --> 00:03:37,190
would work, right?

72
00:03:37,190 --> 00:03:39,734
If you could somehow guarantee
that there were no collisions.

73
00:03:39,734 --> 00:03:41,150
When you have
collisions, you have

74
00:03:41,150 --> 00:03:44,450
to worry about the
chaining and ensuring

75
00:03:44,450 --> 00:03:46,690
that you can still find
the keys even though you

76
00:03:46,690 --> 00:03:50,800
had two keys that collided
into the same slot.

77
00:03:50,800 --> 00:03:54,090
And we don't want
to use chaining.

78
00:03:56,820 --> 00:03:59,910
The simplest data structure that
we can possibly use are arrays.

79
00:03:59,910 --> 00:04:04,430
Back when I was a grad student,
I went through and got a PhD

80
00:04:04,430 --> 00:04:08,940
writing programs in C, never
using any other structure

81
00:04:08,940 --> 00:04:12,370
than arrays, because I
didn't like pointers.

82
00:04:12,370 --> 00:04:15,530
And so open addressing
is a way that you

83
00:04:15,530 --> 00:04:18,810
can implement hash tables
doing exactly this.

84
00:04:18,810 --> 00:04:22,300
And in particular,
what we're going to do

85
00:04:22,300 --> 00:04:25,055
is assume an array
structure with items.

86
00:04:31,810 --> 00:04:37,390
And we're going to assume
that there's one item-- at most

87
00:04:37,390 --> 00:04:38,630
one item per slot.

88
00:04:41,580 --> 00:04:44,970
So m has to be greater
than or equal to n, right?

89
00:04:44,970 --> 00:04:48,240
So this is important because
we don't have linked lists.

90
00:04:48,240 --> 00:04:51,960
We can't arbitrarily
increase the storage

91
00:04:51,960 --> 00:04:56,860
of a slot using
a chain, and have

92
00:04:56,860 --> 00:04:59,060
n, which is the
number of elements,

93
00:04:59,060 --> 00:05:01,040
be greater than m, right?

94
00:05:01,040 --> 00:05:06,290
Which you could in the linked
list table with chaining.

95
00:05:06,290 --> 00:05:09,100
But here you only have
these array locations,

96
00:05:09,100 --> 00:05:11,510
these indices that you
can put items into.

97
00:05:11,510 --> 00:05:16,620
So it's pretty much guaranteed
that if you want a working open

98
00:05:16,620 --> 00:05:23,990
addressing hash table that m,
which is the number of slots

99
00:05:23,990 --> 00:05:29,080
in the table, should be greater
than or equal to the number

100
00:05:29,080 --> 00:05:31,660
of elements, all right?

101
00:05:31,660 --> 00:05:34,580
That's important.

102
00:05:34,580 --> 00:05:36,510
Now how does this work.

103
00:05:36,510 --> 00:05:38,990
Well, we're going to have
this notion of probing.

104
00:05:44,250 --> 00:05:48,160
And the notion of
probing is that we're

105
00:05:48,160 --> 00:05:53,160
going to try to see if
we can insert something

106
00:05:53,160 --> 00:05:56,016
into this hash table,
and if you fail

107
00:05:56,016 --> 00:05:57,390
we're actually
going to recompute

108
00:05:57,390 --> 00:06:00,410
a slightly different
hash for the key

109
00:06:00,410 --> 00:06:02,160
that we're trying to
insert, the key value

110
00:06:02,160 --> 00:06:03,535
pair that we're
trying to insert.

111
00:06:03,535 --> 00:06:04,034
All right?

112
00:06:04,034 --> 00:06:05,960
So this is an iterative
process, and we're

113
00:06:05,960 --> 00:06:09,920
going to continually probe
until we find an empty slot

114
00:06:09,920 --> 00:06:13,560
into which we can insert
this key value pair.

115
00:06:13,560 --> 00:06:15,750
The key should index into it.

116
00:06:15,750 --> 00:06:19,570
So you do have
different hashes that

117
00:06:19,570 --> 00:06:22,190
are going to be computed
based on this probing

118
00:06:22,190 --> 00:06:24,480
notion for a given key.

119
00:06:24,480 --> 00:06:27,050
All right?

120
00:06:27,050 --> 00:06:31,390
And so what we need
now is a hash function

121
00:06:31,390 --> 00:06:35,370
that's different from the
standard hash functions

122
00:06:35,370 --> 00:06:42,470
that we've talked about
so far, which specifies

123
00:06:42,470 --> 00:06:51,210
the order of slots to
probe, which is basically

124
00:06:51,210 --> 00:06:52,350
to try for a key.

125
00:06:58,570 --> 00:07:06,080
And this is going to be true
for insert, search, and delete,

126
00:07:06,080 --> 00:07:08,190
which are three
basic operations.

127
00:07:08,190 --> 00:07:10,820
And they're a little bit
different, all right?

128
00:07:10,820 --> 00:07:14,224
Just like they were different
for the chaining hash table,

129
00:07:14,224 --> 00:07:16,640
they're different here, but
they're kind of more different

130
00:07:16,640 --> 00:07:17,190
here.

131
00:07:17,190 --> 00:07:19,620
And you'll see what I mean
when we go through this.

132
00:07:22,660 --> 00:07:25,270
And this is not
just for one slot.

133
00:07:25,270 --> 00:07:28,190
It's going to specify
an order of slots.

134
00:07:28,190 --> 00:07:32,180
And so our hash
function h is going

135
00:07:32,180 --> 00:07:49,890
to take the universe
of keys and also take

136
00:07:49,890 --> 00:07:53,120
what we're going to
call the trial count.

137
00:07:53,120 --> 00:07:57,660
So if you're lucky-- well, you
get lucky in your first trial.

138
00:07:57,660 --> 00:08:01,320
And if you're not, you hope to
get lucky in your second trial,

139
00:08:01,320 --> 00:08:02,710
and so on and so forth.

140
00:08:02,710 --> 00:08:08,090
But the hash function is
going to take two arguments.

141
00:08:08,090 --> 00:08:12,580
It's going to take the
key as an argument,

142
00:08:12,580 --> 00:08:17,110
and it's going to take a trial,
which is an integer between 0

143
00:08:17,110 --> 00:08:19,440
and m minus 1, all right?

144
00:08:19,440 --> 00:08:23,960
And it's going to produce-- just
like the chaining hash function

145
00:08:23,960 --> 00:08:31,660
it's going to produce a number
between 0 and m minus 1, right?

146
00:08:31,660 --> 00:08:34,030
Where m is the number
of slots in the table.

147
00:08:34,030 --> 00:08:35,919
All right.

148
00:08:35,919 --> 00:08:39,150
So that's the story.

149
00:08:39,150 --> 00:08:48,360
In order to ensure that you
are using the hash table

150
00:08:48,360 --> 00:08:54,770
corresponding to open addressing
properly, what you want

151
00:08:54,770 --> 00:09:01,680
is-- and this is an important
property-- that h k 1,

152
00:09:01,680 --> 00:09:03,640
so that's a key
that you're given.

153
00:09:03,640 --> 00:09:08,360
And this could be an
arbitrary key, mind you.

154
00:09:08,360 --> 00:09:17,770
So arbitrary key k.

155
00:09:17,770 --> 00:09:20,430
And what you have in
terms of the slots that

156
00:09:20,430 --> 00:09:27,880
are being computed is
this, h k 1, h k 2,

157
00:09:27,880 --> 00:09:33,520
and so on and so forth
to h k m minus 1.

158
00:09:33,520 --> 00:09:40,990
And what you want
is for this vector

159
00:09:40,990 --> 00:09:54,510
to be a permutation of 0,
1, and so on to m minus 1.

160
00:09:54,510 --> 00:09:57,320
And the reason for this
hopefully is clear.

161
00:09:57,320 --> 00:10:01,900
It's because you want
to be able to use

162
00:10:01,900 --> 00:10:04,660
the entirety of your hash table.

163
00:10:04,660 --> 00:10:07,810
You don't want particular
slots to go unused.

164
00:10:07,810 --> 00:10:13,920
And when you get to the point
where the number of elements n

165
00:10:13,920 --> 00:10:20,030
is pretty close to m, and maybe
there's just one slot left, OK?

166
00:10:20,030 --> 00:10:25,280
And you want to fill up this
last slot with this key k

167
00:10:25,280 --> 00:10:27,740
that you want to put
in there, and what

168
00:10:27,740 --> 00:10:30,460
you want to be able to say is
that for this arbitrary key k

169
00:10:30,460 --> 00:10:34,260
that you want to put in there
that the one slot that's free--

170
00:10:34,260 --> 00:10:35,730
and it could be that first slot.

171
00:10:35,730 --> 00:10:37,400
It could be the 17th slot.

172
00:10:37,400 --> 00:10:40,340
Whatever-- That eventually
the sequence of probes

173
00:10:40,340 --> 00:10:43,920
is going to be able to allow
you to insert into that slot.

174
00:10:43,920 --> 00:10:45,230
All right?
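
A minimal sketch of the permutation property, assuming integer keys and trials numbered from 0. The probe functions `h` and `h_bad` are hypothetical illustrations, not the functions used in the lecture: `h` is plain linear probing and visits every slot, while `h_bad` steps by 2 and, with an even table size, can never reach the odd slots.

```python
def h(key: int, trial: int, m: int) -> int:
    """Hypothetical probe function: linear probing starting at key mod m."""
    return (key % m + trial) % m


def h_bad(key: int, trial: int, m: int) -> int:
    """Hypothetical probe function that steps by 2 and may skip slots."""
    return (key % m + 2 * trial) % m


def probe_sequence(probe, key: int, m: int) -> list:
    """Slots visited on trials 0 .. m-1 for the given key."""
    return [probe(key, trial, m) for trial in range(m)]


def is_permutation(seq: list, m: int) -> bool:
    """True if seq visits each of the m slots exactly once."""
    return sorted(seq) == list(range(m))


good = probe_sequence(h, 496, 7)
bad = probe_sequence(h_bad, 496, 8)
print(good, is_permutation(good, 7))  # [6, 0, 1, 2, 3, 4, 5] True
print(bad, is_permutation(bad, 8))    # [0, 2, 4, 6, 0, 2, 4, 6] False
```

With `h_bad`, a key that starts on an even slot could never be placed in a free odd slot, which is exactly the situation the permutation property rules out.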

175
00:10:45,230 --> 00:10:47,450
And we generalize
this notion into

176
00:10:47,450 --> 00:10:51,180
the uniform hashing
assumption in a few minutes,

177
00:10:51,180 --> 00:10:53,650
but hopefully this makes
sense from a standpoint

178
00:10:53,650 --> 00:10:57,150
of really load
balancing the table

179
00:10:57,150 --> 00:11:00,590
and ensuring that all
slots in the table

180
00:11:00,590 --> 00:11:02,600
are sort of equal
opportunity slots.

181
00:11:02,600 --> 00:11:08,270
That you're going to be able to
put keys in them as long as you

182
00:11:08,270 --> 00:11:11,670
probe long enough that you're
going to be able to get there.

183
00:11:11,670 --> 00:11:14,150
Now of course the
fact that you're

184
00:11:14,150 --> 00:11:16,650
using one particular slot
for one particular key

185
00:11:16,650 --> 00:11:18,580
depends on the order
of keys that you're

186
00:11:18,580 --> 00:11:20,300
inserting into this table.

187
00:11:20,300 --> 00:11:24,140
Again, you'll see that as we go
through an example, all right?

188
00:11:24,140 --> 00:11:25,260
So that's the set up.

189
00:11:25,260 --> 00:11:27,800
That's the open
addressing notion.

190
00:11:27,800 --> 00:11:30,670
And that as you
can see, we're just

191
00:11:30,670 --> 00:11:34,080
going to go through
a sequence of probes

192
00:11:34,080 --> 00:11:36,200
and our hash function
is going to tell us

193
00:11:36,200 --> 00:11:38,950
what the sequence is, and
so we don't need any pointers

194
00:11:38,950 --> 00:11:41,350
or anything like that.

195
00:11:41,350 --> 00:11:50,340
So let's take a look at how
this might work in practice.

196
00:11:50,340 --> 00:11:55,890
So maybe the easiest thing to
do is to run through an example,

197
00:11:55,890 --> 00:11:57,870
and then I'll show
you some pseudocode.

198
00:11:57,870 --> 00:12:01,800
But let's say that
I have a table here,

199
00:12:01,800 --> 00:12:07,060
and I'm going to concentrate
on the insert operation.

200
00:12:07,060 --> 00:12:10,530
And I'm going to start inserting
things into this table.

201
00:12:17,130 --> 00:12:19,690
And right here I have
seven slots up there.

202
00:12:19,690 --> 00:12:27,840
So let's say that I want to
insert 586 into the table,

203
00:12:27,840 --> 00:12:35,020
and I compute h of 586 comma
1, and that gives me 1.

204
00:12:35,020 --> 00:12:35,760
OK?

205
00:12:35,760 --> 00:12:37,220
This is the first insert.

206
00:12:37,220 --> 00:12:42,650
So I'm going to go ahead and
stick 586 in here, all right?

207
00:12:42,650 --> 00:12:47,730
And then I insert, for
argument's sake, 133.

208
00:12:47,730 --> 00:12:50,600
I insert 204 out here.

209
00:12:50,600 --> 00:12:54,490
And these are all things
because the hash table is empty.

210
00:12:54,490 --> 00:12:57,900
481 out here and so on.

211
00:12:57,900 --> 00:12:59,800
And because the
hash table is empty,

212
00:12:59,800 --> 00:13:03,800
my very first trial is
successful, all right?

213
00:13:03,800 --> 00:13:11,190
So h of 481-- I'm not going to
write this all out, but h 481 1

214
00:13:11,190 --> 00:13:15,280
happens to be 6 and so on.

215
00:13:15,280 --> 00:13:15,820
All right?

216
00:13:15,820 --> 00:13:24,910
Now I get to the point
where I want to insert 496.

217
00:13:24,910 --> 00:13:40,700
And when I try to insert
496, I have h 496 1.

218
00:13:40,700 --> 00:13:43,490
It happens to be 4.

219
00:13:43,490 --> 00:13:44,450
OK?

220
00:13:44,450 --> 00:13:48,230
So the first thing that
happens is I go in here,

221
00:13:48,230 --> 00:13:50,070
and I say oops.

222
00:13:50,070 --> 00:13:54,990
This slot is occupied,
because this-- I'm

223
00:13:54,990 --> 00:14:00,470
going to have a special flag
associated with an empty slot,

224
00:14:00,470 --> 00:14:03,830
and we can say it's none.

225
00:14:03,830 --> 00:14:06,020
And if it's not none,
then it's occupied.

226
00:14:06,020 --> 00:14:08,080
And 204 is not equal to none.

227
00:14:08,080 --> 00:14:14,510
So I look at this, and I say
the first probe actually failed.

228
00:14:14,510 --> 00:14:15,160
OK?

229
00:14:15,160 --> 00:14:30,150
And so h 496 1 equals 4 fails,
so I need to go do h 496 2.

230
00:14:30,150 --> 00:14:36,180
And h 496 2 may also fail.

231
00:14:36,180 --> 00:14:45,910
You might be in a situation
where h 496 2 gives you 586.

232
00:14:45,910 --> 00:14:56,850
So this was h 496 1 h
496 2 might give you 586.

233
00:14:56,850 --> 00:15:03,650
And finally it may be that h 496
3, which is your third attempt,

234
00:15:03,650 --> 00:15:05,130
equals 3.

235
00:15:05,130 --> 00:15:07,560
So you go in, and you say great.

236
00:15:07,560 --> 00:15:10,210
I can insert 496.

237
00:15:10,210 --> 00:15:11,770
And let me write
that in bold here.

238
00:15:14,830 --> 00:15:16,190
Out there.

239
00:15:16,190 --> 00:15:16,780
All right?

240
00:15:16,780 --> 00:15:18,750
So pretty straightforward.

241
00:15:18,750 --> 00:15:23,620
In this case, you've gone
through three trials in order

242
00:15:23,620 --> 00:15:25,490
to find an empty slot.
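
The board example can be replayed with a short sketch. The `PROBES` table is hypothetical: only the leading entries (586 to slot 1, 133 to slot 2, 204 to slot 4, 481 to slot 6, and 496 trying slots 4, 1, 3) come from the lecture, and the remaining entries are made up so that each row is a permutation of the seven slots. For brevity the sketch stores bare keys rather than key-value pairs.

```python
M = 7  # number of slots, as on the board

# Hypothetical probe orders; only the leading entries are from the lecture.
PROBES = {
    586: [1, 0, 2, 3, 4, 5, 6],
    133: [2, 1, 0, 3, 4, 5, 6],
    204: [4, 3, 2, 1, 0, 5, 6],
    481: [6, 5, 4, 3, 2, 1, 0],
    496: [4, 1, 3, 0, 2, 5, 6],
}


def h(key: int, trial: int) -> int:
    """Slot to try on the given trial (1-based, as in h(496, 1))."""
    return PROBES[key][trial - 1]


def insert(table: list, key: int) -> int:
    """Probe until an empty (None) slot is found; return the trials used."""
    for trial in range(1, M + 1):
        slot = h(key, trial)
        if table[slot] is None:
            table[slot] = key
            return trial
    raise RuntimeError("table is full")


table = [None] * M
for key in (586, 133, 204, 481):
    insert(table, key)       # each succeeds on its first trial
trials = insert(table, 496)  # slots 4 and 1 are taken, slot 3 is free
print(table)   # [None, 586, 133, 496, 204, None, 481]
print(trials)  # 3
```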

243
00:15:25,490 --> 00:15:28,560
And so the big
question really here is

244
00:15:28,560 --> 00:15:32,580
other than taking care of
search and delete, how long is

245
00:15:32,580 --> 00:15:34,060
this process going to take?

246
00:15:34,060 --> 00:15:34,990
All right?

247
00:15:34,990 --> 00:15:37,800
And I'll talk about
that in a few minutes,

248
00:15:37,800 --> 00:15:41,350
but let me explain,
now that you've

249
00:15:41,350 --> 00:15:45,820
seen insert, how search
would work, right?

250
00:15:45,820 --> 00:15:50,510
Or maybe I get one of
you guys to explain to me

251
00:15:50,510 --> 00:15:55,230
once you have insert,
how would search work?

252
00:15:55,230 --> 00:15:55,730
Someone?

253
00:15:58,550 --> 00:15:59,938
Someone from the back?

254
00:16:03,290 --> 00:16:04,800
No one.

255
00:16:04,800 --> 00:16:08,450
You guys are always
answering questions.

256
00:16:08,450 --> 00:16:09,800
Yeah, all the way in the back.

257
00:16:09,800 --> 00:16:11,960
AUDIENCE: Would you
just do the same kind

258
00:16:11,960 --> 00:16:18,022
of probing [INAUDIBLE] where you
find it or you don't find it?

259
00:16:18,022 --> 00:16:18,730
PROFESSOR: Right.

260
00:16:18,730 --> 00:16:19,560
So you do exactly.

261
00:16:19,560 --> 00:16:20,790
It's very similar to insert.

262
00:16:23,810 --> 00:16:26,380
You have a situation
where you're

263
00:16:26,380 --> 00:16:33,840
going to none would
indicate an empty slot.

264
00:16:33,840 --> 00:16:37,660
And you can think of
this as being a flag.

265
00:16:37,660 --> 00:16:45,450
And in the case of insert,
what you did was you--

266
00:16:45,450 --> 00:16:53,120
insert k v would
say keep probing.

267
00:16:53,120 --> 00:16:56,020
I'm not going to write
the pseudocode for it.

268
00:16:56,020 --> 00:17:03,980
Keep probing until an
empty slot is found.

269
00:17:06,630 --> 00:17:08,480
And then when it's
found, insert item.

270
00:17:16,560 --> 00:17:19,930
And as long as you have
the permutation property

271
00:17:19,930 --> 00:17:23,150
that we have up there, and
given that m is greater than

272
00:17:23,150 --> 00:17:26,260
or equal to n, you're
guaranteed that insert

273
00:17:26,260 --> 00:17:28,060
is going to find a slot.

274
00:17:28,060 --> 00:17:28,560
OK?

275
00:17:28,560 --> 00:17:29,870
That's the good news.

276
00:17:29,870 --> 00:17:31,420
Now it might take
a while, and so we

277
00:17:31,420 --> 00:17:35,970
have to talk about performance
a bit later, but it'll work.

278
00:17:35,970 --> 00:17:36,800
OK?

279
00:17:36,800 --> 00:17:39,110
Now search is a
little bit different.

280
00:17:42,490 --> 00:17:50,706
You're searching for a
key k, and you essentially

281
00:17:50,706 --> 00:17:52,080
say you're going
to keep probing.

282
00:17:52,080 --> 00:18:04,290
And you say as long as
the slots encountered

283
00:18:04,290 --> 00:18:14,160
are occupied by
keys not equal to k.

284
00:18:14,160 --> 00:18:16,440
So every time you
probe, you go in there

285
00:18:16,440 --> 00:18:18,440
and you say I got a key.

286
00:18:18,440 --> 00:18:20,830
I found a hash for it.

287
00:18:20,830 --> 00:18:22,500
I go to this particular slot.

288
00:18:22,500 --> 00:18:25,270
I look inside of it,
and I check to see

289
00:18:25,270 --> 00:18:28,000
whether the key that's
stored inside of it

290
00:18:28,000 --> 00:18:31,170
is the same as the
key I'm searching for.

291
00:18:31,170 --> 00:18:34,990
If not, I go to the next trial.

292
00:18:34,990 --> 00:18:37,130
If it is, then I return it.

293
00:18:37,130 --> 00:18:37,630
Right?

294
00:18:37,630 --> 00:18:41,440
So that's pretty much it.

295
00:18:41,440 --> 00:19:00,690
And we keep probing until you
either encounter k or find

296
00:19:00,690 --> 00:19:01,420
an empty slot.

297
00:19:04,930 --> 00:19:05,920
And this is the key.

298
00:19:08,714 --> 00:19:09,380
No pun intended.

299
00:19:12,230 --> 00:19:16,680
A notion which is that when
you find an empty slot,

300
00:19:16,680 --> 00:19:21,840
it means that you have
failed to discover this key.

301
00:19:21,840 --> 00:19:24,272
You fail to-- yeah,
question back there?

302
00:19:24,272 --> 00:19:27,170
AUDIENCE: What happens if you
were to delete a key though?

303
00:19:27,170 --> 00:19:29,670
PROFESSOR: I'll make you answer
that question for a cushion.

304
00:19:32,200 --> 00:19:34,744
So we'll get to
delete in a minute.

305
00:19:34,744 --> 00:19:36,160
But I want to make
sure you're all

306
00:19:36,160 --> 00:19:39,170
on board with insert and search.

307
00:19:39,170 --> 00:19:39,920
OK?

308
00:19:39,920 --> 00:19:43,280
So these are actually fairly
straightforward in comparison

309
00:19:43,280 --> 00:19:43,780
to delete.

310
00:19:43,780 --> 00:19:45,850
It's not like delete is
much more complicated,

311
00:19:45,850 --> 00:19:48,854
but there is a subtlety there.

312
00:19:48,854 --> 00:19:50,270
And so that's kind
of neat, right?

313
00:19:50,270 --> 00:19:52,630
I mean this actually works.

314
00:19:52,630 --> 00:19:58,700
So if you had a situation where
you were just accumulating

315
00:19:58,700 --> 00:20:02,920
keys, and you're looking for
the number of distinct elements

316
00:20:02,920 --> 00:20:05,360
in the stream of data
that was coming in,

317
00:20:05,360 --> 00:20:08,090
and that was pretty much it
with respect to your program,

318
00:20:08,090 --> 00:20:11,940
you'd never have to delete
keys, and this would be all

319
00:20:11,940 --> 00:20:13,410
that you'd have to implement.
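
A sketch of that insert-and-search-only use case: counting the distinct keys in a stream. It assumes integer keys, a table with at least as many slots as distinct keys, and a hypothetical linear-probing function `probe`; none of these choices come from the lecture.

```python
def probe(key: int, trial: int, m: int) -> int:
    """Hypothetical probe function: linear probing over integer keys."""
    return (key + trial) % m


def search(table: list, key: int) -> bool:
    """Follow the probe sequence until key or an empty slot is found."""
    m = len(table)
    for trial in range(m):
        slot = probe(key, trial, m)
        if table[slot] is None:
            return False  # insert would have stopped here, so key is absent
        if table[slot] == key:
            return True
    return False  # every slot probed without finding key


def insert(table: list, key: int) -> None:
    """Place key in the first empty slot along its probe sequence."""
    m = len(table)
    for trial in range(m):
        slot = probe(key, trial, m)
        if table[slot] is None:
            table[slot] = key
            return
    raise RuntimeError("table is full")


def count_distinct(stream, m: int) -> int:
    """Count distinct keys using only search and insert (no delete)."""
    table = [None] * m
    distinct = 0
    for key in stream:
        if not search(table, key):
            insert(table, key)
            distinct += 1
    return distinct


print(count_distinct([586, 133, 204, 586, 481, 133, 496], 7))  # 5
```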

320
00:20:13,410 --> 00:20:14,280
Right?

321
00:20:14,280 --> 00:20:17,690
But let's talk about delete.

322
00:20:17,690 --> 00:20:19,870
Every once in a while we'd
want to delete a key?

323
00:20:19,870 --> 00:20:20,570
Yeah, you had a question?

324
00:20:20,570 --> 00:20:22,278
AUDIENCE: I have a
question about search.

325
00:20:22,278 --> 00:20:25,350
Why do you stop searching
once you find an empty slot?

326
00:20:25,350 --> 00:20:27,070
PROFESSOR: Because
you're searching.

327
00:20:27,070 --> 00:20:30,010
So what that means
is that you're

328
00:20:30,010 --> 00:20:34,120
looking to see if this key
were already in the table.

329
00:20:34,120 --> 00:20:37,150
And if key were
already in the table,

330
00:20:37,150 --> 00:20:39,870
you want to return the value
associated with that key.

331
00:20:39,870 --> 00:20:42,260
If you find an empty
slot, since you're

332
00:20:42,260 --> 00:20:47,540
using the same deterministic
sequence of probes

333
00:20:47,540 --> 00:20:50,220
that you would have if
you had inserted it,

334
00:20:50,220 --> 00:20:52,210
then-- that make sense?

335
00:20:52,210 --> 00:20:53,320
Good.

336
00:20:53,320 --> 00:20:54,080
All right.

337
00:20:54,080 --> 00:20:56,500
So so far so good?

338
00:20:56,500 --> 00:21:00,550
That's what works for
insert and search.

339
00:21:00,550 --> 00:21:01,530
Let's talk delete.

340
00:21:01,530 --> 00:21:04,216
So back there.

341
00:21:04,216 --> 00:21:05,210
How does delete work?

342
00:21:09,070 --> 00:21:12,428
AUDIENCE: Well
[INAUDIBLE] if you

343
00:21:12,428 --> 00:21:16,412
search until you find
the none and assume

344
00:21:16,412 --> 00:21:20,396
that the key you're searching
for was not put in there.

345
00:21:20,396 --> 00:21:25,210
But let's say you had one
that was in that slot before

346
00:21:25,210 --> 00:21:26,710
and it got put back
in, but then you

347
00:21:26,710 --> 00:21:28,501
delete the one that
was in the slot before.

348
00:21:28,501 --> 00:21:29,747
PROFESSOR: Great, great.

349
00:21:29,747 --> 00:21:31,330
You haven't told me
how to fix it yet,

350
00:21:31,330 --> 00:21:35,340
but do you have
the guts for this?

351
00:21:35,340 --> 00:21:37,040
No.

352
00:21:37,040 --> 00:21:39,460
OK, I think this
veers to the right.

353
00:21:39,460 --> 00:21:41,906
I always wanted to do this
to somebody in the back.

354
00:21:41,906 --> 00:21:44,236
All right.

355
00:21:44,236 --> 00:21:45,170
Whoa.

356
00:21:45,170 --> 00:21:48,580
All right, good catch.

357
00:21:48,580 --> 00:21:49,230
All right.

358
00:21:49,230 --> 00:21:49,820
OK.

359
00:21:49,820 --> 00:21:51,830
So you pointed out
the problem, and I'm

360
00:21:51,830 --> 00:21:53,800
going to ask somebody
else for a solution.

361
00:21:53,800 --> 00:21:55,800
All right?

362
00:21:55,800 --> 00:21:57,570
But here's the problem.

363
00:21:57,570 --> 00:21:59,100
Here's the problem,
and we can look

364
00:21:59,100 --> 00:22:04,560
at it from a standpoint of
that example right there.

365
00:22:04,560 --> 00:22:08,700
Let's say for argument's
sake that I'm searching-- now

366
00:22:08,700 --> 00:22:11,840
I've done all of the inserts
that I have up there, OK?

367
00:22:11,840 --> 00:22:14,200
So I've inserted 496.

368
00:22:14,200 --> 00:22:14,860
All right?

369
00:22:14,860 --> 00:22:21,840
Then I delete 586
from the table, OK?

370
00:22:21,840 --> 00:22:24,500
I delete 586 from the table.

371
00:22:24,500 --> 00:22:30,080
So let's just say
that what I end up

372
00:22:30,080 --> 00:22:38,910
doing-- I have 586,
133, 496, and then

373
00:22:38,910 --> 00:22:42,780
I have 204, and then a 481.

374
00:22:42,780 --> 00:22:47,770
And this is 0, 1, 2, et cetera.

375
00:22:47,770 --> 00:22:52,270
So I'm deleting 586, and let's
say I replace it with none.

376
00:22:52,270 --> 00:22:53,300
OK?

377
00:22:53,300 --> 00:22:55,130
Let's just say I
replace it with none.

378
00:22:55,130 --> 00:23:03,670
Now what happens is that when
I search for 496, according

379
00:23:03,670 --> 00:23:09,940
to this search algorithm
what am I going to get?

380
00:23:09,940 --> 00:23:12,040
AUDIENCE: None.

381
00:23:12,040 --> 00:23:15,690
PROFESSOR: Well the first slot
I'm going to look at is 1,

382
00:23:15,690 --> 00:23:18,340
and according to this
search algorithm,

383
00:23:18,340 --> 00:23:21,030
I find an empty slot, right?

384
00:23:21,030 --> 00:23:23,270
And when I find
an empty slot, I'm

385
00:23:23,270 --> 00:23:26,700
going to say I
failed in the search.

386
00:23:26,700 --> 00:23:33,820
If you encounter k, you succeed
and return the key value pair,

387
00:23:33,820 --> 00:23:34,320
right?

388
00:23:34,320 --> 00:23:36,510
Success means you
return the value.

389
00:23:36,510 --> 00:23:38,790
And if you encounter
an empty slot,

390
00:23:38,790 --> 00:23:41,690
it means that you've
decided that this key is not

391
00:23:41,690 --> 00:23:43,630
in the table.

392
00:23:43,630 --> 00:23:46,510
And you say couldn't
find it, right?

393
00:23:46,510 --> 00:23:47,980
That make sense?

394
00:23:47,980 --> 00:23:49,970
So this is obviously
wrong, right?

395
00:23:49,970 --> 00:23:54,200
Because I just inserted
496 into the table.

396
00:23:54,200 --> 00:23:56,520
So this would fail incorrectly.
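
The failure can be reproduced in a few lines. The table contents and the first three probes for 496 (slots 4, 1, 3) follow the board example; the rest of `PROBES_496` is an assumption added to complete the permutation. Following that sequence, search passes slot 4 (holding 204) and then stops at slot 1 once it has been overwritten with None.

```python
# Probe order for 496: first three entries from the lecture, rest assumed.
PROBES_496 = [4, 1, 3, 0, 2, 5, 6]


def naive_search(table: list, key: int, probes: list):
    """Return the slot holding key, or None; stops at the first empty slot."""
    for slot in probes:
        if table[slot] is None:
            return None  # empty slot: conclude that key is not in the table
        if table[slot] == key:
            return slot
    return None


table = [None, 586, 133, 496, 204, None, 481]
before = naive_search(table, 496, PROBES_496)
table[1] = None  # "delete" 586 by overwriting it with None
after = naive_search(table, 496, PROBES_496)
print(before)  # 3
print(after)   # None, even though 496 is still sitting in slot 3
```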

397
00:24:00,560 --> 00:24:02,990
So failed to find
the key, which is OK.

398
00:24:02,990 --> 00:24:05,200
I mean failure is OK
if the key isn't there.

399
00:24:05,200 --> 00:24:07,151
But you don't want
to fail incorrectly.

400
00:24:07,151 --> 00:24:07,650
Right?

401
00:24:07,650 --> 00:24:09,590
Everyone buy that?

402
00:24:09,590 --> 00:24:10,650
Everyone buy that?

403
00:24:10,650 --> 00:24:11,500
Good.

404
00:24:11,500 --> 00:24:12,000
All right.

405
00:24:12,000 --> 00:24:14,170
So how do I fix it.

406
00:24:14,170 --> 00:24:15,460
Someone else?

407
00:24:15,460 --> 00:24:16,960
How do I fix this?

408
00:24:16,960 --> 00:24:18,563
Someone who doesn't
have a cushion.

409
00:24:18,563 --> 00:24:20,686
All right, you.

410
00:24:20,686 --> 00:24:30,110
AUDIENCE: [INAUDIBLE] you can
mark that spot by a, and when

411
00:24:30,110 --> 00:24:34,580
search comes across a,
you just [INAUDIBLE].

412
00:24:34,580 --> 00:24:38,020
PROFESSOR: Right, great answer.

413
00:24:38,020 --> 00:24:40,480
We're now going to have to do
a couple of different things

414
00:24:40,480 --> 00:24:42,340
for insert and search, OK?

415
00:24:42,340 --> 00:24:44,019
It's going to be
subtly different,

416
00:24:44,019 --> 00:24:45,560
but the first thing
we're going to do

417
00:24:45,560 --> 00:24:46,934
is we're going to
have this flag,

418
00:24:46,934 --> 00:24:48,920
and I'll just call
it delete me flag.

419
00:24:48,920 --> 00:24:50,620
OK?

420
00:24:50,620 --> 00:25:00,350
And we're going to say that
when I delete something,

421
00:25:00,350 --> 00:25:09,960
replace deleted item
with not the none flag,

422
00:25:09,960 --> 00:25:15,200
but a different flag that
we'll call delete me.

423
00:25:15,200 --> 00:25:20,140
Is different from none.

424
00:25:24,230 --> 00:25:26,040
And that's going
to be important,

425
00:25:26,040 --> 00:25:28,600
because now that you
have a different flag,

426
00:25:28,600 --> 00:25:35,530
and you replace
586 with delete me,

427
00:25:35,530 --> 00:25:40,900
you can now do different things
in insert versus search, right?

428
00:25:40,900 --> 00:25:43,900
So in particular,
what you would do

429
00:25:43,900 --> 00:25:51,122
is you'd have to
modify this slightly,

430
00:25:51,122 --> 00:25:52,580
because the notion
of an empty slot

431
00:25:52,580 --> 00:25:55,380
means that you're
looking for none, right?

432
00:25:55,380 --> 00:26:00,650
And all it means is that--
well actually in some sense,

433
00:26:00,650 --> 00:26:02,500
the pseudocode
doesn't really change

434
00:26:02,500 --> 00:26:08,160
because if you say
you either encounter k

435
00:26:08,160 --> 00:26:14,510
or you would-- even if
you encounter a delete me,

436
00:26:14,510 --> 00:26:15,720
you keep going.

437
00:26:15,720 --> 00:26:16,220
All right?

438
00:26:16,220 --> 00:26:18,650
That's the important thing.

439
00:26:18,650 --> 00:26:20,570
So I guess it does
change, because I assume

440
00:26:20,570 --> 00:26:23,170
that you have only
two cases here,

441
00:26:23,170 --> 00:26:26,075
but what you really have
now are three cases.

442
00:26:26,075 --> 00:26:28,150
The three cases are
when you're doing

443
00:26:28,150 --> 00:26:30,860
the search is that you
encounter the key, which

444
00:26:30,860 --> 00:26:31,960
is the easy case.

445
00:26:31,960 --> 00:26:32,690
You return it.

446
00:26:32,690 --> 00:26:34,440
You return the value.

447
00:26:34,440 --> 00:26:38,530
Or you can encounter a
delete me flag, in which case

448
00:26:38,530 --> 00:26:40,240
you keep going.

449
00:26:40,240 --> 00:26:42,140
OK?

450
00:26:42,140 --> 00:26:44,930
And if you encounter
an empty slot, which

451
00:26:44,930 --> 00:26:47,012
corresponds to none,
at that point you know

452
00:26:47,012 --> 00:26:49,630
you failed and the key
doesn't exist in the table.

453
00:26:49,630 --> 00:26:50,570
All right?

454
00:26:50,570 --> 00:26:54,310
So let me just write that out.

455
00:26:54,310 --> 00:27:03,040
Insert treats delete
me the same as none.

456
00:27:07,250 --> 00:27:21,070
But search keeps going
and treats it differently.

457
00:27:32,117 --> 00:27:33,200
And that's pretty much it.

458
00:27:33,200 --> 00:27:35,260
So what would happen
in our example?

459
00:27:35,260 --> 00:27:39,840
Well, going through
exactly the same example,

460
00:27:39,840 --> 00:27:43,750
we started from here, and
then we decided to delete 586.

461
00:27:43,750 --> 00:27:51,580
And so if we replaced 586 not
with none, but with delete me,

462
00:27:51,580 --> 00:27:55,260
and the next time around
when you search for 496,

463
00:27:55,260 --> 00:27:57,360
you're searching for 496.

464
00:27:57,360 --> 00:27:58,870
And what would
happen is that you

465
00:27:58,870 --> 00:28:04,010
would go look at 586-- the
slot that contained 586,

466
00:28:04,010 --> 00:28:06,360
and you see that there's
a delete me flag in there.

467
00:28:06,360 --> 00:28:08,400
And so you go to the next trial.

468
00:28:08,400 --> 00:28:14,800
And then in the next trial, you
discover that, in this case,

469
00:28:14,800 --> 00:28:19,210
you have-- I'm sorry.

470
00:28:19,210 --> 00:28:22,330
I had 204 first as
the first trial,

471
00:28:22,330 --> 00:28:26,110
and then in the second
trial I had 586.

472
00:28:26,110 --> 00:28:28,790
And I would continue
beyond the second trial

473
00:28:28,790 --> 00:28:36,080
and get to third trial, and in
fact return 496 in this case.

474
00:28:36,080 --> 00:28:39,752
I would get to returning
496 in my third trial, which

475
00:28:39,752 --> 00:28:40,710
is exactly what I want.

476
00:28:43,780 --> 00:28:46,810
The interesting thing here is
that you can reuse storage.

477
00:28:46,810 --> 00:28:48,850
I mean the whole
point of deleting

478
00:28:48,850 --> 00:28:53,880
is that you can take the storage
and insert other keys in there.

479
00:28:53,880 --> 00:28:56,140
Once you've freed
up the storage.

480
00:28:56,140 --> 00:29:01,780
And you can do that by
making insert treat delete me

481
00:29:01,780 --> 00:29:03,565
the same as the none.

482
00:29:03,565 --> 00:29:05,190
So the next time you
want to insert you

483
00:29:05,190 --> 00:29:09,620
could-- if you happen to index
into the index corresponding

484
00:29:09,620 --> 00:29:12,650
to 586, you can overwrite that.

485
00:29:12,650 --> 00:29:15,920
The delete me flag goes
away, and some other key--

486
00:29:15,920 --> 00:29:20,740
call it 999 or something--
would get in there.

487
00:29:20,740 --> 00:29:23,700
And you're all set with that.
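
A minimal sketch of the delete me idea described above, not the course's reference code. The class name, the `base_hash` parameter, and the linear probe sequence are assumptions made for illustration; any probe sequence that is a permutation of the slots would do.

```python
EMPTY = None            # the "none" flag: slot has never held a key
DELETE_ME = object()    # the "delete me" flag: slot held a key that was deleted


class OpenAddressingTable:
    """Illustrative open addressing table storing keys only (keys must not be None)."""

    def __init__(self, m=8, base_hash=hash):
        self.m = m
        self.slots = [EMPTY] * m
        self.base_hash = base_hash  # hypothetical hook so the demo can force collisions

    def _probe(self, key, i):
        # Assumed probe sequence: slot tried on trial i (linear, for simplicity).
        return (self.base_hash(key) + i) % self.m

    def insert(self, key):
        # Insert treats delete me the same as none: either kind of slot is reusable.
        # (This sketch does not check whether the key is already further along.)
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY or self.slots[j] is DELETE_ME:
                self.slots[j] = key
                return j
        raise RuntimeError("table full")

    def search(self, key):
        # Three cases: found the key, hit delete me (keep going), hit none (fail).
        for i in range(self.m):
            j = self._probe(key, i)
            if self.slots[j] is EMPTY:
                return None
            if self.slots[j] is not DELETE_ME and self.slots[j] == key:
                return j
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETE_ME  # not EMPTY, so later searches keep probing
        return j


# Demo with an artificial hash that sends every key to slot 3.
t = OpenAddressingTable(m=8, base_hash=lambda k: 3)
for k in (204, 586, 496):
    t.insert(k)
t.delete(586)
print(t.search(496))   # still found on the third trial
print(t.insert(999))   # reuses the slot that held 586
```

With the forced collisions, 496 sits two slots past 204, and the search for it walks through the delete me flag rather than stopping there.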

488
00:29:23,700 --> 00:29:24,540
OK?

489
00:29:24,540 --> 00:29:26,380
Any questions?

490
00:29:26,380 --> 00:29:28,530
This all makes sense?

491
00:29:28,530 --> 00:29:33,050
So you could imagine coding
this up with an array structure

492
00:29:33,050 --> 00:29:35,100
is fairly straightforward.

493
00:29:35,100 --> 00:29:38,890
What remains here
to be discussed

494
00:29:38,890 --> 00:29:42,170
is how well does
this work, right?

495
00:29:42,170 --> 00:29:46,270
You have this extra requirement
on the hash function

496
00:29:46,270 --> 00:29:50,930
corresponding to creating
an extra argument

497
00:29:50,930 --> 00:29:53,950
as an input to it, which
is this trial count.

498
00:29:53,950 --> 00:29:57,200
And you'd like to have this
nice property of corresponding

499
00:29:57,200 --> 00:29:58,340
to a permutation.

500
00:29:58,340 --> 00:30:01,150
Can we actually design
hash functions like this?

501
00:30:01,150 --> 00:30:03,380
And we'll take a look
at a bad hash function,

502
00:30:03,380 --> 00:30:05,600
and then at a better one.

503
00:30:05,600 --> 00:30:08,260
So let's talk about
probing strategies, which

504
00:30:08,260 --> 00:30:15,910
is essentially the same
as taking a hash function

505
00:30:15,910 --> 00:30:18,570
and changing it
so it is actually

506
00:30:18,570 --> 00:30:21,240
applicable to open addressing.

507
00:30:21,240 --> 00:30:30,480
So the notion of
linear probing is

508
00:30:30,480 --> 00:30:40,920
that you do h k i
equals h prime k, which

509
00:30:40,920 --> 00:30:43,220
is some hash function
that you've chosen,

510
00:30:43,220 --> 00:30:49,585
plus i mod m, where this is
an ordinary hash function.

511
00:30:54,620 --> 00:30:55,460
OK?

512
00:30:55,460 --> 00:30:57,001
So that looks pretty
straightforward.

513
00:31:01,280 --> 00:31:02,100
What happens here?

514
00:31:02,100 --> 00:31:05,220
Does this satisfy the
permutation argument?

515
00:31:08,785 --> 00:31:10,500
Before I forget.

516
00:31:10,500 --> 00:31:13,680
Does it satisfy the
permutation property

517
00:31:13,680 --> 00:31:19,800
that I want h k 1, h k 2, h k
m minus 1 to be a permutation?

518
00:31:19,800 --> 00:31:20,580
That make sense?

519
00:31:20,580 --> 00:31:21,380
Yep, yep.

520
00:31:21,380 --> 00:31:23,240
Because then I start adding.

521
00:31:23,240 --> 00:31:26,780
The mod is precisely kind
of this round robin cycle,

522
00:31:26,780 --> 00:31:28,780
so it's going to
satisfy the permutation.

523
00:31:28,780 --> 00:31:29,320
That's good.
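
A small sketch of the linear probing formula h(k, i) = (h'(k) + i) mod m, checking the permutation property for one key. The table size and the value 5 for h'(k) are made up for illustration.

```python
def linear_probe(h_prime_k, i, m):
    """Slot tried on trial i under linear probing: (h'(k) + i) mod m."""
    return (h_prime_k + i) % m


m = 8
h_prime_k = 5  # made-up value of the ordinary hash h'(k) for some key k

sequence = [linear_probe(h_prime_k, i, m) for i in range(m)]
print(sequence)                            # wraps around past the end of the table
print(sorted(sequence) == list(range(m)))  # every slot appears exactly once
```

The mod is what gives the round robin behavior: the sequence starts at h'(k), runs to the end of the table, and wraps to slot 0.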

524
00:31:34,120 --> 00:31:37,170
What's wrong with this?

525
00:31:37,170 --> 00:31:39,620
What's wrong with this?

526
00:31:39,620 --> 00:31:40,120
Someone?

527
00:31:43,120 --> 00:31:47,620
AUDIENCE: The fact that
[INAUDIBLE] keys, which they're

528
00:31:47,620 --> 00:31:50,620
all filled, then if you hit
anywhere in here [INAUDIBLE]

529
00:31:50,620 --> 00:31:51,974
list of consecutive keys.

530
00:31:51,974 --> 00:31:52,640
PROFESSOR: Right.

531
00:31:52,640 --> 00:31:53,390
That's excellent.

532
00:31:53,390 --> 00:31:54,740
Excellent, excellent answer.

533
00:31:54,740 --> 00:31:59,390
So this notion of
clustering is basically

534
00:31:59,390 --> 00:32:01,370
what's wrong with
this probing strategy.

535
00:32:01,370 --> 00:32:05,430
And in fact, I'm not going to
do this particular analysis,

536
00:32:05,430 --> 00:32:10,820
but I'll give you a sense of why
the statement I'm going to make

537
00:32:10,820 --> 00:32:11,760
is true.

538
00:32:11,760 --> 00:32:13,840
But the notion of
clustering is that you

539
00:32:13,840 --> 00:32:18,530
start getting consecutive
groups of occupied slots, OK?

540
00:32:27,850 --> 00:32:28,780
Which keep growing.

541
00:32:32,820 --> 00:32:36,780
And so these clusters
get longer and longer.

542
00:32:36,780 --> 00:32:38,950
And if you have a
big cluster, it's

543
00:32:38,950 --> 00:32:41,020
more likely to
grow bigger, right?

544
00:32:41,020 --> 00:32:41,840
Which is bad.

545
00:32:41,840 --> 00:32:44,879
This is exactly the wrong thing
for load balancing, right?

546
00:32:44,879 --> 00:32:47,170
And clustering is the reverse
of load balancing, right?

547
00:32:47,170 --> 00:32:48,970
If you have a bunch
of clumps and you

548
00:32:48,970 --> 00:32:52,101
have a bunch of empty space
in your table, that's bad.

549
00:32:52,101 --> 00:32:52,600
Right?

550
00:32:52,600 --> 00:32:54,100
The problem with
linear probing is

551
00:32:54,100 --> 00:32:57,940
that once you start getting a
cluster, given the, let's say,

552
00:32:57,940 --> 00:33:00,110
the randomness in the hash
function, and h prime k

553
00:33:00,110 --> 00:33:03,470
is a pretty good hash function
and can randomly go anywhere.

554
00:33:03,470 --> 00:33:07,140
Well, if you have 100 slots and
you have a cluster of size 4,

555
00:33:07,140 --> 00:33:10,900
well there's a 4/100
chance, which is obviously

556
00:33:10,900 --> 00:33:15,050
four times greater than
1/100, even I can do that,

557
00:33:15,050 --> 00:33:17,760
to go into those four slots.

558
00:33:17,760 --> 00:33:19,480
And if you go
into those four slots

559
00:33:19,480 --> 00:33:22,440
you're going to keep
going down to the bottom,

560
00:33:22,440 --> 00:33:27,500
and you're going to make that
a cluster of size five, right?

561
00:33:27,500 --> 00:33:30,520
So that's the problem
with linear probing,

562
00:33:30,520 --> 00:33:34,290
and you can essentially
argue through making

563
00:33:34,290 --> 00:33:40,250
some probabilistic assumptions
that if, in fact, you

564
00:33:40,250 --> 00:33:47,040
use linear probing that you
lose your average constant time

565
00:33:47,040 --> 00:33:51,760
look up in your hash table
for most load factors.

566
00:33:51,760 --> 00:33:54,900
So what's happening out
here pictorially really

567
00:33:54,900 --> 00:33:57,870
is that you have a table and
let's say you have a cluster.

568
00:34:02,060 --> 00:34:03,460
And this is your cluster.

569
00:34:06,220 --> 00:34:10,440
So if your h k 1--
it doesn't really

570
00:34:10,440 --> 00:34:15,679
matter what it is-- but h
k i maps to this cluster,

571
00:34:15,679 --> 00:34:18,679
then you're going
to-- linear probing

572
00:34:18,679 --> 00:34:21,239
says that the next thing
you're going to try

573
00:34:21,239 --> 00:34:24,544
is if you map to
42 in the cluster,

574
00:34:24,544 --> 00:34:25,960
the next thing
you're going to try

575
00:34:25,960 --> 00:34:32,370
is 43, 44, until you get maybe
to this slot here, which is 57,

576
00:34:32,370 --> 00:34:34,020
for argument's sake.

577
00:34:34,020 --> 00:34:34,520
Right?

578
00:34:34,520 --> 00:34:36,228
So you're going to
keep going, and you're

579
00:34:36,228 --> 00:34:41,300
going to try 15 times in
this relatively dumb fashion

580
00:34:41,300 --> 00:34:45,730
to go down to get to the
open slot, which is 57.

581
00:34:45,730 --> 00:34:47,840
And oh, by the way,
at the end of this you

582
00:34:47,840 --> 00:34:51,159
just increased your
cluster length by one.

583
00:34:51,159 --> 00:34:51,969
All right?

584
00:34:51,969 --> 00:34:53,820
So it doesn't really work.
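
The cluster example can be replayed in a few lines. This is a sketch under assumed numbers: a table of 100 slots with slots 42 through 56 occupied, so that 57 is the first open slot. The function name is hypothetical.

```python
def first_open_slot(occupied, start, m):
    """Walk linearly from `start`; return (open slot, occupied slots stepped over)."""
    skipped = 0
    for i in range(m):
        j = (start + i) % m
        if j not in occupied:
            return j, skipped
        skipped += 1
    raise RuntimeError("table full")


m = 100
cluster = set(range(42, 57))   # assumed cluster: slots 42..56 are occupied

slot, skipped = first_open_slot(cluster, 42, m)
print(slot, skipped)           # lands on 57 after stepping over the whole cluster

# A uniformly random h'(k) lands inside the cluster with probability len(cluster) / m.
print(len(cluster) / m)

cluster.add(slot)              # the insert itself makes the cluster one longer
print(len(cluster))
```

The bigger the cluster, the likelier the next key is to land in it, and every such landing both costs a long walk and lengthens the cluster.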

585
00:34:53,820 --> 00:34:58,790
And in fact, under reasonable
probabilistic assumptions

586
00:34:58,790 --> 00:35:01,780
in terms of what your
hash functions are,

587
00:35:01,780 --> 00:35:07,850
you can say that when you have
alpha, which is essentially

588
00:35:07,850 --> 00:35:15,613
your load factor, which is
n over m less than 0.99,

589
00:35:15,613 --> 00:35:24,840
you see clusters
of size log n, OK?

590
00:35:24,840 --> 00:35:25,470
Right.

591
00:35:25,470 --> 00:35:28,520
So this is a
probabilistic argument,

592
00:35:28,520 --> 00:35:30,879
and you're assuming that you
have a hash function that's

593
00:35:30,879 --> 00:35:32,045
a pretty good hash function.

594
00:35:32,045 --> 00:35:36,680
So h prime k can be this perfect
hash function, all right?

595
00:35:36,680 --> 00:35:39,060
So there's a problem here
beyond the choice of h

596
00:35:39,060 --> 00:35:42,010
prime k, which is this hash
function that worked really

597
00:35:42,010 --> 00:35:44,080
well for chaining.

598
00:35:44,080 --> 00:35:44,630
All right?

599
00:35:44,630 --> 00:35:49,410
And the problem here is the
linear probing aspect of it.

600
00:35:49,410 --> 00:35:50,570
So what does that mean?

601
00:35:50,570 --> 00:35:53,590
If you have clusters
of theta log n,

602
00:35:53,590 --> 00:35:56,830
then your search and
your insert are not

603
00:35:56,830 --> 00:35:58,350
going to be constant
time anymore.

604
00:35:58,350 --> 00:35:58,850
Right?

605
00:35:58,850 --> 00:36:02,180
Which is bad in a
probabilistic sense.

606
00:36:02,180 --> 00:36:04,080
OK?

607
00:36:04,080 --> 00:36:06,010
So how do we fix that?

608
00:36:06,010 --> 00:36:14,590
Well, one strategy that
works reasonably well

609
00:36:14,590 --> 00:36:15,660
is called double hashing.

610
00:36:18,590 --> 00:36:23,120
And it literally
means what it says.

611
00:36:23,120 --> 00:36:26,970
You have to run a
couple of hashes.

612
00:36:26,970 --> 00:36:37,270
And so the notion of double
hashing is that you have h k i

613
00:36:37,270 --> 00:36:47,910
equals h1 k plus i h2 k mod m.

614
00:36:47,910 --> 00:36:51,310
And h1 and h2 are just
ordinary hash functions.

615
00:36:51,310 --> 00:36:53,140
OK?

616
00:36:53,140 --> 00:36:56,000
Now the first thing
that we need to do

617
00:36:56,000 --> 00:37:01,886
is figure out how we can
guarantee a permutation, right?

618
00:37:01,886 --> 00:37:03,510
Because we still have
that requirement,

619
00:37:03,510 --> 00:37:05,570
and it was OK for the
linear probing part,

620
00:37:05,570 --> 00:37:07,270
but you still have
this requirement

621
00:37:07,270 --> 00:37:09,770
that you need a permutation.

622
00:37:09,770 --> 00:37:15,770
And so those of you who
are into number theory,

623
00:37:15,770 --> 00:37:24,560
can you tell me what property,
what neat property of h2 and m

624
00:37:24,560 --> 00:37:28,150
can we ask for to
guarantee a permutation?

625
00:37:28,150 --> 00:37:30,124
Do you have a question?

626
00:37:30,124 --> 00:37:31,310
You already do.

627
00:37:31,310 --> 00:37:34,520
Do you have a question?

628
00:37:34,520 --> 00:37:35,980
AUDIENCE: [INAUDIBLE].

629
00:37:35,980 --> 00:37:36,720
PROFESSOR: [INAUDIBLE]
relatively prime.

630
00:37:36,720 --> 00:37:37,460
OK, good.

631
00:37:37,460 --> 00:37:39,320
So I figured some of
you knew the answer,

632
00:37:39,320 --> 00:37:42,010
but I've seen you before.

633
00:37:42,010 --> 00:37:42,710
Right.

634
00:37:42,710 --> 00:37:43,300
Exactly right.

635
00:37:43,300 --> 00:37:45,300
Relatively prime.

636
00:37:45,300 --> 00:37:47,950
Just hand it to Victor.

637
00:37:47,950 --> 00:37:52,600
So h2 k and m being
relatively prime,

638
00:37:52,600 --> 00:38:05,715
that implies a permutation.

639
00:38:08,592 --> 00:38:10,050
It's similar to
what we had before.

640
00:38:10,050 --> 00:38:13,217
You're multiplying this
by i. i keeps increasing,

641
00:38:13,217 --> 00:38:14,550
and you're going to roll around.

642
00:38:14,550 --> 00:38:14,900
All right?

643
00:38:14,900 --> 00:38:16,316
I mean you could
do a proof of it,

644
00:38:16,316 --> 00:38:18,220
but I'm not going to bother.

645
00:38:18,220 --> 00:38:20,720
The important thing
here is that you can now

646
00:38:20,720 --> 00:38:24,760
do something as simple as
m equals 2 raised to r,

647
00:38:24,760 --> 00:38:33,620
and h2 k for all k is odd,
and now you're in great shape.

648
00:38:33,620 --> 00:38:36,250
You can have your
array to be 2 raised

649
00:38:36,250 --> 00:38:39,090
to something, which is
what you really want.

650
00:38:39,090 --> 00:38:41,360
And you just use h2 k.

651
00:38:41,360 --> 00:38:43,390
You could even take a
regular hash function

652
00:38:43,390 --> 00:38:48,800
and truncate it to
make sure it's odd.

653
00:38:48,800 --> 00:38:50,140
You can do a bunch of things.

654
00:38:50,140 --> 00:38:52,980
There's hash functions
that produce odd values,

655
00:38:52,980 --> 00:38:54,380
and you can use that.
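
A sketch of double hashing with m = 2^r, comparing an even and an odd value of h2(k). The helper `make_odd` and the hash values 5 and 6 are assumptions for illustration; setting the low bit is just one way to get an odd step.

```python
def double_hash_probe(h1_k, h2_k, i, m):
    """Slot tried on trial i under double hashing: (h1(k) + i * h2(k)) mod m."""
    return (h1_k + i * h2_k) % m


def make_odd(x):
    # One hypothetical way to force an odd step: set the lowest bit.
    return x | 1


m = 2 ** 4
h1_k, h2_k = 5, 6  # made-up hash values for some key k

bad = [double_hash_probe(h1_k, h2_k, i, m) for i in range(m)]
good = [double_hash_probe(h1_k, make_odd(h2_k), i, m) for i in range(m)]

print(len(set(bad)))                   # even step shares a factor with m, so slots repeat
print(sorted(good) == list(range(m)))  # odd step is relatively prime to 2^r
```

An odd number shares no factor with a power of two, which is why this choice of m and h2 always yields a full permutation of the slots.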

656
00:38:54,380 --> 00:38:55,180
All right?

657
00:38:55,180 --> 00:38:58,560
And so double hashing works
fairly well in practice.

658
00:38:58,560 --> 00:39:05,290
It's a good way of getting
open addressing to work.

659
00:39:05,290 --> 00:39:08,810
And in order to prove that
open addressing actually

660
00:39:08,810 --> 00:39:14,200
works to the level at
which chaining works,

661
00:39:14,200 --> 00:39:18,380
we have to make an
assumption corresponding

662
00:39:18,380 --> 00:39:20,960
to uniform hashing.

663
00:39:20,960 --> 00:39:25,390
And I'm not going to
actually do a proof,

664
00:39:25,390 --> 00:39:27,320
but it'll be in the notes.

665
00:39:27,320 --> 00:39:33,720
But I do want to talk about
the theorem and the result

666
00:39:33,720 --> 00:39:38,320
that the theorem
implies, assuming

667
00:39:38,320 --> 00:39:40,700
you have the uniform
hashing assumption.

668
00:39:40,700 --> 00:39:43,580
And let me first
say that this is not

669
00:39:43,580 --> 00:39:49,920
the same as simple
uniform hashing, which

670
00:39:49,920 --> 00:39:54,410
talks about the independence of
keys in terms of their mapping

671
00:39:54,410 --> 00:39:55,650
to slots.

672
00:39:55,650 --> 00:39:57,980
The uniform hashing
assumption says

673
00:39:57,980 --> 00:40:11,230
that each key is
equally likely to have

674
00:40:11,230 --> 00:40:19,250
any one of the m
factorial permutations--

675
00:40:19,250 --> 00:40:21,020
so we're talking about
random permutations

676
00:40:21,020 --> 00:40:24,780
here-- as its probe sequence.

677
00:40:31,080 --> 00:40:31,650
All right?

678
00:40:31,650 --> 00:40:33,930
This is very hard
to get in practice.

679
00:40:33,930 --> 00:40:38,110
You can get pretty close
using double hashing.

680
00:40:38,110 --> 00:40:41,120
But nobody's discovered
a perfect hash function,

681
00:40:41,120 --> 00:40:44,572
deterministic hash function
that satisfies this property.

682
00:40:44,572 --> 00:40:45,780
At least not that I know of.

683
00:40:48,290 --> 00:40:49,380
So what does this imply?

684
00:40:49,380 --> 00:40:53,340
Assuming that you have
this and double hashing

685
00:40:53,340 --> 00:40:59,180
gives you this property, to a
large extent what this means is

686
00:40:59,180 --> 00:41:03,170
that if alpha is
n over m, you can

687
00:41:03,170 --> 00:41:18,280
show that the cost of operations
such as search, insert, delete,

688
00:41:18,280 --> 00:41:19,690
et cetera.

689
00:41:19,690 --> 00:41:22,740
And in particular
we talk about insert

690
00:41:22,740 --> 00:41:27,210
is less than or equal to 1
divided by 1 minus alpha.

691
00:41:27,210 --> 00:41:29,150
OK?

692
00:41:29,150 --> 00:41:33,650
So obviously this goes
as alpha tends to 1.

693
00:41:33,650 --> 00:41:40,990
As alpha tends to 1, the load
factor in the table gets large,

694
00:41:40,990 --> 00:41:44,180
and the number of
expected probes

695
00:41:44,180 --> 00:41:47,920
that you need to do when
you get an insert grows.

696
00:41:47,920 --> 00:41:52,130
And if alpha is 0.99,
you're going to, on average,

697
00:41:52,130 --> 00:41:54,200
require 100 probes.

698
00:41:54,200 --> 00:41:56,960
It's a constant number, but
it's a pretty bad constant.
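
To get a feel for the bound 1 / (1 - alpha), here is a short sketch that evaluates it at a few load factors. The function name is made up, and the numbers are the bound under the uniform hashing assumption, not measurements.

```python
def expected_probes_bound(alpha):
    """Upper bound 1 / (1 - alpha) on the expected number of probes for an insert."""
    if not 0 <= alpha < 1:
        raise ValueError("load factor must satisfy 0 <= alpha < 1")
    return 1.0 / (1.0 - alpha)


for alpha in (0.5, 0.6, 0.7, 0.9, 0.99):
    bound = expected_probes_bound(alpha)
    print(f"alpha = {alpha:.2f}: at most about {bound:.1f} probes")
```

At alpha = 0.5 the bound is 2 probes; at 0.99 it is about 100, which is the bad constant mentioned above.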

699
00:41:56,960 --> 00:41:57,460
Right?

700
00:41:57,460 --> 00:42:01,050
So you really want alpha
to be fairly small.

701
00:42:01,050 --> 00:42:03,130
And in practice it
turns out that you

702
00:42:03,130 --> 00:42:05,720
have to re-size your
open addressing table

703
00:42:05,720 --> 00:42:10,190
when alpha gets beyond
about 0.5, 0.6 or so,

704
00:42:10,190 --> 00:42:13,132
because by then you're
really in trouble.

705
00:42:13,132 --> 00:42:15,340
Remember this is an average
case we're talking about.

706
00:42:15,340 --> 00:42:18,250
All of this is using a
probabilistic assumption.

707
00:42:18,250 --> 00:42:21,780
But as you get to
high alphas, suddenly

708
00:42:21,780 --> 00:42:24,720
by the time you get to
0.7, open addressing

709
00:42:24,720 --> 00:42:28,930
doesn't work well in relation
to an equivalent table

710
00:42:28,930 --> 00:42:32,460
with the overall
number of slots that

711
00:42:32,460 --> 00:42:35,190
correspond to a
chaining table, OK?

712
00:42:35,190 --> 00:42:39,020
So open addressing
is easy to implement.

713
00:42:39,020 --> 00:42:42,170
It uses less memory because
you don't need pointers.

714
00:42:42,170 --> 00:42:47,370
But you better be careful that
your alpha stays around 0.5

715
00:42:47,370 --> 00:42:48,480
and no more.

716
00:42:48,480 --> 00:42:50,880
So all that means is
you can still use it.

717
00:42:50,880 --> 00:42:52,547
You just have to
re-size your table.

718
00:42:52,547 --> 00:42:54,130
You have slightly
different strategies

719
00:42:54,130 --> 00:42:56,430
for resizing your
table when you use open

720
00:42:56,430 --> 00:43:03,580
addressing as opposed
to chaining hash tables.

721
00:43:03,580 --> 00:43:04,350
All right?

722
00:43:04,350 --> 00:43:06,130
So that's a summary
of open addressing.

723
00:43:06,130 --> 00:43:09,392
I want to spend some time
on cryptographic hashes

724
00:43:09,392 --> 00:43:10,600
in the time that I have left.

725
00:43:10,600 --> 00:43:12,380
I guess I have a
few minutes left.

726
00:43:12,380 --> 00:43:15,940
But any questions
about open addressing?

727
00:43:15,940 --> 00:43:17,114
Yep?

728
00:43:17,114 --> 00:43:18,875
AUDIENCE: On this
delete part, what's

729
00:43:18,875 --> 00:43:21,570
going to happen if, say, you
fill the table up and then

730
00:43:21,570 --> 00:43:24,020
delete everything, and
then you start searching.

731
00:43:24,020 --> 00:43:26,143
Isn't that going to
be bad because it's

732
00:43:26,143 --> 00:43:27,601
going to search
through everything?

733
00:43:27,601 --> 00:43:29,680
PROFESSOR: So that's right.

734
00:43:29,680 --> 00:43:31,210
The bad thing about
open addressing

735
00:43:31,210 --> 00:43:34,990
is that delete isn't
instantaneous, right?

736
00:43:34,990 --> 00:43:37,890
In the sense that if you deleted
something from the linked list

737
00:43:37,890 --> 00:43:40,000
in your chaining
table, then even

738
00:43:40,000 --> 00:43:43,470
if you went to that same
thing, the chain got smaller,

739
00:43:43,470 --> 00:43:46,850
and that helps you, because
your table now has lower load.

740
00:43:46,850 --> 00:43:49,990
But there's a delay
associated with load

741
00:43:49,990 --> 00:43:52,130
when you have the
delete me flag.

742
00:43:52,130 --> 00:43:52,630
OK?

743
00:43:52,630 --> 00:43:56,610
So in some sense the alpha
that you want to think about,

744
00:43:56,610 --> 00:43:59,816
you should be careful as
to how you define alpha.

745
00:43:59,816 --> 00:44:01,190
And that's one of
the reasons why

746
00:44:01,190 --> 00:44:03,874
when you get alpha
being 0.5, 0.6

747
00:44:03,874 --> 00:44:06,290
you get into trouble, because
if you have all these delete

748
00:44:06,290 --> 00:44:09,080
me flags, they're
still hurting you.

749
00:44:09,080 --> 00:44:10,699
AUDIENCE: And when
you resize do those

750
00:44:10,699 --> 00:44:12,669
delete me flags get deleted?

751
00:44:12,669 --> 00:44:14,210
PROFESSOR: When you
completely resize

752
00:44:14,210 --> 00:44:15,720
and you redo the
whole thing, then you

753
00:44:15,720 --> 00:44:17,928
can clean up the delete me's
and turn them into nones

754
00:44:17,928 --> 00:44:22,210
because you're rehashing it.

755
00:44:22,210 --> 00:44:22,850
All right.

756
00:44:22,850 --> 00:44:24,340
So yeah, back there.

757
00:44:24,340 --> 00:44:24,840
Question?

758
00:44:24,840 --> 00:44:26,530
AUDIENCE: Yes, can you explain
how you got the equation

759
00:44:26,530 --> 00:44:28,747
that the cost of operation
insert is less than

760
00:44:28,747 --> 00:44:30,994
or equal to 1 over [INAUDIBLE].

761
00:44:30,994 --> 00:44:32,410
PROFESSOR: That's
a longish proof,

762
00:44:32,410 --> 00:44:36,630
but let me explain to
you how that comes out.

763
00:44:36,630 --> 00:44:39,370
Basically the intuition
behind the proof

764
00:44:39,370 --> 00:44:45,080
is that we're going to
assume some probability p.

765
00:44:45,080 --> 00:44:48,410
And initially you're
going to say something

766
00:44:48,410 --> 00:44:58,080
like if the table, your p--
I'll just write this out here--

767
00:44:58,080 --> 00:45:02,300
is m minus n divided by m.

768
00:45:02,300 --> 00:45:03,350
So what is that?

769
00:45:03,350 --> 00:45:06,620
Right now I have n
elements in the table,

770
00:45:06,620 --> 00:45:12,390
and I have m slots, OK?

771
00:45:12,390 --> 00:45:17,530
So the probability that my very
first trial is going to succeed

772
00:45:17,530 --> 00:45:22,360
is going to be m minus n
divided by m, because these

773
00:45:22,360 --> 00:45:24,250
are the number of empty slots.

774
00:45:24,250 --> 00:45:26,580
And assuming my
permutation argument,

775
00:45:26,580 --> 00:45:28,240
I could go into one of them.

776
00:45:28,240 --> 00:45:30,260
And so that's what I have here.

777
00:45:30,260 --> 00:45:36,010
And if you look at what this
is, this is 1 minus alpha, OK?

778
00:45:36,010 --> 00:45:38,470
And so then you run
off and you remember

779
00:45:38,470 --> 00:45:41,165
6.041 or the high school
probability course

780
00:45:41,165 --> 00:45:44,380
that you take, and you
say generally speaking,

781
00:45:44,380 --> 00:45:47,470
you're going to be no worse
than p for every trial.

782
00:45:47,470 --> 00:45:49,840
And so if you assume
the worst and say

783
00:45:49,840 --> 00:45:52,390
every trial has a
probability of success of p,

784
00:45:52,390 --> 00:45:56,040
the expected number
of trials is 1/p, OK?

785
00:45:56,040 --> 00:46:00,080
And that's how you got
the 1 over 1 minus alpha.
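
Written out, the intuition above is the expectation of a geometric random variable. This is only the sketch of the argument, assuming uniform hashing; after i failed probes of distinct occupied slots the success probability is (m - n) / (m - i), which is at least p, so treating every trial as succeeding with probability exactly p can only overestimate the cost.

```latex
p = \frac{m - n}{m} = 1 - \frac{n}{m} = 1 - \alpha,
\qquad
\mathbb{E}[\text{number of trials}]
  \le \sum_{t \ge 1} t \, p \, (1 - p)^{t - 1}
  = \frac{1}{p}
  = \frac{1}{1 - \alpha}.
```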

786
00:46:00,080 --> 00:46:04,030
So you'll see that written
in gory detail in the notes.

787
00:46:04,030 --> 00:46:05,030
All right?

788
00:46:05,030 --> 00:46:06,270
OK.

789
00:46:06,270 --> 00:46:08,370
Expected to have
a little more time

790
00:46:08,370 --> 00:46:11,380
in terms of talking about
cryptographic hashes,

791
00:46:11,380 --> 00:46:15,040
but cryptographic hashes are
not going to be on the quiz.

792
00:46:15,040 --> 00:46:19,920
This is purely material FYI.

793
00:46:19,920 --> 00:46:22,160
For your interest only.

794
00:46:22,160 --> 00:46:24,580
And again I have
some notes on it,

795
00:46:24,580 --> 00:46:28,390
but I want to give you a sense
of the other kinds of hashes

796
00:46:28,390 --> 00:46:34,370
that exist in the
world, I guess.

797
00:46:34,370 --> 00:46:39,850
And hashes that are used for
many different applications.

798
00:46:39,850 --> 00:46:42,070
So maybe the best way
of motivating this

799
00:46:42,070 --> 00:46:43,990
is through an example.

800
00:46:43,990 --> 00:46:46,880
So let's talk about
an example that

801
00:46:46,880 --> 00:46:51,280
is near and dear to every
security person's heart

802
00:46:51,280 --> 00:46:55,050
and probably to people who
aren't interested in security

803
00:46:55,050 --> 00:46:58,620
as well, which is
password storage.

804
00:46:58,620 --> 00:47:01,750
So think about how,
let's say, Unix systems

805
00:47:01,750 --> 00:47:04,650
work when you type
in your password.

806
00:47:04,650 --> 00:47:06,650
You're typing in your
password [INAUDIBLE],

807
00:47:06,650 --> 00:47:09,460
and this is true for
other systems as well,

808
00:47:09,460 --> 00:47:11,650
but you have a password.

809
00:47:11,650 --> 00:47:16,470
And my password is a permutation
of my first daughter's

810
00:47:16,470 --> 00:47:18,910
first name.

811
00:47:18,910 --> 00:47:21,040
[LAUGHTER]

812
00:47:21,040 --> 00:47:24,880
Yeah, but haven't
given it away, right?

813
00:47:24,880 --> 00:47:27,290
Haven't given it away.

814
00:47:27,290 --> 00:47:29,510
And so this password
is something

815
00:47:29,510 --> 00:47:33,430
that I'm typing in
every day, right?

816
00:47:33,430 --> 00:47:36,760
Now there's some check
that needs to happen

817
00:47:36,760 --> 00:47:40,660
to ensure that I'm typing
in the right password.

818
00:47:40,660 --> 00:47:43,610
So what is a dumb
way of doing things?

819
00:47:43,610 --> 00:47:46,210
What's a dumb way
of building systems?

820
00:47:46,210 --> 00:47:49,510
AUDIENCE: Storing [INAUDIBLE].

821
00:47:49,510 --> 00:47:52,522
PROFESSOR: This is
kind of a freebie.

822
00:47:52,522 --> 00:47:54,235
AUDIENCE: [INAUDIBLE].

823
00:47:54,235 --> 00:47:55,360
PROFESSOR: In situ hashing.

824
00:47:55,360 --> 00:47:58,710
That's better.

825
00:47:58,710 --> 00:48:00,010
So you'd store it.

826
00:48:00,010 --> 00:48:01,070
I offered the dumb way.

827
00:48:01,070 --> 00:48:03,230
So there's a perfectly
valid answer.

828
00:48:03,230 --> 00:48:06,450
So you could clearly store
this in plain text in some file

829
00:48:06,450 --> 00:48:09,720
and you could call it
slash etc slash password.

830
00:48:09,720 --> 00:48:14,200
And you could make it
readable for the world, right?

831
00:48:14,200 --> 00:48:17,290
And that'd be great, and
people do that, right?

832
00:48:17,290 --> 00:48:19,770
But what you would
rather do is you

833
00:48:19,770 --> 00:48:24,580
want to make sure that even
the sysadmin doesn't know

834
00:48:24,580 --> 00:48:27,630
my password or your
password, right?

835
00:48:27,630 --> 00:48:29,140
So how do you do that?

836
00:48:29,140 --> 00:48:32,110
Well you do that using a
cryptographic hash that

837
00:48:32,110 --> 00:48:36,400
has this interesting
property that is one way, OK?

838
00:48:36,400 --> 00:48:42,370
And what that means is
that given h of x-- OK,

839
00:48:42,370 --> 00:48:45,460
this is the value
of the hash-- it

840
00:48:45,460 --> 00:48:55,620
is very hard to find the
x such that x basically

841
00:48:55,620 --> 00:48:56,790
hashes to this value.

842
00:48:56,790 --> 00:49:02,380
So if h of x equals
let's call it q,

843
00:49:02,380 --> 00:49:08,910
then you're only given h of x.

844
00:49:08,910 --> 00:49:11,750
And so what do you do now?

845
00:49:11,750 --> 00:49:13,360
Well, it's beautiful.

846
00:49:13,360 --> 00:49:16,710
Assuming you have this one way
hash, this cryptographic hash,

847
00:49:16,710 --> 00:49:23,110
in your etc slash
password file, you

848
00:49:23,110 --> 00:49:31,780
have something like
login name, [INAUDIBLE],

849
00:49:31,780 --> 00:49:35,450
which happens to be the hash
of my daughter's first name,

850
00:49:35,450 --> 00:49:36,530
or something.

851
00:49:36,530 --> 00:49:41,000
But this is what's stored
in there and the same thing

852
00:49:41,000 --> 00:49:43,140
for a bunch of
different users, right?

853
00:49:43,140 --> 00:49:46,970
So when I log in and I type
in the actual password,

854
00:49:46,970 --> 00:49:48,670
what does the system do?

855
00:49:48,670 --> 00:49:51,120
What does the system do?

856
00:49:51,120 --> 00:49:52,130
It hashes it.

857
00:49:52,130 --> 00:50:00,300
It takes x prime, which is
the typed in password, which

858
00:50:00,300 --> 00:50:04,307
may or may not be
equal to my password,

859
00:50:04,307 --> 00:50:06,390
because somebody else might
be trying to break in,

860
00:50:06,390 --> 00:50:11,520
or I just mistyped, or forgot
my daughter's first name,

861
00:50:11,520 --> 00:50:13,250
which would be bad.

862
00:50:13,250 --> 00:50:18,700
And it will just check to see--
it doesn't need x, because it's

863
00:50:18,700 --> 00:50:23,650
stored h of x in the system,
so it doesn't need x.

864
00:50:23,650 --> 00:50:27,300
So it would just compare
against what I typed in,

865
00:50:27,300 --> 00:50:28,830
it would compute the hash again.

866
00:50:28,830 --> 00:50:33,700
And then would let me in
assuming that these things

867
00:50:33,700 --> 00:50:36,530
matched and would not
let me in if it didn't.
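
A minimal sketch of the check just described: store h(x), hash the typed password x', and compare. SHA-256 from Python's `hashlib` stands in for the one-way hash h, and the user name, password, and in-memory dictionary are made up. This illustrates the idea only; it is an assumption-laden toy, and real systems add a per-user salt and a deliberately slow password hash.

```python
import hashlib
import hmac


def h(password: str) -> str:
    """One-way hash of a password (SHA-256 chosen here purely for illustration)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


# Stand-in for the password file: login name -> h(x), never x itself.
password_file = {"alice": h("correct horse")}


def check_login(user: str, typed: str) -> bool:
    stored = password_file.get(user)
    if stored is None:
        return False
    # Hash the typed password x' and compare h(x') with the stored h(x).
    return hmac.compare_digest(stored, h(typed))


print(check_login("alice", "correct horse"))  # matching hashes: let in
print(check_login("alice", "wrong guess"))    # different hashes: refused
```

Someone who reads the file sees only the hash values, and the one-way property is what makes going from h(x) back to x hard.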

868
00:50:36,530 --> 00:50:39,060
So now we can talk about-- and
I don't have time for this,

869
00:50:39,060 --> 00:50:41,835
but you can certainly
read up on it on Wikipedia

870
00:50:41,835 --> 00:50:43,344
and a bunch in the notes.

871
00:50:43,344 --> 00:50:44,760
You can talk about
what properties

872
00:50:44,760 --> 00:50:48,240
should this hash function
have, namely one-way, collision

873
00:50:48,240 --> 00:50:50,950
resistance, in order
to solve these problems

874
00:50:50,950 --> 00:50:52,020
and other problems.

875
00:50:52,020 --> 00:50:54,770
I'm happy to stick around
and answer questions.