The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: All right, today we're going to do an exciting topic, which is hashing. Do it all in one lecture, that's the plan. See if we make it. You've probably heard about hashing. It's probably the most common data structure in computer science. It's covered in pretty much every algorithms class. But there's a lot to say about it. And I want to quickly review things you might know and then quickly get to things you shouldn't know.

We're going to talk, on the one hand, about different kinds of hash functions: fancy stuff like k-wise independence, and a technique that's been analyzed a lot lately, just in the last year, called simple tabulation hashing. And then we'll look at different ways to use a hash function to actually build data structures: chaining is the obvious one; perfect hashing you may have seen; linear probing is another obvious one, but it has only been analyzed recently; and cuckoo hashing is a newer one that has its own fun features. So that's where we're going to go today.

Remember, the basic idea of hashing is that you want to reduce a giant universe to a reasonably small table. So I'm going to call our hash function h, and the universe the integers 0 up to u minus 1; I'll denote the universe by capital U. And we have a table that we'd like to store. I'm not going to draw the table yet, because what the table actually is is the second half of the lecture. We'll just think of it as indices 0 through m minus 1, where m is the table size. And probably we want m to be about n, where n is the number of keys we're actually storing in the table; but that's not necessarily seen at this level. So that's a hash function: h maps the universe {0, ..., u-1} to the slots {0, ..., m-1}.
m is going to be much smaller than u. We're just hashing integers here; if you don't have integers, you map whatever space of things you have to integers. That's pretty much always possible.

Now, the best-case scenario would be to use a totally random hash function. What does totally random mean? The probability, over the choice of hash function, that any key x maps to any particular slot (these table entries are called slots) is 1/m, and this is independent for all x. In symbols: Pr[h(x) = t] = 1/m for every key x and slot t, independently over distinct keys.

So this would be ideal: you choose each h(x), for every possible key, randomly and independently. Then that gives you perfect hashing; not perfect in the technical sense, sorry, it gives you ideal hashing. Perfect means no collisions, and this actually might have collisions; there's some chance that two keys hash to the same value. We call this totally random. This is the ideal thing that we're trying to approximate with reasonable hash functions.

Why is this bad? Because it's big. If you actually could flip all these coins, you'd need to write down u log m bits of information, which is generally way too big. We can't afford anything proportional to u; the whole point is we want to store n items, much smaller than u. Surprisingly, this concept will still be useful, so we'll get there.

Another system you've probably seen is universal hashing. This is a constraint on a family of hash functions. Ideally you'd choose h uniformly at random from all hash functions; that would give you the probability above. Instead, we're going to choose from a much smaller set of hash functions, so you can encode the hash function in many fewer bits. And the property we want from that hash family is that if you look at the probability that two keys collide, you get roughly what you'd expect from totally random: you would hope for 1/m.
Once you pick one key, the probability that another key hits the same slot would be 1/m. But we'll allow a constant factor, and also allow it to be smaller; that gives us some slop. So universal means Pr[h(x) = h(y)] = O(1/m) for all distinct keys x and y. You don't have to allow the slop; if you insist on exactly 1/m, it's called strongly universal.

That's universal. And universal is enough for a lot of things that you've probably seen, but not enough for other things. So here are some examples of hash functions that are universal, which again, you may have seen.

You can take a random integer a and multiply it by x, integer multiplication (you could also do this as a vector dot product, but here I'm doing it as multiplication), modulo a prime p, where the prime has to be bigger than u (bigger or equal is fine), and then take the whole thing modulo m: h(x) = ((a x) mod p) mod m. Now, this is universal, but it loses a factor of 2 here, I believe, in general, because you take things modulo a prime and then you take things modulo whatever your table size is. If you set your table size to p, that's great; I think you get a factor of 1. If you don't, you're essentially losing possibly half the slots, depending on how m and p are related to each other. So it's OK, but not great.

It's also considered expensive, because you have to do all this division, which people don't like to do. So there's a fancier method: h(x) = (a x) >> (log u - log m), that is, a times x shifted right by log u minus log m, keeping only the low-order log u bits of the product, which is exactly what overflowing machine multiplication does. This is for when m and u are powers of 2, which is the case we kind of care about. Usually your universe is of size 2 to the word size of your machine, 2^32 or 2^64, however big your integers are, so it's usually a power of 2. And it's fine to make your table a power of 2; we're probably going to use table doubling. So you just multiply and then take the high-order bits, that's what this is saying. This is a more recent method, from 1997, whereas the mod-p one goes back to 1979. So '79, '97. And it's also universal. There's a lot of universal hash functions.
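As a quick aside, here is a sketch of the two families just described, in Python. The concrete parameters (a 64-bit universe and this particular Mersenne prime) are my illustrative assumptions, not from the lecture.

```python
import random

P = (1 << 61) - 1   # a prime >= u; fine for any universe up to 2**61

def make_mod_prime_hash(m):
    """h(x) = ((a*x) mod p) mod m for random a: the 1979-style family."""
    a = random.randrange(1, P)
    return lambda x: ((a * x) % P) % m

def make_multiply_shift_hash(log_u, log_m):
    """h(x) = (a*x mod 2**log_u) >> (log_u - log_m) for random odd a:
    the 1997 multiply-shift family; u and m must be powers of 2."""
    a = random.randrange(1 << log_u) | 1          # random odd multiplier
    mask = (1 << log_u) - 1                       # emulate overflowing multiply
    return lambda x: ((a * x) & mask) >> (log_u - log_m)

h = make_multiply_shift_hash(64, 10)   # hashes 64-bit keys into 2**10 slots
```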
I'm not going to list them all; I'd rather get to stronger properties than universality. So the next one is called k-wise independence. This is harder to obtain, and it implies universality. We want a family of hash functions such that Pr[h(x_1) = t_1 and ... and h(x_k) = t_k] = O(1/m^k) for any k distinct keys x_1, ..., x_k and any slots t_1, ..., t_k.

Maybe let's start with just pairwise independence, k = 2. Then what this is saying is the probability, over your choice of hash function, that the first key maps to this slot t_1 and the second key maps to some other slot t_2, for any two keys x_1 and x_2. If your function were totally random, each of those events happens with probability 1/m and they're independent, so you get 1/m^k, or 1/m^2 for k = 2.

Even that is different from saying the probability of two keys colliding is 1/m. This would imply that. But with mere universality there could still be some co-dependence between x and y; here there essentially can't be, other than the constant factor. Pairwise independence means every two keys are independent; k-wise means every k keys are independent, up to the constant factor. And this is for distinct x_i's: obviously, if two of them are equal, they're very likely to hash to the same slot, so you've got to forbid that.

OK, so an example of such a hash function. Before, we just took a product. In general, you can take a polynomial of degree k - 1: h(x) = ((a_{k-1} x^{k-1} + ... + a_1 x + a_0) mod p) mod m; evaluate that mod p, and then, if you want, reduce modulo your table size. In particular, for k = 2 we actually have to do some work: the function a x is not pairwise independent, it is universal. If you make it a x + b for random a and b, then this becomes pairwise independent. If you want three-wise independent, you need a x^2 + b x + c for random a, b, and c. These coefficients are arbitrary numbers between 0 and p - 1, I guess. OK. This is also old, 1981.
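Here is a sketch of that polynomial construction in Python (the prime, again, is my illustrative assumption; this gives k-wise independence up to the constant-factor slop from the final mod m):

```python
import random

P = (1 << 61) - 1   # a prime >= u

def make_kwise_hash(k, m):
    """h(x) = ((a_{k-1} x^{k-1} + ... + a_1 x + a_0) mod p) mod m
    with random coefficients; k = 2 gives the pairwise a*x + b family."""
    coeffs = [random.randrange(P) for _ in range(k)]   # a_0, ..., a_{k-1}
    def h(x):
        v = 0
        for a in reversed(coeffs):    # Horner's rule, everything mod p
            v = (v * x + a) % P
        return v % m
    return h

h = make_kwise_hash(2, 1024)   # pairwise-independent hashing into 1024 slots
```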
Wegman and Carter introduced these two notions in a couple of different papers, so this is an old idea. It is, of course, expensive, in that we pay order k time to evaluate the polynomial; also, there are a lot of multiplications, and you have to do everything modulo p. So a lot of people have worked on more efficient ways to do k-wise independence, and there are two main results on this. Both of them use u^epsilon space, which is not great. One of them has query time depending on k; it's reasonably practical, with experiments to back it up. The other gets constant query time for logarithmic independence.

The first one is actually the later one; it's by Thorup and Zhang. The other is by Siegel. Both papers are from 2004, so fairly recent. Siegel's takes a fair amount of space, and the paper proves that to get constant query time for logarithmic independence, you need quite a bit of space to store your hash function. Keep in mind, the polynomial hash functions only take about k log u bits to store, which is k words of space; very small. Here you need space depending on n, which is kind of annoying, especially if you want to be dynamic. But statically you can get constant query time with logarithmic-wise independence, if you pay a lot in space.

There are more practical methods too; the Thorup-Zhang one is especially practical for k = 5, which is a case that we'll see is of interest. Cool. And this much space is necessary if you want constant query time. We'll see that log n wise independence is the most we'll ever require in this class, and as far as I know, in hashing in general; so you don't need to worry about more than log-wise independence.

All right, one more hashing scheme. It's called simple tabulation hashing. This is a simple idea. It also goes back to '81, but it's just been analyzed last year, so there are a lot of results to report on it.
The idea is: take your integer and split it up, in some base, so that there are exactly c characters, where c is a constant. Then build a totally random lookup table on each character. A totally random hash function over the whole universe is the thing we couldn't afford, but we're just going to do it per character. So there are going to be c of these tables, and each of them has size u^(1/c). Essentially we're getting u^epsilon space, which is similar to those space bounds; again, not great, but it's a really simple hash function.

The hash function is just: take your first table applied to the first character, XOR that with the second table applied to the second character, and so on through all the characters: h(x) = T_1(x_1) XOR T_2(x_2) XOR ... XOR T_c(x_c).

The nice thing about this is it's super simple. You can imagine it being done in one instruction on a fancy CPU, if you convince people this is a cool enough instruction to have; it's very simple to implement circuit-wise. But in our model you have to do all these operations separately, so it takes order c time to compute.

One thing that's known about it is that it's three-wise independent, so it does kind of fit in this framework. But three-wise independence is not very impressive; a lot of the results we'll see require log n wise independence. The cool thing is that, roughly speaking, simple tabulation hashing is almost as good as log n wise independence in all the hashing schemes that we care about. And we'll get there, exactly what that means.
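Here is a minimal sketch of simple tabulation hashing, under the assumption (mine, for concreteness) of 64-bit keys split into c = 8 one-byte characters, so each table has u^(1/c) = 256 entries:

```python
import random

C, BITS = 8, 8   # c = 8 characters of 8 bits each, for a 64-bit universe

# c totally random tables, each of size 2**BITS, filled once up front
TABLES = [[random.getrandbits(64) for _ in range(1 << BITS)] for _ in range(C)]

def tabulation_hash(x):
    h = 0
    for i in range(C):
        h ^= TABLES[i][(x >> (BITS * i)) & ((1 << BITS) - 1)]  # XOR of c lookups
    return h
```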
So that was my overview of some hash functions, these two boards. Next we're going to look at basic chaining, then perfect hashing. How many people have seen perfect hashing, just to get a sense? More than half, maybe 2/3. All right, I should do this really fast.

Chaining is the first kind of hashing you usually see. You have your hash function, which is mapping keys into slots. If you have two keys that go to the same slot, you store them as a linked list; if you don't have anything in a slot, it's blank. This is very easy.

If you look at a particular slot t and call the length of the chain you get there C_t, you can look at the expected length of that chain. In general, it's just going to be the sum, over all keys y, of the probability that y maps to that slot: E[C_t] = sum over keys y of Pr[h(y) = t]. This is just writing C_t as a sum of indicator random variables and applying linearity of expectation; the expectation of each indicator variable is its probability.

So here we just need to compute the probability that each key goes to each slot, as long as your hash function is uniform, meaning each slot is equally likely to be hashed to. Well, actually, we're looking at a particular slot, so we're essentially using universality here: once we fix the one slot we care about, say t = h(y) for the key y we're searching for, then this is universality. By universality, each term is O(1/m), and so the expected chain length is O(n/m), usually called the load factor. What we care about is that this is constant for m = Theta(n), and you use table doubling to keep m = Theta(n). Boom, you've got expected chain length constant.
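For concreteness, here is a minimal chaining dictionary in Python (the hash function is passed in, and table doubling is omitted for brevity; the names here are mine):

```python
class ChainedHashTable:
    def __init__(self, m, h):
        self.m, self.h = m, h
        self.slots = [[] for _ in range(m)]      # one chain per slot

    def insert(self, key, value):
        chain = self.slots[self.h(key) % self.m]
        for i, (k, _) in enumerate(chain):
            if k == key:                         # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))               # otherwise extend the chain

    def search(self, key):
        for k, v in self.slots[self.h(key) % self.m]:
            if k == key:
                return v
        return None                              # not in the dictionary
```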
But in the theory world, expected is a very weak bound. What we want are high-probability bounds. So let me tell you a little bit about high-probability bounds; this you may not have seen as much.

Let's start with: if your hash function is totally random, then your chain lengths will be O(log n / log log n), with high probability. They are not constant. In fact, you expect the maximum chain to be at least log n / log log n; I won't prove that here. Instead, I'll prove the upper bound. So the claim is that, while in expectation each chain is constant, the maximum is essentially high.

Actually, let's talk about variance a little bit; sorry, I'm getting distracted. You might say, oh OK, expectation is nice, let's look at variance. It turns out the variance is constant for these chains. There are various formulas for variance, but in particular the one I want to use is Var[C_t] = E[C_t^2] - E[C_t]^2, which writes it in terms of some expectations. Now, the expected chain length we know is constant; you square it, it's still constant, so that part is sort of irrelevant. The interesting part is: what is the expected squared chain length? That is going to depend on exactly which hash function you use; let's analyze it for totally random. In general, we just need a certain kind of symmetry here.

Instead of one expected squared chain length, what I'd like to do is sum over all of them; this is going to be easier to analyze: E[C_t^2] = (1/m) times the sum over slots t of E[C_t^2]. If I sum over all chains and then divide, taking the average, I get the expected squared chain length of any individual slot, as long as your hash function is symmetric, with all slots equally likely; and you could basically apply a random permutation to your keys to make this true if it isn't already.

Now, the sum of C_t^2 over all slots is just, up to constant factors, the number of pairs of keys that collide. So you can forget about slots; this is just the sum, over all pairs of keys i and j, of the probability that x_i hashes to the same slot as x_j. And each term is something we know by universality: it's O(1/m). The number of pairs is n^2, so we get n^2 times 1/m, times the 1/m out front; for m = Theta(n), this is constant.

So the variance is actually small. But it's not a good indicator of how big our chains can get, because still, with high probability, one of the chains will have length log n / log log n; it's just that a typical one won't. Let's prove the upper bound. This uses Chernoff bounds, which are tail bounds, essentially. I haven't properly defined "with high probability"; it's probably good to review that.
"With high probability" means with probability at least 1 - 1/n^c, where I get to choose any constant c. So high probability means polynomially small failure probability. This is good because if you do this polynomially many times, the property remains true; you just increase your constant by however many times you're going to use it.

So we prove these kinds of bounds using Chernoff, which looks something like this: Pr[C_t >= c mu] <= e^{(c-1) mu} / (c mu)^{c mu}. Here mu is the mean, and the mean we've already computed is constant: the expectation of C_t is constant. So we want C_t to be not much larger than that. This says the probability that C_t is at least some factor c times the mean (c doesn't have to be constant here; sorry, maybe not great terminology) is at most this exponential. Which is a bit annoying, or a bit ugly.

But in particular, if we plug in c = log n / log log n, using that as our factor, which is what we're concerned about here, we get that this probability is essentially dominated by the bottom term, which becomes (log n / log log n)^{log n / log log n}; so essentially we get 1 over that. And if you take that bottom part and put it into the exponent, you pick up a log log n: this is something like 1 / 2^{(log n / log log n) log log n}. The log log n's cancel, and so this is basically 1/n. If you put a constant factor in here, you get a constant in the exponent there, so you can get failure probability 1/n^c.

So you get this with-high-probability bound as long as you go up to a chain length of log n / log log n; it's not true otherwise. This is kind of depressing, and it's one reason we will turn to perfect hashing: some of the chains are long. But there is a sense in which this is not so bad, so let me go to that. I kind of want all these boards.
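If you want to see this max-chain behavior concretely, here is a quick empirical check (my addition, not from the lecture): throw n keys into m = n slots totally at random and report the longest chain, which grows slowly, roughly like log n / log log n, rather than staying constant.

```python
import random
from collections import Counter

for n in (10**3, 10**4, 10**5, 10**6):
    # totally random hashing of n keys into n slots
    longest = max(Counter(random.randrange(n) for _ in range(n)).values())
    print(n, longest)
```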
429 00:26:29,432 --> 00:26:32,841 AUDIENCE: [INAUDIBLE] 430 00:26:33,611 --> 00:26:34,694 ERIK DEMAINE: What's that? 431 00:26:34,694 --> 00:26:36,880 AUDIENCE: [INAUDIBLE] 432 00:26:36,880 --> 00:26:38,590 ERIK DEMAINE: Since when is log n long. 433 00:26:38,590 --> 00:26:39,345 Well-- 434 00:26:39,345 --> 00:26:40,960 AUDIENCE: [INAUDIBLE] 435 00:26:40,960 --> 00:26:43,100 ERIK DEMAINE: Right, so I mean, in some sense 436 00:26:43,100 --> 00:26:44,599 the name of the game here is we want 437 00:26:44,599 --> 00:26:46,300 to beat binary search trees. 438 00:26:46,300 --> 00:26:48,349 I didn't even mention what problem we're solving. 439 00:26:48,349 --> 00:26:50,140 We're solving the dictionary problem, which 440 00:26:50,140 --> 00:26:52,840 is sort of bunch of keys, insert delete, 441 00:26:52,840 --> 00:26:54,520 and search is now just exact search. 442 00:26:54,520 --> 00:26:56,020 I want to know is this key in there? 443 00:26:56,020 --> 00:26:59,660 If so, find some data associated with it. 444 00:26:59,660 --> 00:27:02,920 Which is something binary search trees could do, n log n time. 445 00:27:02,920 --> 00:27:04,750 And we've seen various fancy ways 446 00:27:04,750 --> 00:27:06,420 to try to make that better. 447 00:27:06,420 --> 00:27:08,170 But in the worst case, you need log n time 448 00:27:08,170 --> 00:27:09,253 to do binary search trees. 449 00:27:09,253 --> 00:27:12,940 We want to get to constant as much as possible. 450 00:27:12,940 --> 00:27:17,470 We want the hash function to be evaluatable in constant time. 451 00:27:17,470 --> 00:27:21,280 We want the queries to be done in constant time. 452 00:27:21,280 --> 00:27:24,550 If you have a long chain, you've got to search the whole chain 453 00:27:24,550 --> 00:27:29,290 and I don't want to spend log n over log log n. 454 00:27:29,290 --> 00:27:31,250 Because I said so. 455 00:27:31,250 --> 00:27:34,220 Admittedly, log n over log log n is not that big. 456 00:27:34,220 --> 00:27:36,470 And furthermore, the following holds. 457 00:27:36,470 --> 00:27:39,230 This is a sense in which it's not really 458 00:27:39,230 --> 00:27:41,450 log n over log log n. 459 00:27:41,450 --> 00:27:46,230 If we change the model briefly and say, 460 00:27:46,230 --> 00:27:50,300 well, suppose I have a cache of the last log n items 461 00:27:50,300 --> 00:27:53,420 that I searched for in the hash table. 462 00:27:56,630 --> 00:28:00,039 Then if you're totally random, which 463 00:28:00,039 --> 00:28:01,580 is something we assumed here in order 464 00:28:01,580 --> 00:28:04,320 to apply the Chernoff bound, we needed that everything 465 00:28:04,320 --> 00:28:06,750 was completely random. 466 00:28:06,750 --> 00:28:12,050 Then you get a constant amortized bound per operation. 467 00:28:15,110 --> 00:28:16,610 So this is kind of funny. 468 00:28:16,610 --> 00:28:20,550 In fact, all it's saying this is easy to prove. 469 00:28:20,550 --> 00:28:22,260 And it's not yet in any paper. 470 00:28:22,260 --> 00:28:30,440 It's on Mihai Petrescu's blog from 2011. 471 00:28:30,440 --> 00:28:31,514 All right, we're here. 472 00:28:31,514 --> 00:28:32,930 We're looking at different chains. 473 00:28:32,930 --> 00:28:35,420 So you access some chain, then you access another chain, 474 00:28:35,420 --> 00:28:37,640 then you access another chain. 475 00:28:37,640 --> 00:28:39,740 If you're unlucky, you'll hit the big chain 476 00:28:39,740 --> 00:28:41,930 which cost log n over over log log n 477 00:28:41,930 --> 00:28:46,190 to touch, which is expensive. 
But you could then put all those items into the cache, and if you happen to keep probing there, it will be fast. In general, you do a bunch of searches: first I search for x_1, then I search for x_2, x_3, and so on. Cluster those into groups of Theta(log n): look at the first log n searches, then the next log n searches, and analyze each group separately. We're going to amortize over each window of log n operations.

So look at a batch of Theta(log n) operations, say just log n of them. I claim that the number of keys that collide with them is Theta(log n), with high probability. If this is true, then it's constant each: if I can do log n operations by visiting O(log n) total chain items, with high probability, then I just charge one each; and so, amortized over this little log n window, sort of smoothing the cost, I get constant amortized per operation, with high probability now, not just in expectation.

Why is this true? It's essentially the same argument. This is normally called a balls-in-bins argument: you're throwing balls, which are your keys, randomly into bins, which are your slots. The expectation is constant for any one bin, and any one of them could go up to log n / log log n with high probability. Here, we're looking at log n different slots and taking the sum of the balls that fall into each of those slots. In expectation that's log n, because it's constant each, and expectation is linear when you take the sum over these log n bins. So the expectation is log n. Then you apply Chernoff again, except now the mean is log n, and it suffices to take c = 2. We can run through this: the mean mu is Theta(log n); we expect there to be log n items that fall into these log n bins.
And so you just plug c = 2 into the Chernoff bound, and you get e^{log n} (which is kind of weird) over (2 log n)^{2 log n}. That bottom term is like n^{2 log log n}. So it's big, way bigger than the top, so the top essentially disappears. And in particular, the bound is smaller than 1/n^c for any constant c, because 2 log log n is bigger than any constant. So you're done. This is just saying that the probability of being more than twice the mean is very, very small. So, with high probability, only O(log n) items fall in these log n bins; you just amortize, boom, constant.

This is kind of a weird notion; I've never actually seen "amortized with high probability" in a paper before. This is the first time it seems like a useful concept. So if you think log n / log log n is bad, this is a sense in which it's OK; don't worry about it.

All right, but if you did worry about it, the next thing you'd do is perfect hashing. Perfect hashing is really just an embellishment of chaining. It's also called FKS hashing, after the authors Fredman, Komlós, and Szemerédi; it's from 1984, so it's an old idea. You just take chaining, but instead of storing your chains as linked lists, you store them in hash tables. Simple idea, with one clever trick: you store each chain in a big hash table, of size Theta(C_t^2).

Now, this looks like a problem, because that's going to be quadratic space in the worst case, if everybody hashes to the same chain. But we know that chains are pretty small with high probability, so it turns out this is OK. The space is the sum over t of Theta(C_t^2), and that's something we actually computed already, except I erased it (how convenient of me; it was right here, I can still barely read it) when we computed the variance. We can do it again, it's not really that hard: this is the number of pairs of keys that collide.
There are n^2 pairs, and each of them has probability O(1/m) of colliding if you have a universal hash function. So this is O(n^2 / m), which, if m is within a constant factor of n, is linear. So, linear space in expectation: the expected amount of space is linear. I won't try to do a with-high-probability bound here.

What else can I say? You have to play a similar trick when you're actually building these inner hash tables. All right, so why do we use size C_t^2? Because of the birthday paradox. If you have a hash table of size Theta(C_t^2), then with constant probability you don't get any collisions. Why? Consider the expected number of collisions among the C_t keys. There are about C_t^2 pairs, and each pair, if you're using universal hashing, collides with probability O(1 / C_t^2), because the collision probability is 1 over the table size, and the table size is Theta(C_t^2). So the expected number of collisions is constant. And if we set the constants right (I get to set this Theta to be whatever I want), I can make it less than 1/2.

If the expected number of collisions is less than 1/2, then the probability that the number of collisions is 0 is at least 1/2. This is Markov's inequality, in particular: the probability that the number of collisions is at least 1 is at most the expectation divided by 1, which is 1/2.

So you try to build the table. If you have 0 collisions, you're happy, and you go on to the next one. If you don't, just try again with a new hash function. You're essentially flipping a coin each time, so after an expected constant number of trials you get heads, and then you've built the table with 0 collisions. We always want the inner tables to be collision-free.

So in expected linear time you can build this structure, and it has expected linear space. In fact, if it doesn't come out to linear space, you can just try the whole thing over again.
So in expected linear time, you'll build a guaranteed linear-space structure. The nice thing about perfect hashing is that a query does two hash dereferences, and that's it. So the query is constant deterministic: queries are deterministic, and only the construction is randomized. I didn't talk about updates, I talked about building; the construction here is randomized, queries are constant deterministic.

Now, you can make this dynamic in pretty much the obvious way. Say I want to insert. It's essentially two-level hashing: first you figure out where the key goes in the big hash table, then you find the corresponding chain, which is now a hash table, and you insert into that inner hash table. So it's the obvious thing. The trouble is you might get a collision in that inner hash table. If you get a collision, you rebuild that inner hash table. The probability of a collision happening is small, and it remains small, by this argument, because the expected number of collisions stays small unless the chain gets really big.

So if the chain grows by a factor of 2, you have to rebuild the table; and because the table size is C_t^2, when the chain length grows by a factor of 2, you rebuild the table to be a factor of 4 larger. In general, you maintain that each table is sized for roughly the correct chain length, within a constant factor, and you do doubling and halving in the usual way, like a B-tree; or I guess it's table doubling, really. And it will be constant amortized expected per operation.

There's also a fancy way to make this constant with high probability per insert and delete, which I have not read, by Dietzfelbinger and Meyer auf der Heide, 1990. So, it's easy to make this expected amortized; with more effort you can make it with high probability per operation, but that is trickier. Cool.
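Here is a sketch of the static FKS construction in Python, as just described: a universal outer hash into n chains, then a collision-free table of size Theta(C_t^2) per chain, re-picking the inner hash until it is collision-free (expected O(1) tries, by the Markov argument above). The prime and the constant 4 are my illustrative assumptions.

```python
import random

P = (1 << 61) - 1   # a prime >= u

def universal(m):
    """A simple universal hash into m slots."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda x: ((a * x + b) % P) % m

def build_fks(keys):
    n = len(keys)
    outer = universal(n)
    chains = [[] for _ in range(n)]
    for x in keys:                               # first level: plain chaining
        chains[outer(x)].append(x)
    inner_tables = []
    for chain in chains:
        size = max(1, 4 * len(chain) ** 2)       # Theta(ct^2) slots per chain
        while True:                              # retry until collision-free
            inner = universal(size)
            slots = [None] * size
            for x in chain:
                if slots[inner(x)] is not None:  # collision: pick a new inner hash
                    break
                slots[inner(x)] = x
            else:
                break                            # no collisions, keep this table
        inner_tables.append((inner, slots))
    return outer, inner_tables

def query(fks, x):
    outer, inner_tables = fks
    inner, slots = inner_tables[outer(x)]
    return slots[inner(x)] == x                  # two probes, deterministic
```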
I actually skipped one thing with chaining, which I wanted to talk about. The expected-chain-length analysis was fine; it just used universality. But the cache analysis assumed totally random, and the high-probability chain-length analysis assumed totally random. What about real hash functions? We can't use totally random. What about universal, k-wise independent, or simple tabulation hashing, just for chaining?

And similar things hold for perfect hashing, I think; I'm not sure they're all known. Oh, sorry: for perfect hashing, in expectation everything is fine with just universal; we've already done that with universality. What about chaining? How big can the chains get? I said log n / log log n with high probability, but our analysis used a Chernoff bound, which holds for Bernoulli trials; it was only valid for totally random hash functions. It turns out the same bound holds if you have a (log n / log log n)-wise independent hash function.

So this is kind of annoying: for this to be true, you need a lot of independence, and it's hard to get log n wise independence. There is Siegel's way to get constant evaluation time, but it needs a lot of space, which is not so thrilling, and it's also kind of complicated. If you don't mind the space but you just want something simpler, you can use simple tabulation hashing. For both of these, the same chain analysis turns out to work. The independence result is fairly old, from 1995; the tabulation result is from last year. So if you just use simple tabulation hashing, which still takes a lot of space, u^epsilon, but is very simple to implement, then the chain lengths are as you expect them to be. And I believe that carries over to the caching argument, but I haven't checked it.

All right. Great, I think we're now happy: we've talked about real hash functions for chaining and perfect hashing. The next thing we're going to talk about is linear probing. I mean, in some sense, we have good theoretical answers now.
We can do constant expected amortized updates, even with constant deterministic queries. But we're greedy, and people like to implement all sorts of different hashing schemes. Perfect hashing is pretty rare in practice. Why? I guess because you have to hash twice instead of once, and that's just more expensive. So what about the simpler hashing schemes? Simple tabulation hashing is nice and simple, but what about linear probing? That's really simple.

Linear probing is either the first or the second hashing scheme you learn. You store things in a table, and the hash function tells you where to go. If that's full, you just go to the next spot; if that's full, you go to the next spot; and so on until you find an empty slot, and then you put x there. So if some y and z are sitting there, x ends up a couple of slots over.

Everyone knows linear probing is bad, because the rich get richer. It's like the parking lot problem: big runs of elements are more likely to get hit, so they grow even faster and get worse. So you should never use linear probing. Has everyone learned that? It's all false, however. Linear probing is actually really good.

The first indication is that it's really good in practice. There's a small experiment by Mihai Pătrașcu, who was an undergrad and PhD student here; he's at AT&T now. He was doing some experiments and found that, in practice, on a network router, linear probing costs 10% more time than a memory access. So, basically free. Why? You just set m to be 2 times n, or (1 + epsilon) times n, whatever. It actually works really well, and I'd like to convince you that it works really well.
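For reference, here is a minimal sketch of linear probing insert and search (no deletion and no resizing, and the hash function is passed in; keep m comfortably above n, say m = 2n):

```python
class LinearProbingTable:
    def __init__(self, m, h):
        self.m, self.h = m, h
        self.slots = [None] * m

    def insert(self, key):
        t = self.h(key) % self.m
        while self.slots[t] is not None and self.slots[t] != key:
            t = (t + 1) % self.m        # scan right until a free slot
        self.slots[t] = key

    def search(self, key):
        t = self.h(key) % self.m
        while self.slots[t] is not None:
            if self.slots[t] == key:
                return True
            t = (t + 1) % self.m        # a run ends at the first empty slot
        return False
```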
Now, first let me tell you some things. The idea that linear probing works really well is old: for a totally random hash function, you get constant time per operation, and Knuth actually showed this first, in 1962, in a technical report. The answer ends up being about 1/epsilon^2 for m = (1 + epsilon) n. Now you might say 1/epsilon^2 is really bad, and there are other schemes that achieve 1/epsilon, which is better. But what's a little bit of space, right? Just set epsilon to 1 and you're done. So linear probing was bad back when we were really tight on space; when you can afford a factor of 2, linear probing is great. That's the bottom line.

Now, that's for totally random, which is not so useful. What about all these other hash functions? Like universal: it turns out that with merely a universal hash function, linear probing can be really, really bad, and that's why it gets a bad rap. But some good news. The first result was for log n wise independence; that is extremely strong, but it implies constant expected time per operation. Not very exciting. The big breakthrough was in 2007: five-wise independence is enough. And this is why that Thorup-Zhang paper was focusing in particular on the case of k = 4; actually, they were doing k = 4, but they solved 5 at the same time. So theirs was a very highly optimized, practical way to get five-wise independence, admittedly with some space, but it's pretty cool. So five-wise independence is enough to get constant expected time. I shouldn't write O(1), because I'm not writing the dependence on epsilon here; I don't know exactly what it is, but it's some constant depending on epsilon.

And this turns out to be tight: there are four-wise independent hash functions, including, I think, the polynomial ones that we did, that are really bad, as bad as binary search trees; you don't get constant expected time. So you really need five-wise independence. It's kind of weird, but it's true. And the other fun fact is that simple tabulation hashing also achieves constant.
808 00:47:30,940 --> 00:47:34,200 And here it's known that it's also 1 over epsilon squared. 809 00:47:34,200 --> 00:47:37,180 So simple tabulation hashing is just as good 810 00:47:37,180 --> 00:47:37,680 as totally random. 811 00:47:37,680 --> 00:47:39,930 Which is nice because again, this is simple. 812 00:47:39,930 --> 00:47:46,350 Takes a bit of space but both of these have that property. 813 00:47:46,350 --> 00:47:48,600 And so these are good ways to use 814 00:47:48,600 --> 00:47:50,026 linear probing in particular. 815 00:47:50,026 --> 00:47:51,650 So you really need a good hash function 816 00:47:51,650 --> 00:47:53,100 for linear probing to work out. 817 00:47:53,100 --> 00:47:56,100 If you use a universal hash function like a times 818 00:47:56,100 --> 00:48:00,270 x mod p mod m it will fail. 819 00:48:00,270 --> 00:48:03,014 But if you use a good hash function, which we're now 820 00:48:03,014 --> 00:48:03,930 getting to the point-- 821 00:48:03,930 --> 00:48:07,101 I mean, this is super simple to implement. 822 00:48:07,101 --> 00:48:08,130 It should work fine. 823 00:48:08,130 --> 00:48:09,755 I think it would be a neat project to take 824 00:48:09,755 --> 00:48:12,930 a Python or something that has hash tables deep inside it, 825 00:48:12,930 --> 00:48:13,580 and replace-- 826 00:48:13,580 --> 00:48:17,250 I think they use quadratic probing and universal hash 827 00:48:17,250 --> 00:48:18,750 functions. 828 00:48:18,750 --> 00:48:21,810 If you instead use linear probing and simple tabulation 829 00:48:21,810 --> 00:48:25,812 hashing, it might do the same, might do better, I don't know. 830 00:48:25,812 --> 00:48:26,520 It's interesting. 831 00:48:26,520 --> 00:48:28,240 It would be a project to try out. 832 00:48:31,690 --> 00:48:33,930 Cool. 833 00:48:33,930 --> 00:48:35,110 Well, I just quoted results. 834 00:48:35,110 --> 00:48:39,790 What I'd like to do is prove something like this to you. 835 00:48:39,790 --> 00:48:43,000 Totally random hash functions imply some constant expected. 836 00:48:43,000 --> 00:48:45,800 I won't try to work out the dependence on epsilon 837 00:48:45,800 --> 00:48:48,790 because it's actually a pretty clean proof, it looks nice. 838 00:48:51,750 --> 00:48:53,030 Very data structures-y. 839 00:48:59,810 --> 00:49:01,440 I'm not going to cover Knuth's proof. 840 00:49:01,440 --> 00:49:07,650 I'm essentially covering this proof. 841 00:49:07,650 --> 00:49:09,360 In this paper five-wise independence 842 00:49:09,360 --> 00:49:10,830 implies constant expected. 843 00:49:10,830 --> 00:49:13,320 They re-prove the totally random case 844 00:49:13,320 --> 00:49:15,880 and strengthen it, and analyze the independence they need. 845 00:49:18,690 --> 00:49:26,400 Let's just do totally random implies constant 846 00:49:26,400 --> 00:49:29,940 expected for linear probing. 847 00:49:29,940 --> 00:49:32,520 We obviously know how to do constant expected already 848 00:49:32,520 --> 00:49:33,840 with other fancy techniques. 849 00:49:33,840 --> 00:49:37,230 But linear probing seems really bad. 850 00:49:37,230 --> 00:49:41,580 Yet I claim, not so much. 851 00:49:41,580 --> 00:49:47,822 And we're going to assume m is at least 3 times n. 852 00:49:47,822 --> 00:49:49,530 That will just make the analysis cleaner. 853 00:49:49,530 --> 00:49:52,440 But it does hold for 1 plus epsilon. 854 00:49:52,440 --> 00:49:53,760 OK, so here's the idea. 855 00:49:53,760 --> 00:49:58,890 We're going to take our array, our hash table, it's an array.
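(Since the lecture calls simple tabulation hashing super simple to implement, here is a hedged sketch in Python: split the key into c characters, index one table of random values per character position, and XOR the results. The function and parameter names are illustrative assumptions, not from the lecture.)

import random

def make_simple_tabulation(c=4, char_bits=8, out_bits=32):
    # One lookup table of random values per character position;
    # this is the "bit of space" the lecture mentions.
    tables = [[random.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        # Split the (c * char_bits)-bit key x into c characters,
        # look each character up in its own table, and XOR everything.
        out = 0
        for i in range(c):
            out ^= tables[i][(x >> (i * char_bits)) & mask]
        return out

    return h

(To use it with a table of size m, take h(x) % m, or make m a power of 2 and mask off the low bits.)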
856 00:50:02,700 --> 00:50:04,740 And build a binary tree on it because that's 857 00:50:04,740 --> 00:50:06,390 what we like to do. 858 00:50:06,390 --> 00:50:09,900 We do this every lecture pretty much. 859 00:50:09,900 --> 00:50:12,300 This is kind of like ordered file maintenance, I guess. 860 00:50:12,300 --> 00:50:14,400 This is just a conceptual tree. 861 00:50:14,400 --> 00:50:16,380 I mean, you're not even defining an algorithm 862 00:50:16,380 --> 00:50:18,630 based on this because the algorithm is linear probing. 863 00:50:18,630 --> 00:50:19,850 You go in somewhere. 864 00:50:19,850 --> 00:50:22,140 You hop, hop, hop, hop until you find a blank space. 865 00:50:22,140 --> 00:50:23,710 You put your item there. 866 00:50:23,710 --> 00:50:25,770 OK, but each of these nodes defines an interval 867 00:50:25,770 --> 00:50:28,320 in the array, as we know. 868 00:50:28,320 --> 00:50:38,880 So I'm going to call a node dangerous, 869 00:50:38,880 --> 00:50:43,350 essentially if its density is at least 2/3. 870 00:50:43,350 --> 00:50:46,920 But not in the literal sense because there's a little bit 871 00:50:46,920 --> 00:50:48,240 of a subtlety here. 872 00:50:48,240 --> 00:50:50,970 There's the location where a key wants to live, 873 00:50:50,970 --> 00:50:52,740 which is h of that key. 874 00:50:52,740 --> 00:50:56,790 And there's the location where it ended up living. 875 00:50:56,790 --> 00:50:59,400 I care more about the first one because that's 876 00:50:59,400 --> 00:51:00,690 what I understand. 877 00:51:00,690 --> 00:51:03,820 h of x, that's going to be nice. 878 00:51:03,820 --> 00:51:04,770 It's totally random. 879 00:51:04,770 --> 00:51:07,860 So h of x is random, independent of everything else. 880 00:51:07,860 --> 00:51:09,210 Great. 881 00:51:09,210 --> 00:51:12,190 Where x ends up being, that depends on other keys 882 00:51:12,190 --> 00:51:13,830 and it depends on this linear thing 883 00:51:13,830 --> 00:51:15,510 which I'm trying to understand. 884 00:51:15,510 --> 00:51:22,140 So I just want to talk about the number of keys 885 00:51:22,140 --> 00:51:34,935 that hash via h to the interval: a node is dangerous if that is at least 2/3 times 886 00:51:34,935 --> 00:51:36,060 the length of the interval. 887 00:51:38,544 --> 00:51:40,710 This is the number of slots that are actually there. 888 00:51:44,520 --> 00:51:47,100 We expect the number of keys that hash via h to the interval 889 00:51:47,100 --> 00:51:48,090 to be 1/3 of its length. 890 00:51:48,090 --> 00:51:51,240 So the expectation would be 1/3 times the length of the interval. 891 00:51:51,240 --> 00:51:53,850 It could happen to be 2/3; that happens 892 00:51:53,850 --> 00:51:55,770 with some probability, whatever. 893 00:51:55,770 --> 00:51:56,770 That's a dangerous node. 894 00:51:56,770 --> 00:51:59,100 That's the definition. 895 00:51:59,100 --> 00:52:01,752 Those ones we worry will be very expensive. 896 00:52:01,752 --> 00:52:03,960 And we worry that we're going to get super clustering 897 00:52:03,960 --> 00:52:06,560 and then get these giant runs, and so on. 898 00:52:30,100 --> 00:52:33,050 So, one thing I want to compute is 899 00:52:33,050 --> 00:52:37,010 what's the probability of this happening. 900 00:52:37,010 --> 00:52:40,550 The probability of a node being dangerous. 901 00:52:40,550 --> 00:52:42,800 Well, we can again use Chernoff bounds here 902 00:52:42,800 --> 00:52:45,170 because we're in a totally random situation.
903 00:52:45,170 --> 00:52:47,420 So this is the probability that the number 904 00:52:47,420 --> 00:52:49,220 of things that went there was bigger 905 00:52:49,220 --> 00:52:51,520 than twice the expectation. 906 00:52:51,520 --> 00:52:55,370 The expectation is 1/3 of the length; 2/3 is twice 1/3. 907 00:52:55,370 --> 00:52:59,580 So this is the probability that you're at least twice the mean, 908 00:52:59,580 --> 00:53:03,830 which by Chernoff is small. 909 00:53:03,830 --> 00:53:10,825 It comes out to e to the mu over 2 to the 2 mu. 910 00:53:16,460 --> 00:53:19,550 So this is e over 4 to the mu. 911 00:53:19,550 --> 00:53:20,820 You can check e. 912 00:53:20,820 --> 00:53:24,230 It's 2.71828. 913 00:53:24,230 --> 00:53:31,580 So this is less than 1, kind of roughly a half-ish. 914 00:53:31,580 --> 00:53:33,350 So this is good. 915 00:53:33,350 --> 00:53:36,635 This is something like 1 over 2 to the mu. 916 00:53:36,635 --> 00:53:37,340 What's mu? 917 00:53:45,260 --> 00:53:53,340 mu is 1/3 times 2 to the h for a height-h node. 918 00:53:53,340 --> 00:53:55,130 It depends on how high you are. 919 00:53:55,130 --> 00:53:58,980 If you're at a leaf h is 0, so you expect 1/3 of an element 920 00:53:58,980 --> 00:54:00,300 there. 921 00:54:00,300 --> 00:54:01,860 As you go up you expect more elements 922 00:54:01,860 --> 00:54:04,380 to hash there, of course. 923 00:54:04,380 --> 00:54:06,450 OK, so this gives us some measure 924 00:54:06,450 --> 00:54:09,180 in terms of this h of what's going on. 925 00:54:09,180 --> 00:54:11,550 But it's actually doubly exponential in h. 926 00:54:11,550 --> 00:54:13,340 So this is a very small probability. 927 00:54:13,340 --> 00:54:15,030 You go up a few levels. 928 00:54:15,030 --> 00:54:16,670 Like, after log log n levels it's 929 00:54:16,670 --> 00:54:20,040 a polynomially small probability of happening. 930 00:54:20,040 --> 00:54:22,700 Because then 2 to the log log n is log n. 931 00:54:22,700 --> 00:54:26,670 And then e over 4 to the log n is about 1 over n. 932 00:54:26,670 --> 00:54:27,170 OK. 933 00:54:29,930 --> 00:54:35,100 But at small levels this may happen, near the leaves. 934 00:54:35,100 --> 00:54:41,260 All right, so now I want to look at a run in the table. 935 00:54:41,260 --> 00:54:45,270 These are the things I have trouble thinking about 936 00:54:45,270 --> 00:54:49,500 because runs tend to get bigger, and we worry about them. 937 00:54:49,500 --> 00:54:52,620 This is now as items are actually stored in the table: 938 00:54:52,620 --> 00:54:55,950 when do I have a bunch of consecutive items in there 939 00:54:55,950 --> 00:54:59,745 that happen to end up in consecutive slots? 940 00:55:02,280 --> 00:55:05,640 So I'm worried about how long that run is. 941 00:55:05,640 --> 00:55:12,300 So let's look at its logarithm and round 942 00:55:12,300 --> 00:55:13,530 to the nearest power of 2. 943 00:55:13,530 --> 00:55:15,860 So let's say it has length about 2 to the l. 944 00:55:15,860 --> 00:55:17,740 Sorry, plus 1. 945 00:55:20,270 --> 00:55:23,890 All right, between 2 to the l and 2 to the l plus 1. 946 00:55:23,890 --> 00:55:28,830 OK, look at that. 947 00:55:28,830 --> 00:55:44,160 And it's spanned by some number of nodes of height h 948 00:55:44,160 --> 00:55:48,100 equals l minus 3. 949 00:55:48,100 --> 00:55:51,240 OK, so there's some interval that happens to be a run, 950 00:55:51,240 --> 00:55:54,810 meaning all of these slots are occupied.
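(Restating the board computation in LaTeX, using the Chernoff bound from last lecture with deviation factor 2:)

\[
\Pr[X \ge 2\mu] \;\le\; \frac{e^{\mu}}{2^{2\mu}} \;=\; \left(\frac{e}{4}\right)^{\mu},
\qquad
\mu = \frac{1}{3}\,2^{h} \text{ for a height-$h$ node,}
\]

so a node of height $h$ is dangerous with probability at most $(e/4)^{2^{h}/3}$, which is doubly exponentially small in $h$.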
951 00:55:54,810 --> 00:55:58,830 And that's 2 to the 2, I guess, since I 952 00:55:58,830 --> 00:56:00,550 got to level negative 1. 953 00:56:00,550 --> 00:56:02,910 A little hard to do in a small picture. 954 00:56:02,910 --> 00:56:04,800 But we're worried about when this is really 955 00:56:04,800 --> 00:56:08,020 big, more than some constant. 956 00:56:08,020 --> 00:56:11,460 OK, so let's suppose I was looking at this level. 957 00:56:11,460 --> 00:56:14,520 Then this interval is spanned, in particular, 958 00:56:14,520 --> 00:56:15,870 by these two nodes. 959 00:56:15,870 --> 00:56:18,210 Now it's a little sloppy because this node 960 00:56:18,210 --> 00:56:20,700 contains some non-interval, non-run stuff, 961 00:56:20,700 --> 00:56:22,860 and so does this one. 962 00:56:22,860 --> 00:56:26,750 At the next level down it would be this one, this one, 963 00:56:26,750 --> 00:56:28,730 and this one, which is a little more precise. 964 00:56:28,730 --> 00:56:31,680 But it's never going to be quite perfect. 965 00:56:31,680 --> 00:56:34,350 But just take all the nodes you need 966 00:56:34,350 --> 00:56:37,050 to completely cover the run. 967 00:56:37,050 --> 00:56:43,630 Then this will be at least eight nodes because the length is 968 00:56:43,630 --> 00:56:44,190 at least 2 to the l. 969 00:56:44,190 --> 00:56:47,640 We went three levels down, and 2 to the 3 is 8. 970 00:56:47,640 --> 00:56:51,840 So if it's perfectly aligned it will be exactly 8 nodes. 971 00:56:51,840 --> 00:56:56,700 In the worst case, it could be as much as 17. 972 00:56:56,700 --> 00:57:00,750 Because potentially, we're 2 to the l plus 1, 973 00:57:00,750 --> 00:57:03,170 which means we have 16 nodes if we're perfectly aligned. 974 00:57:03,170 --> 00:57:05,100 But then if you shift it over it might be 975 00:57:05,100 --> 00:57:07,750 one more because of the misalignment. 976 00:57:07,750 --> 00:57:10,630 OK, but some constant number of nodes. 977 00:57:10,630 --> 00:57:13,780 It's important that it's at least eight. 978 00:57:13,780 --> 00:57:15,592 That's what we need. 979 00:57:15,592 --> 00:57:17,550 Actually, we just need that it's at least five, 980 00:57:17,550 --> 00:57:22,860 but eight is the nearest power of two rounding up. 981 00:57:22,860 --> 00:57:23,700 Cool. 982 00:57:23,700 --> 00:57:25,500 So, there they are. 983 00:57:25,500 --> 00:57:30,750 Now, I want to look at the first four nodes 984 00:57:30,750 --> 00:57:32,850 of these eight to 17 nodes. 985 00:57:32,850 --> 00:57:35,220 So first meaning leftmost. 986 00:57:35,220 --> 00:57:38,107 Earliest in the run. 987 00:57:38,107 --> 00:57:39,690 So if you think about them, there's 988 00:57:39,690 --> 00:57:43,325 some four nodes, each of them spans some-- 989 00:57:43,325 --> 00:57:45,645 I should draw these properly. 990 00:57:51,090 --> 00:57:56,630 What we know is that these guys are entirely filled with items. 991 00:57:56,630 --> 00:57:58,410 The run occupies here. 992 00:57:58,410 --> 00:58:01,740 It's got to go at least one item into here, but the rest of this 993 00:58:01,740 --> 00:58:02,682 could be empty. 994 00:58:02,682 --> 00:58:04,390 And the interval keeps going to the right, 995 00:58:04,390 --> 00:58:06,600 so we know that all of these are completely 996 00:58:06,600 --> 00:58:09,360 filled with items somehow. 997 00:58:09,360 --> 00:58:13,500 So let's start with how many there are, I guess. 998 00:58:13,500 --> 00:58:26,940 They span more than three times 2 to the h slots of the run.
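(In symbols: a run of length between $2^{l}$ and $2^{l+1}$, covered by height-$(l-3)$ nodes each spanning $2^{l-3}$ slots, needs)

\[
\frac{2^{l}}{2^{l-3}} = 8 \;\le\; \#\text{nodes} \;\le\; \frac{2^{l+1}}{2^{l-3}} + 1 = 17,
\]

(where the extra 1 accounts for the run not being aligned to node boundaries.)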
999 00:58:26,940 --> 00:58:29,815 So somehow 3 times 2 to the h-- 1000 00:58:29,815 --> 00:58:32,190 because there's three of them that are completely filled; 1001 00:58:32,190 --> 00:58:35,490 if all four were, it would be 4 times 2 to the h. 1002 00:58:35,490 --> 00:58:38,490 Somehow 3 times 2 to the h items ended up here. 1003 00:58:38,490 --> 00:58:40,030 Now, how did they end up here? 1004 00:58:40,030 --> 00:58:44,910 Notice there's a blank space right here. 1005 00:58:44,910 --> 00:58:46,860 By definition this was the beginning of a run. 1006 00:58:46,860 --> 00:58:50,380 Meaning the previous slot is empty. 1007 00:58:50,380 --> 00:58:53,580 Which means all of the keys that wanted to live 1008 00:58:53,580 --> 00:58:57,480 from here to the left got to. 1009 00:58:57,480 --> 00:58:59,730 So if we're just thinking about the keys that ended up 1010 00:58:59,730 --> 00:59:03,840 in this interval, they had to initially hash to somewhere 1011 00:59:03,840 --> 00:59:05,160 in here. 1012 00:59:05,160 --> 00:59:08,010 h put them somewhere in this interval 1013 00:59:08,010 --> 00:59:10,290 and then they may have moved to the right, 1014 00:59:10,290 --> 00:59:12,820 but they never move to the left in linear probing 1015 00:59:12,820 --> 00:59:14,490 if you're not completely full. 1016 00:59:14,490 --> 00:59:17,370 So because there was a blank spot here none of these keys 1017 00:59:17,370 --> 00:59:23,310 could have fallen over into here, assuming no deletions. 1018 00:59:23,310 --> 00:59:25,925 So you're only doing insertions. 1019 00:59:25,925 --> 00:59:27,300 They may have just spread out, 1020 00:59:27,300 --> 00:59:28,950 and some of them may have gone farther to the right, 1021 00:59:28,950 --> 00:59:31,170 or they may have filled in gaps, whatever, but h 1022 00:59:31,170 --> 00:59:33,780 put them in this interval. 1023 00:59:33,780 --> 00:59:40,350 Now, I claim that in fact, at least one of these nodes 1024 00:59:40,350 --> 00:59:42,470 must be dangerous. 1025 00:59:42,470 --> 00:59:45,360 Now dangerous is tricky, because dangerous is talking 1026 00:59:45,360 --> 00:59:46,980 about where h puts keys. 1027 00:59:46,980 --> 00:59:51,180 But we just said there have got to be at least 3 times 2 to the h 1028 00:59:51,180 --> 00:59:55,620 keys where h put them within these four nodes, 1029 00:59:55,620 --> 00:59:58,170 otherwise they wouldn't have filled in here. 1030 00:59:58,170 --> 01:00:11,520 Now, if none of those nodes were dangerous, 1031 01:00:11,520 --> 01:00:15,120 then we'll get a contradiction. 1032 01:00:15,120 --> 01:00:17,610 Because if none of them were dangerous, 1033 01:00:17,610 --> 01:00:25,460 this means at most 4 times 2/3 times 2 1034 01:00:25,460 --> 01:00:32,430 to the h keys hash via h to them. 1035 01:00:37,991 --> 01:00:38,490 Why? 1036 01:00:38,490 --> 01:00:40,350 Because there's four of the nodes. 1037 01:00:40,350 --> 01:00:45,210 Each of them, if it's not dangerous, has at most 2/3 1038 01:00:45,210 --> 01:00:50,130 times its size keys hashing there. 1039 01:00:50,130 --> 01:00:58,670 4 times 2/3 is 8/3, which is less than 9/3, which is 3. 1040 01:00:58,670 --> 01:01:01,620 OK, so this would be a contradiction 1041 01:01:01,620 --> 01:01:05,070 because we just argued that at least 3 times 2 to the h keys 1042 01:01:05,070 --> 01:01:10,350 have to hash via h to somewhere in these nodes. 1043 01:01:10,350 --> 01:01:12,434 They might hash here and then have fallen over to here.
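(The arithmetic of the contradiction, written out:)

\[
4 \cdot \frac{2}{3} \cdot 2^{h} \;=\; \frac{8}{3}\,2^{h} \;<\; 3 \cdot 2^{h},
\]

(so if no node were dangerous, fewer keys would hash into the four nodes than the $3 \cdot 2^{h}$ needed to fill them.)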
1044 01:01:12,434 --> 01:01:14,724 So there is this issue that things can move to the right, 1045 01:01:14,724 --> 01:01:15,900 and we've got to worry about it. 1046 01:01:15,900 --> 01:01:19,650 But just look three levels up and it's OK. 1047 01:01:22,146 --> 01:01:24,270 So one of these nodes, not necessarily all of them, 1048 01:01:24,270 --> 01:01:25,314 is dangerous. 1049 01:01:33,380 --> 01:01:36,570 And we can use that to finish our analysis. 1050 01:01:54,690 --> 01:01:58,230 This is good news because it says 1051 01:01:58,230 --> 01:02:00,230 that if we have a run, which is something that's 1052 01:02:00,230 --> 01:02:02,500 hard to think about because 1053 01:02:02,500 --> 01:02:05,540 keys are moving around to form a run, 1054 01:02:05,540 --> 01:02:08,101 we can charge it to a dangerous node. 1055 01:02:08,101 --> 01:02:10,100 Which is easy to think about because that's just 1056 01:02:10,100 --> 01:02:16,190 talking about where keys hash via h, and h is totally random. 1057 01:02:16,190 --> 01:02:19,760 There's a loss of a factor of 17, potentially. 1058 01:02:19,760 --> 01:02:23,350 But it's a constant factor, no big deal. 1059 01:02:23,350 --> 01:02:28,520 If we look at the probability that a run, 1060 01:02:28,520 --> 01:02:37,130 say containing some key x, has length between 2 to the l 1061 01:02:37,130 --> 01:02:41,420 and 2 to the l plus 1, this is going 1062 01:02:41,420 --> 01:02:51,335 to be at most 17 times the probability that a node at height 1063 01:02:51,335 --> 01:02:54,650 l minus 3 is dangerous. 1064 01:02:58,970 --> 01:03:02,790 Because we know one of them is, and so just to be sloppy 1065 01:03:02,790 --> 01:03:05,881 it's at most the sum of the probabilities that any of them 1066 01:03:05,881 --> 01:03:06,380 is. 1067 01:03:06,380 --> 01:03:08,930 Then potentially there's a run of that length. 1068 01:03:08,930 --> 01:03:12,020 And so by a union bound it's at most 17 times the probability 1069 01:03:12,020 --> 01:03:12,830 of this happening. 1070 01:03:12,830 --> 01:03:14,750 Now all nodes look the same because we have 1071 01:03:14,750 --> 01:03:16,940 a totally random hash function. 1072 01:03:16,940 --> 01:03:20,260 So we just say any node at height l minus 3. 1073 01:03:20,260 --> 01:03:22,820 We already computed that probability. 1074 01:03:22,820 --> 01:03:23,630 That was this. 1075 01:03:23,630 --> 01:03:27,980 The probability of being dangerous was e over 4 to the 1/3 times 2 1076 01:03:27,980 --> 01:03:29,810 to the h. 1077 01:03:29,810 --> 01:03:39,250 So this is going to be at most 17 times e over 4 to the 1/3 times 2 1078 01:03:39,250 --> 01:03:42,150 to the l minus 3 power. 1079 01:03:42,150 --> 01:03:48,410 Again, doubly exponential in l. 1080 01:03:48,410 --> 01:03:57,890 So if we want to compute the expected run length 1081 01:03:57,890 --> 01:04:01,280 we can just expand out the definition. 1082 01:04:01,280 --> 01:04:05,765 Well, let's round it to powers of 2. 1083 01:04:05,765 --> 01:04:07,850 It could be the run length is about 2 1084 01:04:07,850 --> 01:04:11,360 to the l, within a constant factor of 2 to the l. 1085 01:04:11,360 --> 01:04:16,474 So it's going to be that times this probability, summed over l. 1086 01:04:21,420 --> 01:04:25,640 But this thing is basically 1 over 2 to the 2 to the l. 1087 01:04:25,640 --> 01:04:27,425 And so the whole thing is constant. 1088 01:04:31,640 --> 01:04:32,579 This is l. 1089 01:04:32,579 --> 01:04:33,870 I mean, l could go to infinity. 1090 01:04:33,870 --> 01:04:34,703 I don't really care.
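(The sum, written out; the constant 17 and the exponent come from the bound just derived:)

\[
\mathbb{E}[\text{run length containing } x]
\;\le\; \sum_{l \ge 0} 2^{l+1} \cdot 17 \left(\frac{e}{4}\right)^{\frac{1}{3}\,2^{l-3}}
\;=\; O(1),
\]

(since the doubly exponential decay dwarfs the $2^{l}$ factor.)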
1091 01:04:37,320 --> 01:04:40,210 I mean, this gets dwarfed by the double exponential. 1092 01:04:40,210 --> 01:04:42,090 This is super geometric. 1093 01:04:42,090 --> 01:04:46,470 So a very low probability of getting long runs. 1094 01:04:46,470 --> 01:04:51,210 As we said, after a log log n size-- 1095 01:04:51,210 --> 01:04:54,690 yeah, it's very unlikely to have a run longer than log n. 1096 01:04:54,690 --> 01:04:57,619 We proved that in particular. 1097 01:04:57,619 --> 01:04:59,910 But in particular, if you compute the expected run length, 1098 01:04:59,910 --> 01:05:02,800 it's constant. 1099 01:05:02,800 --> 01:05:07,010 OK, now this of course assumed totally random. 1100 01:05:07,010 --> 01:05:09,300 It's harder to prove-- 1101 01:05:09,300 --> 01:05:11,790 where were we. 1102 01:05:11,790 --> 01:05:12,660 Somewhere. 1103 01:05:12,660 --> 01:05:14,420 Linear probing. 1104 01:05:14,420 --> 01:05:16,680 It's harder to prove five-wise independence is enough, 1105 01:05:16,680 --> 01:05:17,810 but it's true. 1106 01:05:17,810 --> 01:05:21,090 And it's much harder to prove simple tabulation 1107 01:05:21,090 --> 01:05:22,840 hashing works, but it's true. 1108 01:05:22,840 --> 01:05:24,210 So we can use them. 1109 01:05:24,210 --> 01:05:26,610 This gives you some intuition for why it's really not 1110 01:05:26,610 --> 01:05:28,129 that bad. 1111 01:05:28,129 --> 01:05:29,670 And similar proof techniques are used 1112 01:05:29,670 --> 01:05:33,570 for the five-wise independence. 1113 01:05:33,570 --> 01:05:34,680 Other fun facts. 1114 01:05:34,680 --> 01:05:40,540 You can do a similar caching trick to what we did before. 1115 01:05:40,540 --> 01:05:45,510 Again, the worst run is going to be log, or log over log log. 1116 01:05:45,510 --> 01:05:48,030 I don't have it written here. 1117 01:05:48,030 --> 01:05:52,680 But if you cache the last-- 1118 01:05:52,680 --> 01:05:55,320 it's not quite enough to cache the last log n. 1119 01:05:55,320 --> 01:06:06,852 But if you cache the last log to the 1 plus epsilon n queries, 1120 01:06:06,852 --> 01:06:09,390 which is a little bit more, 1121 01:06:09,390 --> 01:06:11,440 then you can generalize this argument. 1122 01:06:11,440 --> 01:06:16,230 And so at least for totally random hash functions 1123 01:06:16,230 --> 01:06:20,230 you get constant amortized with high probability. 1124 01:06:26,120 --> 01:06:29,690 It's this weird bound that I've never seen before. 1125 01:06:29,690 --> 01:06:34,280 But it's comforting because the expected bounds are not 1126 01:06:34,280 --> 01:06:36,590 so great, but you get a with-high-probability bound 1127 01:06:36,590 --> 01:06:40,040 as long as you're willing to average over log to the 1 1128 01:06:40,040 --> 01:06:41,785 plus epsilon n different queries. 1129 01:06:41,785 --> 01:06:43,160 As long as you can remember them. 1130 01:06:46,880 --> 01:06:49,730 And the proof is basically the same. 1131 01:06:49,730 --> 01:06:52,610 Except now instead of looking at the length of a run 1132 01:06:52,610 --> 01:06:55,220 containing x, you're looking at the length of the run 1133 01:06:55,220 --> 01:06:59,810 containing one of these log to the 1 plus epsilon n keys. 1134 01:06:59,810 --> 01:07:01,520 That's your batch. 1135 01:07:01,520 --> 01:07:03,890 And you do the same thing. 1136 01:07:03,890 --> 01:07:07,140 But now do it with high probability analysis.
1137 01:07:07,140 --> 01:07:09,860 But again, because the expectation is now 1138 01:07:09,860 --> 01:07:13,250 bigger than log, you expect there to be 1139 01:07:13,250 --> 01:07:15,470 a lot of fairly long runs here. 1140 01:07:15,470 --> 01:07:18,440 But that's OK, because the average is good. 1141 01:07:21,696 --> 01:07:23,820 You expect to pay log to the 1 plus epsilon n for log 1142 01:07:23,820 --> 01:07:25,790 to the 1 plus epsilon n queries. 1143 01:07:25,790 --> 01:07:30,860 And so then you divide and amortize and you're done. 1144 01:07:30,860 --> 01:07:33,490 There are a few more details in the notes about 1145 01:07:33,490 --> 01:07:36,020 that if you want to read them. 1146 01:07:36,020 --> 01:07:40,390 I want to do one more topic, unless there are 1147 01:07:40,390 --> 01:07:43,700 questions about linear probing. 1148 01:07:43,700 --> 01:07:44,957 So, yeah? 1149 01:07:44,957 --> 01:07:49,727 AUDIENCE: So, could you motivate why the [INAUDIBLE] value of mu 1150 01:07:49,727 --> 01:07:52,600 is the mean for whatever quantity? 1151 01:07:52,600 --> 01:07:55,250 ERIK DEMAINE: So mu is defined to be the mean of whatever 1152 01:07:55,250 --> 01:07:56,780 quantity we're analyzing. 1153 01:07:56,780 --> 01:08:00,410 And the Chernoff bound says the probability 1154 01:08:00,410 --> 01:08:03,320 that you're at least something times the mean is 1155 01:08:03,320 --> 01:08:05,250 the formula we wrote last time. 1156 01:08:05,250 --> 01:08:09,270 Now here, we're measuring-- 1157 01:08:09,270 --> 01:08:11,652 I didn't write what the left-hand side was. 1158 01:08:11,652 --> 01:08:13,610 But here we're measuring what's the probability 1159 01:08:13,610 --> 01:08:16,370 that the number of keys that hash via h to the interval 1160 01:08:16,370 --> 01:08:19,050 is at least 2/3 the length of the interval. 1161 01:08:19,050 --> 01:08:25,760 Now, let's say m equals 3n. Then the expected number of keys 1162 01:08:25,760 --> 01:08:28,590 that hash via h to the interval is 1/3 times the length 1163 01:08:28,590 --> 01:08:29,899 of the interval. 1164 01:08:29,899 --> 01:08:32,149 Because we have a totally random thing, 1165 01:08:32,149 --> 01:08:35,420 and we have a density of 1/3 overall. 1166 01:08:35,420 --> 01:08:38,270 So you expect there to be 1/3 and so 1167 01:08:38,270 --> 01:08:42,380 dangerous is when you're at least twice that. 1168 01:08:42,380 --> 01:08:44,000 And so it's twice mu. 1169 01:08:44,000 --> 01:08:46,470 Mu is, in this case, 1/3 the length of the interval. 1170 01:08:46,470 --> 01:08:48,482 And that's why I wrote that. 1171 01:08:48,482 --> 01:08:50,294 AUDIENCE: So this comes from the m squared. 1172 01:08:50,294 --> 01:08:50,750 [INAUDIBLE] 1173 01:08:50,750 --> 01:08:52,208 ERIK DEMAINE: Yeah, it comes from m 1174 01:08:52,208 --> 01:08:54,482 equals 3n and totally random. 1175 01:08:54,482 --> 01:08:58,180 AUDIENCE: [INAUDIBLE] 1176 01:08:58,180 --> 01:09:00,720 ERIK DEMAINE: Yeah, OK, let's make this equal. 1177 01:09:00,720 --> 01:09:03,600 Make this more formal. 1178 01:09:03,600 --> 01:09:07,359 It's an assumption, anyway, to simplify the proof. 1179 01:09:07,359 --> 01:09:08,481 Good. 1180 01:09:08,481 --> 01:09:09,689 I'll change that in the notes too. 1181 01:09:16,090 --> 01:09:16,590 Cool. 1182 01:09:16,590 --> 01:09:18,300 So then the expectation is exactly 1183 01:09:18,300 --> 01:09:19,979 1/3 instead of at most 1/3. 1184 01:09:19,979 --> 01:09:21,330 So it's all a little cleaner.
1185 01:09:21,330 --> 01:09:24,420 Of course, this all works when m is at least 1 1186 01:09:24,420 --> 01:09:26,580 plus epsilon times n, but then you 1187 01:09:26,580 --> 01:09:28,590 get a dependence on epsilon. 1188 01:09:28,590 --> 01:09:32,189 Other questions? 1189 01:09:32,189 --> 01:09:36,806 So the bottom line is linear probing is actually good. 1190 01:09:36,806 --> 01:09:39,180 Quadratic probing, double hashing, all those fancy things 1191 01:09:39,180 --> 01:09:40,859 are also good. 1192 01:09:40,859 --> 01:09:42,990 But they're really tuned for the case 1193 01:09:42,990 --> 01:09:44,399 when your table is almost full. 1194 01:09:44,399 --> 01:09:46,210 They get a better dependence on epsilon, 1195 01:09:46,210 --> 01:09:49,800 which is how close to full you are. 1196 01:09:49,800 --> 01:09:53,040 And so if you're a constant factor away from the space bound, 1197 01:09:53,040 --> 01:09:54,660 linear probing is just fine. 1198 01:09:54,660 --> 01:09:57,600 As long as you have enough independence, admittedly. 1199 01:09:57,600 --> 01:10:01,080 Double hashing, I believe, gets around that. 1200 01:10:01,080 --> 01:10:07,030 It does not need so much independence. 1201 01:10:07,030 --> 01:10:08,640 OK. 1202 01:10:08,640 --> 01:10:10,481 Instead of going to double hashing, 1203 01:10:10,481 --> 01:10:12,980 I'm going to go to something kind of related to double hashing, 1204 01:10:12,980 --> 01:10:13,980 which is cuckoo hashing. 1205 01:10:25,340 --> 01:10:29,070 Cuckoo hashing is a weird idea. 1206 01:10:29,070 --> 01:10:32,870 It's kind of a more extreme form of perfect hashing. 1207 01:10:32,870 --> 01:10:41,420 It says, look, perfect hashing did two hash queries. 1208 01:10:41,420 --> 01:10:45,620 So I did one hash evaluation and another hash evaluation 1209 01:10:45,620 --> 01:10:48,680 followed it, which is OK. 1210 01:10:51,770 --> 01:10:57,560 But again, I want my queries to only do two things, two probes. 1211 01:10:57,560 --> 01:11:07,090 So it's going to take that concept of just two 1212 01:11:07,090 --> 01:11:09,460 and actually use two hash tables. 1213 01:11:09,460 --> 01:11:14,080 So you've got B over here, I've got A over here. 1214 01:11:19,270 --> 01:11:23,740 And if you have a key x, you hash it 1215 01:11:23,740 --> 01:11:27,300 to a particular spot in A via g, and you hash it 1216 01:11:27,300 --> 01:11:31,630 to a particular spot in B via h. So you have two hash tables, 1217 01:11:31,630 --> 01:11:32,680 two hash functions. 1218 01:11:42,990 --> 01:11:50,880 To do a query you look at A of g of x, 1219 01:11:50,880 --> 01:11:55,890 and you look at B of h of x. 1220 01:11:55,890 --> 01:11:57,860 Oh sorry, I forgot to mention. 1221 01:11:57,860 --> 01:11:59,610 The other great thing about linear probing 1222 01:11:59,610 --> 01:12:01,740 is that its cache performance is so great. 1223 01:12:01,740 --> 01:12:04,440 This is why it runs so fast in practice. 1224 01:12:04,440 --> 01:12:06,510 Why it's only 10% slower than a memory access. 1225 01:12:06,510 --> 01:12:09,810 Because once you access a single slot, 1226 01:12:09,810 --> 01:12:13,230 you get a whole block of B slots in cache, with block size B. 1227 01:12:13,230 --> 01:12:17,990 So most of the time, because your runs are very short, 1228 01:12:17,990 --> 01:12:20,310 you will find your answer immediately. 1229 01:12:20,310 --> 01:12:22,300 So that's why we kind of prefer linear probing 1230 01:12:22,300 --> 01:12:24,050 in practice over all the other schemes I'm 1231 01:12:24,050 --> 01:12:26,010 going to talk about.
1232 01:12:26,010 --> 01:12:28,230 Well, cuckoo hashing is all right 1233 01:12:28,230 --> 01:12:31,740 because it's only going to look at two places and that's it. 1234 01:12:31,740 --> 01:12:33,330 Doesn't go anywhere else. 1235 01:12:36,980 --> 01:12:41,172 I guess with perfect hashing the thing is you have 1236 01:12:41,172 --> 01:12:42,380 more than two hash functions. 1237 01:12:42,380 --> 01:12:43,730 You have the first hash function which 1238 01:12:43,730 --> 01:12:44,938 sends you to the first table. 1239 01:12:44,938 --> 01:12:46,940 Then you look up a second hash function. 1240 01:12:46,940 --> 01:12:51,012 Using that hash function you rehash your value x. 1241 01:12:51,012 --> 01:12:52,970 The downside of that is you can't compute those two 1242 01:12:52,970 --> 01:12:54,810 hash functions in parallel. 1243 01:12:54,810 --> 01:12:57,020 Here, if you have, like, two cores, you could 1244 01:12:57,020 --> 01:12:59,030 compute these two in parallel, look them 1245 01:12:59,030 --> 01:13:00,500 both up simultaneously. 1246 01:13:00,500 --> 01:13:02,720 So in that sense you save a factor of 2 1247 01:13:02,720 --> 01:13:03,931 with some parallelism. 1248 01:13:07,160 --> 01:13:12,380 Now, the weird thing is the way we do an insertion. 1249 01:13:12,380 --> 01:13:22,950 You try to put it in the A slot, or the B slot. 1250 01:13:22,950 --> 01:13:26,510 If either of them is empty you're golden. 1251 01:13:26,510 --> 01:13:28,010 If neither of them is empty, you've 1252 01:13:28,010 --> 01:13:31,400 got to kick out whoever's there. 1253 01:13:31,400 --> 01:13:41,030 So let's say you kicked out y from its A slot. 1254 01:13:44,360 --> 01:13:47,660 So we ended up putting x in this one, 1255 01:13:47,660 --> 01:13:52,160 so we end up kicking y from wherever it belonged. 1256 01:13:52,160 --> 01:13:59,750 Then you move it to B of h of y. 1257 01:13:59,750 --> 01:14:02,210 There's only one other place that that item can go, 1258 01:14:02,210 --> 01:14:05,060 so you put it there instead. 1259 01:14:05,060 --> 01:14:11,900 In general, think about a key: it has two places it can go. 1260 01:14:11,900 --> 01:14:13,670 There's some slot in A, some slot in B. 1261 01:14:13,670 --> 01:14:17,040 You can think of this as an edge in a bipartite graph. 1262 01:14:17,040 --> 01:14:19,760 So make vertices for the A slots, 1263 01:14:19,760 --> 01:14:22,190 vertices for the B slots. 1264 01:14:22,190 --> 01:14:25,732 Each edge is an item, a key. 1265 01:14:25,732 --> 01:14:28,880 A key can only live in one spot in A and one spot in B 1266 01:14:28,880 --> 01:14:31,890 for this query to work. 1267 01:14:31,890 --> 01:14:34,820 So what's happening is if both of these 1268 01:14:34,820 --> 01:14:37,880 are full you take whoever is currently here 1269 01:14:37,880 --> 01:14:41,510 and put them over in their corresponding slot 1270 01:14:41,510 --> 01:14:43,360 over in B. Now, that one might be full, 1271 01:14:43,360 --> 01:14:45,484 which means you've got to kick that guy to wherever 1272 01:14:45,484 --> 01:14:47,930 he belongs in A, and so on. 1273 01:14:47,930 --> 01:14:51,380 If eventually you find an empty slot, great, you're done. 1274 01:14:51,380 --> 01:14:55,040 It's just a chain reaction of cuckoo steps 1275 01:14:55,040 --> 01:14:57,680 where the bird keeps getting kicked out, going from A 1276 01:14:57,680 --> 01:15:01,010 to B and vice versa. 1277 01:15:01,010 --> 01:15:03,020 If it terminates, you're happy.
1278 01:15:03,020 --> 01:15:05,570 If it doesn't terminate, you're in trouble 1279 01:15:05,570 --> 01:15:09,890 because you might get a cycle, or a few failure situations. 1280 01:15:09,890 --> 01:15:11,240 In that case you're screwed. 1281 01:15:11,240 --> 01:15:13,100 There is no cuckoo hash table that 1282 01:15:13,100 --> 01:15:14,690 works for your set of keys. 1283 01:15:14,690 --> 01:15:16,820 In that case, you pick another hash function, 1284 01:15:16,820 --> 01:15:19,160 rebuild from scratch. 1285 01:15:19,160 --> 01:15:20,840 So it's kind of a weird hashing scheme 1286 01:15:20,840 --> 01:15:24,470 because it can fail catastrophically. 1287 01:15:24,470 --> 01:15:26,510 Fortunately, it doesn't happen too often. 1288 01:15:33,640 --> 01:15:35,810 It still rubs me in a funny way. 1289 01:15:35,810 --> 01:15:37,910 I don't know what to say about it. 1290 01:15:41,450 --> 01:15:44,390 OK, so you lose a factor of 2 in space. 1291 01:15:47,910 --> 01:15:50,360 2 deterministic probes for a query. 1292 01:15:50,360 --> 01:15:53,225 That's good news. 1293 01:15:58,850 --> 01:16:01,760 All right, now we get to, what about updates? 1294 01:16:01,760 --> 01:16:15,510 So if it's fully random or log n-wise independent, 1295 01:16:15,510 --> 01:16:22,110 then you get a constant expected update, which is what we want. 1296 01:16:22,110 --> 01:16:23,520 Even with the rebuilding cost. 1297 01:16:23,520 --> 01:16:28,290 So you'll have to rebuild about every n squared insertions 1298 01:16:28,290 --> 01:16:30,150 you do. 1299 01:16:30,150 --> 01:16:35,670 The way they say this is there's a 1 over n 1300 01:16:35,670 --> 01:16:37,883 build failure probability. 1301 01:16:42,050 --> 01:16:44,530 There's a 1 over n chance that your key set will 1302 01:16:44,530 --> 01:16:47,856 be completely unsustainable. 1303 01:16:47,856 --> 01:16:50,230 If you want to put all n keys into this table there's a 1 1304 01:16:50,230 --> 01:16:52,750 over n chance that it will be impossible 1305 01:16:52,750 --> 01:16:54,430 and then you have to start over. 1306 01:16:54,430 --> 01:16:58,360 So amortized per insertion, that's about 1 over n squared: 1307 01:16:58,360 --> 01:17:00,960 about n squared insertions you can do before the whole thing falls apart 1308 01:17:00,960 --> 01:17:03,130 and you have to rebuild. 1309 01:17:03,130 --> 01:17:04,990 So this should definitely 1310 01:17:04,990 --> 01:17:08,120 be amortized expected, I guess. 1311 01:17:08,120 --> 01:17:10,780 However you want to think about it. 1312 01:17:10,780 --> 01:17:14,920 But it's another way to do constant amortized expected. 1313 01:17:14,920 --> 01:17:15,940 Cool. 1314 01:17:15,940 --> 01:17:22,000 The other thing that's known is that six-wise independence 1315 01:17:22,000 --> 01:17:24,190 is not enough. 1316 01:17:24,190 --> 01:17:26,650 This was actually a project in this class, 1317 01:17:26,650 --> 01:17:30,270 I believe the first time it was offered in 2003. 1318 01:17:30,270 --> 01:17:33,850 Six-wise independence is not sufficient to get 1319 01:17:33,850 --> 01:17:35,560 a constant expected bound. 1320 01:17:38,110 --> 01:17:41,770 It will actually fail with high probability 1321 01:17:41,770 --> 01:17:43,510 if you only have six-wise independence. 1322 01:17:43,510 --> 01:17:46,780 What's not known is, do you need constant independence? 1323 01:17:46,780 --> 01:17:47,850 Or log n independence? 1324 01:17:47,850 --> 01:17:50,830 With log n, very low failure probability. 1325 01:17:50,830 --> 01:17:54,250 With six-wise, high probability you fail.
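(Here is a minimal sketch of cuckoo hashing in Python under those rules. The seeded hashes, the roughly O(log n) kick cutoff before declaring failure, and all names are illustrative assumptions, not the lecture's exact choices; it also simplifies by always trying the A slot first rather than checking both for an empty one.)

import random

class CuckooTable:
    def __init__(self, n):
        self.size = 2 * n                              # factor-2 space per table
        self.max_kicks = 6 * max(1, n.bit_length())    # ~c log n cutoff (assumed)
        self.A = [None] * self.size
        self.B = [None] * self.size
        self.keys = []
        self._new_functions()

    def _new_functions(self):
        # Seeded hashes as stand-ins for the strong hash functions required.
        self.sa = random.getrandbits(64)
        self.sb = random.getrandbits(64)

    def _g(self, x):                                   # slot in A
        return hash((self.sa, x)) % self.size

    def _h(self, x):                                   # slot in B
        return hash((self.sb, x)) % self.size

    def contains(self, x):
        # Exactly two deterministic probes, as in the lecture.
        return self.A[self._g(x)] == x or self.B[self._h(x)] == x

    def insert(self, x):
        if self.contains(x):
            return
        self.keys.append(x)
        cur, in_a = x, True
        for _ in range(self.max_kicks):
            if in_a:
                i = self._g(cur)
                cur, self.A[i] = self.A[i], cur        # place cur, evict occupant
            else:
                i = self._h(cur)
                cur, self.B[i] = self.B[i], cur
            if cur is None:
                return                                 # the cuckoo chain terminated
            in_a = not in_a                            # evicted key tries its other table
        # Chain too long: treat it as a failure, pick new functions,
        # and rebuild from scratch (could in principle recurse again).
        self._rebuild()

    def _rebuild(self):
        self._new_functions()
        self.A = [None] * self.size
        self.B = [None] * self.size
        keys, self.keys = self.keys, []
        for k in keys:
            self.insert(k)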
1326 01:17:54,250 --> 01:17:58,711 Like, you fail with probability 1 minus 1 over n. 1327 01:17:58,711 --> 01:17:59,210 Not so good. 1328 01:18:04,530 --> 01:18:07,610 Some good news is simple tabulation hashing. 1329 01:18:13,230 --> 01:18:23,060 It means you will fail to build with probability not 1 over n, 1330 01:18:23,060 --> 01:18:26,000 but 1 over n to the 1/3 power. 1331 01:18:30,340 --> 01:18:31,190 And this is theta. 1332 01:18:31,190 --> 01:18:33,470 This is tight. 1333 01:18:33,470 --> 01:18:34,850 It's almost as good as this. 1334 01:18:34,850 --> 01:18:36,470 We really only need constant here. 1335 01:18:36,470 --> 01:18:38,940 This is to build the entire table. 1336 01:18:38,940 --> 01:18:41,400 So in this case you can insert like n to the 4/3 1337 01:18:41,400 --> 01:18:43,910 items before your table self-destructs. 1338 01:18:43,910 --> 01:18:47,940 So simple tabulation hashing is, again, pretty good. 1339 01:18:47,940 --> 01:18:51,310 That's, I think, the hardest result 1340 01:18:51,310 --> 01:18:53,000 in this paper from last year. 1341 01:18:59,930 --> 01:19:03,170 So I do have a proof of this one. 1342 01:19:07,107 --> 01:19:07,940 Something like that. 1343 01:19:07,940 --> 01:19:09,710 Or part of a proof. 1344 01:19:09,710 --> 01:19:13,130 So let me give you a rough idea of how this works. 1345 01:19:13,130 --> 01:19:18,620 So suppose you have a fully random hash function. 1346 01:19:18,620 --> 01:19:23,630 The main concern is, what if this path is really long? 1347 01:19:23,630 --> 01:19:36,760 I claim that if an insert follows a path of length k, 1348 01:19:36,760 --> 01:19:41,840 the probability of this happening 1349 01:19:41,840 --> 01:19:44,320 is actually at most 1 over 2 to the k. 1350 01:19:44,320 --> 01:19:45,400 It's very small. 1351 01:19:45,400 --> 01:19:47,130 Exponentially small in k. 1352 01:19:50,590 --> 01:19:53,170 I just want to sketch how this works because it's 1353 01:19:53,170 --> 01:19:58,630 a cool argument that's actually in this simple tabulation 1354 01:19:58,630 --> 01:19:59,802 paper. 1355 01:19:59,802 --> 01:20:01,010 So the idea is the following. 1356 01:20:01,010 --> 01:20:04,760 You have some really long path. 1357 01:20:04,760 --> 01:20:07,870 What I'm going to give you is a way 1358 01:20:07,870 --> 01:20:13,450 to encode the hash functions. 1359 01:20:13,450 --> 01:20:16,300 There are hash functions g and h. 1360 01:20:16,300 --> 01:20:20,380 Each of them has n values. 1361 01:20:20,380 --> 01:20:24,730 Each of those values is log m bits. 1362 01:20:24,730 --> 01:20:27,300 So if I just wrote them down the obvious way, 1363 01:20:27,300 --> 01:20:31,840 it's 2 n log m bits to write down those hash functions. 1364 01:20:31,840 --> 01:20:34,300 Now we're assuming these are totally random hash 1365 01:20:34,300 --> 01:20:36,955 functions, which means you need this many bits. 1366 01:20:36,955 --> 01:20:41,020 But I claim that if you follow a path of length k, 1367 01:20:41,020 --> 01:20:43,690 I can find a new encoding scheme, a way 1368 01:20:43,690 --> 01:20:48,950 to write down g and h, that is basically minus k. 1369 01:20:48,950 --> 01:20:50,920 This many bits minus k. 1370 01:20:50,920 --> 01:20:52,750 I get to save k bits. 1371 01:20:52,750 --> 01:20:54,730 Now, it turns out that can happen 1372 01:20:54,730 --> 01:20:57,480 but it happens only with probability 1 over 2 to the k. 1373 01:20:57,480 --> 01:20:59,920 This is an information theoretic argument. 1374 01:20:59,920 --> 01:21:01,480 You might get lucky.
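(The information-theoretic step can be stated as follows; the variable $b$ here is my shorthand for the total number of random bits, an assumption for exposition. If $(g,h)$ is uniform over $2^{b}$ equally likely outcomes, a fixed injective encoding using $b - k$ bits can cover at most $2^{b-k}$ of them, so)

\[
\Pr\big[(g,h) \text{ is encodable in } b - k \text{ bits}\big]
\;\le\; \frac{2^{\,b-k}}{2^{\,b}} \;=\; 2^{-k}.
\]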
1375 01:21:01,480 --> 01:21:04,690 And the g's and h's you're trying to encode 1376 01:21:04,690 --> 01:21:07,851 can be done with fewer bits, k fewer bits. 1377 01:21:07,851 --> 01:21:10,350 But that will only happen with probability 1 over 2 to the k 1378 01:21:10,350 --> 01:21:12,580 if g and h are totally random. 1379 01:21:12,580 --> 01:21:14,040 So how do you do it? 1380 01:21:14,040 --> 01:21:17,740 Basically, I want to encode the things 1381 01:21:17,740 --> 01:21:20,200 on the path slightly cheaper. 1382 01:21:20,200 --> 01:21:24,350 I'm going to save one bit per node on the path. 1383 01:21:24,350 --> 01:21:26,450 So what do I need to do? 1384 01:21:26,450 --> 01:21:33,220 Well, the idea is, OK, I will start by writing down 1385 01:21:33,220 --> 01:21:34,730 this hash value. 1386 01:21:34,730 --> 01:21:39,549 This takes log m bits to write down that hash value. 1387 01:21:39,549 --> 01:21:41,090 Then I'll write down this hash value. 1388 01:21:41,090 --> 01:21:42,690 That takes another log m bits. 1389 01:21:42,690 --> 01:21:46,150 Generally it's going to be roughly k log m bits to write down 1390 01:21:46,150 --> 01:21:48,520 all of the node hash values. 1391 01:21:48,520 --> 01:21:50,490 Then I need to say that it's actually 1392 01:21:50,490 --> 01:21:54,940 x, this particular key, that corresponds to this edge. 1393 01:21:54,940 --> 01:21:56,680 So I've got to write that down. 1394 01:21:56,680 --> 01:21:59,170 That's going to take log n bits 1395 01:21:59,170 --> 01:22:01,390 to say that x is the guy for the first edge, 1396 01:22:01,390 --> 01:22:04,215 then y is the key that corresponds to the second edge 1397 01:22:04,215 --> 01:22:06,190 of the path, then z, then w. 1398 01:22:06,190 --> 01:22:08,680 But nicely, things are ordered here. 1399 01:22:08,680 --> 01:22:11,800 So it only takes me log n each, k log 1400 01:22:11,800 --> 01:22:14,420 n total, to write down all these guys. 1401 01:22:14,420 --> 01:22:19,540 So I get k times log m plus log n. 1402 01:22:19,540 --> 01:22:37,525 Now if m is 2 times n, this is k times 2 log m minus 1. 1403 01:22:37,525 --> 01:22:43,130 So I get one bit of savings per thing in the path. 1404 01:22:43,130 --> 01:22:46,460 Essentially because it's easier for me 1405 01:22:46,460 --> 01:22:48,090 to write down these labels to say, 1406 01:22:48,090 --> 01:22:50,777 oh, it's the key x that's going here. 1407 01:22:50,777 --> 01:22:53,360 Instead of having to write down slot names all the time, which 1408 01:22:53,360 --> 01:22:56,360 costs log m bits, writing down key names only 1409 01:22:56,360 --> 01:22:58,310 takes log n bits, which is a savings 1410 01:22:58,310 --> 01:23:02,000 of 1 bit per thing on the path. 1411 01:23:02,000 --> 01:23:05,120 And so that was a quick sketch of how this proof goes. 1412 01:23:05,120 --> 01:23:07,600 It's kind of neat, an information theoretic argument 1413 01:23:07,600 --> 01:23:08,960 for why the paths can't get long. 1414 01:23:08,960 --> 01:23:12,350 You then have to worry about cycles and things that 1415 01:23:12,350 --> 01:23:14,170 look like this. 1416 01:23:14,170 --> 01:23:15,500 That's kind of messy. 1417 01:23:15,500 --> 01:23:18,170 But the same kind of argument generalizes. 1418 01:23:18,170 --> 01:23:23,110 So that was your quick overview of lots of hashing stuff.