The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

ERIK DEMAINE: All right, let's get started. Today we're going to continue the theme of randomization and data structures. Last time we saw skip lists. Skip lists solve the predecessor-successor problem: you can search for an item, and if it's not there, you get the closest item on either side, in log n time with high probability. But we already knew how to do that deterministically. Today we're going to solve a slightly different problem, the dictionary problem, with hash tables. Something you already think you know. But we're going to show you how much you didn't know. After today, though, you will know. And we're going to get constant time, not with high probability (that's hard), but constant expected time. So in some sense that's better. It's going to solve a weaker problem.
But we're going to get a tighter bound: constant instead of logarithmic. So for starters, let me remind you what problem we're solving and the basics of hashing, which you learned in 6.006. I'm going to give this problem a name, because it's important, and we often forget to distinguish between two types of things. This is kind of an old term, but I would call this an abstract data type. This is just the problem specification of what you're trying to do; you might call it an interface or something. It's the problem statement, versus the data structure, which is how you actually solve it. Hash tables are the data structure; the dictionary is the problem, or the abstract data type. So what we're trying to do today, as in most data structures, is maintain a dynamic set of items. And here I'm going to distinguish between the items and their keys. Each item has a key. And normally you'd also think of there being a value, like in Python. But we're just worrying about the keys, and moving the items around. And we want to support three operations.
We want to be able to insert an item, delete an item, and search for an item. But search is going to be different from what we know from AVL trees or skip lists or even van Emde Boas. That was a predecessor-successor search. Here we just want to know... sorry, you're not searching for an item; usually you're searching for just a key. Here you just want to know: is there any item with that key? And return it. This is often called an exact search, because if the key is not in there, you learn absolutely nothing. You can't find the nearest key. And for whatever reason this is called the dictionary problem, though it's unlike a real dictionary: usually when you search for a word, you do find its neighbors. Here, if the key's there, we find it; otherwise not. And this is exactly what a Python dictionary implements. So I guess that's why Python dictionaries are called dicts. So today I'm going to assume all items have distinct keys. So in the insertion, I will assume the key is not already in the table.
With a little bit of work, you can allow inserting an item with an existing key, and you just overwrite that existing item. But I don't want to worry about that here. So we could, of course, solve this using an AVL tree in log n time. But our goal is to do better, because it's an easier problem. And I'm going to remind you of the simplest way you learned to do this, which was hashing with chaining, in 6.006. And the catch is, you didn't really analyze this in 6.006. So we're going to aim for constant time per operation (expected, or something like that) and linear space. And remember the variables we care about: there's u, n, and m. u is the size of the universe; that's the space of all possible keys. n is the size of the set you're currently storing, so that's the number of items or keys currently in the data structure. And m is the size of your table, say the number of slots in the table. So you remember the picture. You have a table of slots, let's say 0 to m minus 1. Each of them is a pointer to a linked list.
And if, let's say over here, is your universe of all possible keys, then we have a hash function which maps each universe item into one of these slots. And the linked list at each slot stores all of the items that hash to that slot. So we have a hash function which maps the universe (I'm going to assume the universe has already been mapped into integers 0 to u minus 1) to slots. And when we do hashing with chaining, I think I mentioned this last week, we achieve a bound of 1 plus alpha, where alpha is the load factor n/m. The average number of items you'd expect to hash to a slot is the number of items divided by the number of slots. OK. And you proved this in 6.006, but you assumed something called simple uniform hashing. Simple uniform hashing is an assumption, I think invented for CLRS. It makes the analysis very simple, but it's also basically cheating. So today our goal is to not cheat. It's nice as a warm-up, but we don't like cheating. So you may recall the assumption is about the hash function.
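The chaining scheme just described can be written out as a minimal Python sketch. The hash function passed in here is a placeholder assumption, not any particular construction from the lecture:

```python
class ChainedHashTable:
    """Hashing with chaining: a table of m slots, each a list ("chain")
    of the keys that hash there."""

    def __init__(self, m, h):
        self.m = m                # number of slots
        self.h = h                # hash function: key -> slot in 0..m-1
        self.slots = [[] for _ in range(m)]

    def insert(self, key):
        # assumes key is not already present (distinct keys, as in lecture)
        self.slots[self.h(key)].append(key)

    def delete(self, key):
        self.slots[self.h(key)].remove(key)

    def search(self, key):
        # exact search: return the key if present, else None
        chain = self.slots[self.h(key)]
        return key if key in chain else None
```

Each operation touches a single chain, so the cost per operation is 1 (to hash and index) plus the chain length, whose expectation is what the 1 + alpha analysis is about.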
You want a good hash function, and good means this: I want the probability of two distinct keys mapping to the same slot to be 1/m, if there are m slots. If everything were completely random, if h were basically choosing a random number for every key, then that's what we would expect to happen. So this is the idealized scenario. Now, we can't have a hash function that chooses a random number for every key, because it has to produce the same value if you give it the same key. So it has to be some kind of deterministic strategy, or at least a repeatable strategy, where if you plug in the same key you get the same thing. So really what this assumption is saying is that the keys that you give are, in some sense, random. If I give you random keys and a not-too-crazy hash function, then this will be true. But I don't like assuming anything about the keys; I want my keys to be worst case, maybe. There are lots of examples in the real world where you apply some hash function and it turns out your data has some very particular structure.
And if you choose a bad hash function, then your hash table gets really, really slow. Maybe everything hashes to the same slot. There are lots of examples of that. We want to avoid that. After today you will know how to achieve constant expected time no matter what your keys are, for worst-case keys. But it's going to take some work to do that. So this assumption requires assuming that the keys are random. And this is what we would call an average-case analysis. You might think that average-case analysis is necessary for randomized algorithms, but that's not true. And we saw that last week with quicksort. With quicksort, if you say "I will always choose A[1] to be my partition element," which is what the textbook calls basic quicksort, then for an average input that will do really well. If you have a uniform random permutation of items and you sort by always choosing the first item as your partition, then that will be n log n on average, if your data is average. But we saw we could avoid that assumption by choosing a random pivot.
If you choose a random pivot, then you don't need to assume anything about the input; you just need to assume that the pivots are random. So there's a big difference between assuming your inputs are random versus assuming your coin flips are random. It's pretty reasonable to assume you can flip coins. If you've got enough dexterity in your thumb, then you can do it. But it's not so reasonable to assume that your input is random. So we'd like to avoid average-case analysis whenever we can, and that's the goal of today. What you saw in 6.006 was essentially assuming the inputs are random. We're going to get rid of that unreasonable assumption today.

So that's, in some sense, review from 6.006. I'm going to take a brief pause and tell you about the etymology of the word "hash," in case you're curious. Hash has been an English word since the 1650s, so it's pretty old. It means literally "cut into small pieces." It's usually used in a culinary sense; these days you have corned beef hash or something. I'll put the definition over here. It comes from French, hacher, which means to chop up. You know it in English from the word hatchet.
So it's the same derivation. And it comes from an Old French word (I don't actually know whether that's pronounced "hash-ay" or "hash") which means axe. So you can see the derivation. If you look this up in the OED, or pick your favorite dictionary, or even Google, that's what you find. But in fact there's a new prevailing theory that hash comes from another language, which is Vulcan: la'ash. I mean, you can see the derivation, right? It actually means axe. So maybe French got it from Vulcan, or vice versa, but I think that's pretty clear. Live long and prosper, and farewell to Spock. Sad news of last week.

So, enough about hashing. We'll come back to that in a little bit. But hash functions essentially take up this idea of taking your key, chopping it up into pieces, and mixing it, like in a good dish. All right, so we're going to cover two ways to get strong constant-time bounds. Probably the most useful one is called universal hashing; we'll spend most of our time on that. But the theoretically cooler one is called perfect hashing. With universal hashing, we're going to guarantee there are very few conflicts in expectation.
With perfect hashing, we're going to guarantee there are zero conflicts. The catch is, at least in its obvious form, it only works for static sets. If you forbid insert and delete and just want to do search, then perfect hashing is a good method. So if you're actually storing a dictionary, like the OED: English doesn't change that quickly, so you can afford to recompute your data structure whenever you release a new edition.

But let's start with universal hashing. This is a nice, powerful technique. It works for dynamic data: insert, delete, and search will be constant expected time, with no assumptions about the input. So it will not be average case. It's in some sense worst case, but randomized. So the idea is, we need to do something random. If you just say, "well, I choose one hash function once and for all, and I use that for my table" (OK, maybe my table doubles in size and I change the hash function), there's no randomness there. We need to introduce randomness somehow into this data structure.
And the way we're going to do that is in how we choose the hash function. We're going to choose our hash function randomly from some set of hash functions; call it H. This is going to be a universal hash family. We're going to imagine there are many possible hash functions we could choose. If we choose one of them uniformly at random, that's a random choice. And that randomness is going to be enough that we no longer need to assume anything about the keys. So for that to work, we need some assumption about H. Maybe it's just a set of one hash function; that wouldn't add much randomness. Two also would not add much randomness. We need a lot of them. And so we're going to require H to have this property, and we're going to call the property universality. Generally you would call H a universal hash family: just a set of hash functions. What we want is that, when we choose our hash function h from H, among those choices the probability that two keys hash to the same value is small.
I'll say: for every pair of distinct keys k and k', the probability over the choice of h from H that h(k) = h(k') is at most 1/m. And this is very similar-looking to simple uniform hashing. It looks almost the same, except I switched from k1 and k2 to k and k'; same thing. But what we're taking the probability over, what we're assuming is random, is different. Before, we were assuming k1 and k2 are random, because h was fixed. That was an assumption about the inputs. Over here, we're thinking of k and k' as being fixed; this has to work for every pair of distinct keys. And the probability we're considering is over the distribution of h. So we're trying all the different h's, or rather, choosing little h uniformly at random. We want the probability that a random h makes k and k' collide to be at most 1/m. The other difference is we switched from "equals" to "at most." I mean, less would be better, and there are ways to make it less for a couple of pairs, but it doesn't really matter; anything less than or equal to 1/m will be just as good. So this is an assumption about H. We'll see how to achieve this assumption in a little bit. Let me first prove to you that this is enough.
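To make the definition concrete, here is a sketch of one standard universal family, h_ab(k) = ((a*k + b) mod p) mod m for a prime p at least the universe size (this is the construction from CLRS; the lecture gets to constructions later, so treat this as a preview, not the lecture's own example). For a small prime we can exhaustively verify the universality bound:

```python
def make_family(p, m):
    """All h_ab(k) = ((a*k + b) % p) % m for a in 1..p-1, b in 0..p-1.
    p must be a prime >= the universe size u (keys live in 0..p-1)."""
    return [(lambda k, a=a, b=b: ((a * k + b) % p) % m)
            for a in range(1, p) for b in range(p)]

def collision_probability(family, k1, k2):
    # fraction of functions in the family that make k1 and k2 collide
    hits = sum(1 for h in family if h(k1) == h(k2))
    return hits / len(family)

p, m = 11, 5
family = make_family(p, m)
# universality: for EVERY pair of distinct keys, a uniformly random h
# from the family collides with probability at most 1/m
assert all(collision_probability(family, k1, k2) <= 1 / m
           for k1 in range(p) for k2 in range(p) if k1 != k2)
```

Note how this matches the definition: k1 and k2 are fixed and worst case; the probability is over which h we drew from the family.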
It's going to be basically the same as the 6.006 analysis. But it's worth repeating, just so we're sure everything's OK, and so I can be more precise about what we're assuming. The key difference between this theorem and the 6.006 theorem is that we get to make no assumptions about the keys. They are arbitrary; you get to choose them however you want. But then I choose a random hash function. The hash function cannot depend on these keys, but it's going to be random. And I choose the hash function after you choose the keys. That's important. So we're going to choose a random h in H, and we're assuming H is universal. Then the expected number of keys in a slot, among those n keys, is at most 1 plus alpha, where alpha is n/m. So this is exactly what we had over here. There we were talking about a time bound, but the time bound followed because the length of each chain was expected to be 1 plus alpha. And here the expectation is over the choice of h, not assuming anything about the keys. So let's prove this theorem. It's pretty easy.
But I'm going to introduce some analysis techniques that we will use for more interesting things. So let's give the keys a name. I'll be lazy and use k1 up to kn. And I just want to compute that expectation. So I want to compute, let's say, the expected number of keys colliding with one of those keys, say ki. This is, of course, the size of the slot that ki happens to go to. This is going to work for all i, and so if I can say that this is at most 1 plus alpha for each i, then I have my theorem. It's just another way to talk about it. Now, for the number of keys colliding with ki, here's a general trick: whenever you want to count something in expectation, a very helpful tool is indicator random variables. Let's name all of the different events that we want to count; then we're basically summing those variables. So I'm going to use I_ij as an indicator random variable. It's going to be 1 or 0: 1 if h(ki) equals h(kj), so there's a collision between ki and kj, and 0 if they hash to different slots.
Now, this is a random variable because it depends on h, and h is a random thing. ki and kj are not random; they're given to you. And then I want to know when h maps those two keys to the same slot. And so this count is really just the sum of I_ij over all j. The number of keys colliding with ki is the sum, for j not equal to i, of I_ij, because we get a 1 every time they collide and a 0 otherwise. So that counts how many collide. Once we have it in this notation, we can use all the great lemmas and theorems about, in this case, E, expectation. What should I use here?

STUDENT: What?

ERIK DEMAINE: What's a good... how can I simplify this formula?

STUDENT: Linearity of expectation.

ERIK DEMAINE: Linearity of expectation. Thank you. If you don't know all these things, read the probability appendix in the textbook. So we want to talk about the expectation of the simplest thing possible. Linearity lets us put the E inside the sum without losing anything.
Now, the expectation of an indicator random variable is pretty simple, because the zeros don't contribute to the expectation and the 1's contribute 1. So this is the same thing as just the probability of it being 1. So we get the sum, over j not equal to i, of the probability that I_ij equals 1. And the probability that I_ij equals 1, well, that's the probability that this collision happens. And what's the probability of that? At most 1/m, by universality. I'll write it out: this is the sum, over j not equal to i, of the probability that h maps ki and kj to the same slot. That's the definition of I_ij. And this is at most the sum, over j not equal to i, of 1/m, by universality. So here's where we're using it. And the sum over j not equal to i, well, that's basically n. But I made a mistake here. Slightly off. So this line is wrong. Sorry, let me fix it. Because this assumption only works when the keys are distinct. So in fact... how did I get j... yeah. Yeah, sorry.
Actually, everything I said is true, but I really wanted to count the total number of keys that hash to the same place as ki. So there's one more, which is ki itself: ki always hashes to wherever ki hashes. So I did a summation over j not equal to i, but I should also have a plus I_ii (aye aye, captain). So there's the case of ki hashing to the same place as itself, which of course is always going to happen, so you get a plus 1 everywhere. So that makes me happier, because then I actually get what the theorem said, which is 1 plus alpha. There's always going to be the one guy hashing there, namely ki itself, wherever it goes. So this tells you that if we can find a universal hash family, then we're guaranteed that insert, delete, and search cost order 1 plus alpha in expectation. And the expectation is only over the choice of h, not over the inputs. I think I've stressed that enough times. But the remaining question is, can we actually design a universal hash family? Are there any universal hash families?
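The 1 plus alpha bound can be sanity-checked numerically. The sketch below draws h uniformly from the set of all functions (which is certainly universal, as discussed next) and averages the number of keys landing in k1's slot, including k1 itself; the exact expectation under this family is 1 + (n-1)/m, which is at most 1 + alpha:

```python
import random

def avg_keys_colliding(n, m, trials=20000, seed=1):
    """Average, over random h, of |{j : h(kj) == h(k1)}|, counting k1 itself.
    h maps each of the n keys independently to a uniform slot in 0..m-1."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        slots = [rng.randrange(m) for _ in range(n)]   # h(k1)..h(kn)
        total += sum(1 for s in slots if s == slots[0])  # share k1's slot
    return total / trials

n, m = 20, 10
estimate = avg_keys_colliding(n, m)
# E[count] = 1 + (n-1)/m = 2.9 here, and 1 + alpha = 1 + n/m = 3.0
assert abs(estimate - (1 + (n - 1) / m)) < 0.1
assert estimate <= 1 + n / m + 0.1
```

The "+1" term in the expectation is exactly the I_ii correction from the proof: k1 always collides with itself.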
440 00:27:56,672 --> 00:27:58,505 Otherwise this wouldn't be very interesting. 441 00:28:07,140 --> 00:28:12,990 Let me give you an example of a bad universal hash family. 442 00:28:12,990 --> 00:28:15,505 Sort of an oxymoron but it's possible. 443 00:28:24,190 --> 00:28:25,380 Bad. 444 00:28:25,380 --> 00:28:27,600 Here's a hash family that's universal. 445 00:28:27,600 --> 00:28:32,360 H is the set of all hash functions 446 00:28:32,360 --> 00:28:36,370 h from 0, 1, up to u minus 1, to 0, 1, up to m minus 1. 447 00:28:44,010 --> 00:28:46,790 This is what's normally called uniform hashing. 448 00:28:46,790 --> 00:28:50,350 It makes analysis really easy because you 449 00:28:50,350 --> 00:28:53,240 get to assume-- I mean this says ahead 450 00:28:53,240 --> 00:28:55,500 of time for every universe item, I'm 451 00:28:55,500 --> 00:28:59,300 going to choose a random slot to put it. 452 00:28:59,300 --> 00:29:01,510 And then I'll just remember that. 453 00:29:01,510 --> 00:29:06,030 And so whenever you give me the key, I'll just map it by h. 454 00:29:06,030 --> 00:29:10,420 And I get a consistent slot and definitely it's universal. 455 00:29:10,420 --> 00:29:13,460 What's bad about this hash function? 456 00:29:13,460 --> 00:29:14,541 Many things but-- 457 00:29:17,427 --> 00:29:22,520 STUDENT: [INAUDIBLE] That's just as hard as the problem I'm 458 00:29:22,520 --> 00:29:23,020 solving. 459 00:29:23,020 --> 00:29:23,820 ERIK DEMAINE: Sort of. 460 00:29:23,820 --> 00:29:25,460 I'm begging the question that it's just 461 00:29:25,460 --> 00:29:27,240 as hard as the problem I'm solving. 462 00:29:27,240 --> 00:29:31,022 And what, algorithmically, what goes wrong here? 463 00:29:31,022 --> 00:29:32,230 There are two things I guess. 464 00:29:38,304 --> 00:29:38,804 Yeah? 465 00:29:38,804 --> 00:29:40,730 STUDENT: It's not deterministic? 466 00:29:40,730 --> 00:29:42,650 ERIK DEMAINE: It's not deterministic.
467 00:29:42,650 --> 00:29:45,275 That's OK because we're allowing randomization 468 00:29:45,275 --> 00:29:46,960 in this algorithm. 469 00:29:46,960 --> 00:29:49,100 So I mean how I would compute this 470 00:29:49,100 --> 00:29:52,610 is I would do a for loop over all universe items. 471 00:29:52,610 --> 00:29:54,980 And I assume I have a way to generate a random number 472 00:29:54,980 --> 00:29:56,840 between 0 and m minus 1. 473 00:29:56,840 --> 00:29:58,570 That's legitimate. 474 00:29:58,570 --> 00:30:01,342 But there's something bad about that algorithm. 475 00:30:01,342 --> 00:30:02,550 STUDENT: It's not consistent. 476 00:30:02,550 --> 00:30:03,758 ERIK DEMAINE: Not consistent? 477 00:30:03,758 --> 00:30:06,470 It is consistent if I precompute for every universe item 478 00:30:06,470 --> 00:30:07,700 where to map it. 479 00:30:07,700 --> 00:30:08,720 That's good. 480 00:30:08,720 --> 00:30:10,670 So all these things are actually OK. 481 00:30:10,670 --> 00:30:12,540 STUDENT: It takes too much time and space. 482 00:30:12,540 --> 00:30:14,498 ERIK DEMAINE: It takes too much time and space. 483 00:30:14,498 --> 00:30:16,460 Yeah. 484 00:30:16,460 --> 00:30:19,380 That's the bad thing. 485 00:30:19,380 --> 00:30:22,640 It's hard to isolate in a bad thing what is so bad about it. 486 00:30:22,640 --> 00:30:29,710 But we need u time to compute all those random numbers. 487 00:30:29,710 --> 00:30:32,540 And we need u space to store that hash function. 488 00:30:32,540 --> 00:30:37,270 In order to get to the consistency we have to-- Oops. 489 00:30:37,270 --> 00:30:38,850 Good catch. 490 00:30:38,850 --> 00:30:40,350 In order to get consistency, we need 491 00:30:40,350 --> 00:30:43,840 to keep track of all those hash function values. 492 00:30:43,840 --> 00:30:47,524 And that's not good. 493 00:30:47,524 --> 00:30:49,440 You could try to not store them all, you know, 494 00:30:49,440 --> 00:30:50,400 use a hash table.
495 00:30:50,400 --> 00:30:53,620 But you can't use a hash table to store a hash function. 496 00:30:53,620 --> 00:30:58,180 That would be-- that would be infinite recursion. 497 00:30:58,180 --> 00:31:00,100 So but at least they're out there. 498 00:31:00,100 --> 00:31:03,510 So the challenge is to find an efficient hash family that 499 00:31:03,510 --> 00:31:05,690 doesn't take much space to store and doesn't 500 00:31:05,690 --> 00:31:07,850 take much time to compute. 501 00:31:07,850 --> 00:31:09,786 OK, we're allowing randomness. 502 00:31:19,720 --> 00:31:21,280 But we don't want too much randomness. 503 00:31:21,280 --> 00:31:23,620 We can't afford u units of time of randomness. 504 00:31:23,620 --> 00:31:25,630 I mean u could be huge. 505 00:31:25,630 --> 00:31:28,800 We're only doing n operations probably on this hash table. 506 00:31:28,800 --> 00:31:31,030 u could be way bigger than n. 507 00:31:31,030 --> 00:31:33,400 We don't want to have to precompute this giant table 508 00:31:33,400 --> 00:31:35,170 and then use it for like five steps. 509 00:31:35,170 --> 00:31:38,220 It would be really, really slow even amortized. 510 00:31:38,220 --> 00:31:42,542 So here's one that I will analyze. 511 00:31:42,542 --> 00:31:45,000 And there's another one in the textbook which I'll mention. 512 00:31:49,800 --> 00:31:53,359 This one's a little bit simpler to analyze. 513 00:31:53,359 --> 00:31:55,650 We're going to need a little bit of number theory, just 514 00:31:55,650 --> 00:31:57,610 prime numbers. 515 00:31:57,610 --> 00:32:02,240 And you've probably heard of the idea of your hash table size 516 00:32:02,240 --> 00:32:03,400 being prime. 517 00:32:03,400 --> 00:32:05,729 Here you'll see why that's useful, 518 00:32:05,729 --> 00:32:06,770 at least for this family. 519 00:32:06,770 --> 00:32:08,860 You don't always need primality, but it's 520 00:32:08,860 --> 00:32:11,320 going to make this family work.
521 00:32:11,320 --> 00:32:14,430 So I'm going to assume that my table size is prime. 522 00:32:14,430 --> 00:32:17,716 Now really my table size is doubling, 523 00:32:17,716 --> 00:32:18,840 so that's a little awkward. 524 00:32:18,840 --> 00:32:21,550 But luckily there are algorithms given a number 525 00:32:21,550 --> 00:32:23,170 to find a nearby prime number. 526 00:32:23,170 --> 00:32:25,150 We're not going to cover that here, 527 00:32:25,150 --> 00:32:27,500 but that's an algorithmic number theory thing. 528 00:32:27,500 --> 00:32:29,860 And in polylogarithmic time, I guess 529 00:32:29,860 --> 00:32:33,340 you can find a nearby prime number. 530 00:32:33,340 --> 00:32:35,220 So you want it to be a power of 2. 531 00:32:35,220 --> 00:32:38,390 And you'll just look around for nearby prime numbers. 532 00:32:38,390 --> 00:32:41,090 And then we have a prime that's about the same size so that 533 00:32:41,090 --> 00:32:45,550 will work just as well from a table doubling perspective. 534 00:32:45,550 --> 00:32:49,810 Then furthermore, for convenience, 535 00:32:49,810 --> 00:32:53,740 I'm going to assume that u is an integer power of m. 536 00:33:01,404 --> 00:33:06,489 I want my universe to be a power of that prime. 537 00:33:06,489 --> 00:33:08,530 I mean, if it isn't, just make u a little bigger. 538 00:33:08,530 --> 00:33:10,113 It's OK if u gets bigger as long as it 539 00:33:10,113 --> 00:33:13,450 covers all of the same items. 540 00:33:13,450 --> 00:33:19,340 Now once I view my universe as a power of the table size, 541 00:33:19,340 --> 00:33:23,140 a natural thing to do is take my universe items, 542 00:33:23,140 --> 00:33:27,530 to take my input integers, and think of them in base m. 543 00:33:27,530 --> 00:33:29,730 So that's what I'm going to do. 544 00:33:29,730 --> 00:33:37,880 I'm going to view a key k in base m. 
545 00:33:37,880 --> 00:33:41,640 Whenever I have a key, I can think of it 546 00:33:41,640 --> 00:33:51,850 as a vector of subkeys, k0 up to kr minus 1. 547 00:33:51,850 --> 00:33:57,024 These are the r digits in base m because of this relation. 548 00:33:57,024 --> 00:33:59,190 And I don't even care which is the least significant 549 00:33:59,190 --> 00:34:00,606 and which is the most significant. 550 00:34:00,606 --> 00:34:02,630 That won't matter so whatever, whichever order 551 00:34:02,630 --> 00:34:05,130 you want to think of it. 552 00:34:05,130 --> 00:34:08,830 And each of the ki's here I guess 553 00:34:08,830 --> 00:34:11,550 is between 0 and m minus 1. 554 00:34:17,480 --> 00:34:18,680 So far so good. 555 00:34:37,760 --> 00:34:40,670 So with this perspective, the base m perspective, 556 00:34:40,670 --> 00:34:45,469 I can define a dot product hash function as follows. 557 00:34:45,469 --> 00:34:48,520 It's going to be parametrized by another key, 558 00:34:48,520 --> 00:34:52,865 I'll call it a, which we can think of again as a vector. 559 00:34:57,380 --> 00:35:03,040 I want to define h sub a of k. 560 00:35:03,040 --> 00:35:04,790 So this is parametrized by a, but it's 561 00:35:04,790 --> 00:35:10,910 a function of a given key k as the dot product 562 00:35:10,910 --> 00:35:13,135 of those two vectors mod m. 563 00:35:16,390 --> 00:35:19,930 So remember dot products are just the sum from i 564 00:35:19,930 --> 00:35:26,800 equals 0 to r minus 1 of ai times ki. 565 00:35:26,800 --> 00:35:31,230 I want to do all of that modulo m. 566 00:35:31,230 --> 00:35:33,992 We'll worry about how long this takes 567 00:35:33,992 --> 00:35:37,710 to compute in a moment I guess. 568 00:35:37,710 --> 00:35:40,680 Maybe very soon. 569 00:35:40,680 --> 00:35:45,690 But the hash family h is just all of these ha's 570 00:35:45,690 --> 00:35:48,560 for all possible choices of a. 571 00:35:52,276 --> 00:35:56,860 a was a key so it comes from the universe u.
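As a concrete sketch of the dot-product family just described (the code and names here are mine, not from the lecture): split each key into its r base-m digits, dot it with the digits of one uniformly random key a, and reduce mod m. This assumes m prime and u = m^r as on the board.

```python
import random

def base_m_digits(k, m, r):
    """The r base-m digits of key k; their order doesn't matter for the analysis."""
    out = []
    for _ in range(r):
        out.append(k % m)
        k //= m
    return out

def make_dot_product_hash(m, r, seed=None):
    """Draw h_a from the family: h_a(k) = (sum_i a_i * k_i) mod m,
    where a is one uniformly random key from the universe u = m^r."""
    rng = random.Random(seed)
    a_digits = base_m_digits(rng.randrange(m ** r), m, r)
    def h(k):
        return sum(ai * ki
                   for ai, ki in zip(a_digits, base_m_digits(k, m, r))) % m
    return h

h = make_dot_product_hash(m=7, r=3, seed=0)  # universe u = 343, table size 7
print(h(5) == h(5))                          # consistent: same key, same slot
```

Note that storing h is just storing the r digits of a, one key's worth of space, in contrast to the u-sized table that uniform hashing would need.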
572 00:36:01,770 --> 00:36:04,530 And so what that means is to do universal hashing, 573 00:36:04,530 --> 00:36:07,650 I want to choose one of these ha's uniformly at random. 574 00:36:07,650 --> 00:36:08,700 How do I do that? 575 00:36:08,700 --> 00:36:11,410 I just choose a uniformly at random. 576 00:36:11,410 --> 00:36:12,390 Pretty easy. 577 00:36:12,390 --> 00:36:16,230 It's one random value, one random key. 578 00:36:16,230 --> 00:36:19,500 So that should take constant time and constant space 579 00:36:19,500 --> 00:36:22,660 to store one number. 580 00:36:22,660 --> 00:36:28,100 In general we're in a world called the Word RAM model. 581 00:36:28,100 --> 00:36:31,800 This is actually-- I guess m stands for model 582 00:36:31,800 --> 00:36:33,610 so I shouldn't write model. 583 00:36:33,610 --> 00:36:38,240 Random access machine, which you may have heard of. 584 00:36:38,240 --> 00:36:42,980 The word RAM assumes that in general we're 585 00:36:42,980 --> 00:36:45,180 manipulating integers. 586 00:36:45,180 --> 00:36:49,550 And the integers fit in a word. 587 00:36:49,550 --> 00:36:51,430 And the computational assumption is 588 00:36:51,430 --> 00:36:55,176 that manipulating a constant number of words 589 00:36:55,176 --> 00:36:56,800 and doing essentially any operation you 590 00:36:56,800 --> 00:37:01,060 want on a constant number of words takes constant time. 591 00:37:06,530 --> 00:37:08,430 And the other part of the word RAM model 592 00:37:08,430 --> 00:37:11,770 is to assume that the things you care about fit in a word. 593 00:37:16,950 --> 00:37:24,950 Say individual data values, here we're talking about keys, 594 00:37:24,950 --> 00:37:28,190 fit in a word. 595 00:37:28,190 --> 00:37:30,590 This is what you need to assume in [INAUDIBLE] 596 00:37:30,590 --> 00:37:33,830 that you can compute high of x in constant time or low 597 00:37:33,830 --> 00:37:35,470 of x in constant time.
598 00:37:35,470 --> 00:37:38,250 Here I'm going to use it to assume that we can compute 599 00:37:38,250 --> 00:37:41,790 h sub a of k in constant time. 600 00:37:41,790 --> 00:37:44,010 In practice this would be done by implementing 601 00:37:44,010 --> 00:37:46,870 this computation, this dot product computation, 602 00:37:46,870 --> 00:37:48,540 in hardware. 603 00:37:48,540 --> 00:37:53,420 And the reason a 64-bit addition on a modern processor 604 00:37:53,420 --> 00:37:56,359 or a 32-bit one on most phones takes constant time 605 00:37:56,359 --> 00:37:58,150 is because there's hardware that's designed 606 00:37:58,150 --> 00:38:00,050 to do that really fast. 607 00:38:00,050 --> 00:38:03,990 And in general we're assuming that the things we care about 608 00:38:03,990 --> 00:38:06,137 fit in a single word. 609 00:38:06,137 --> 00:38:08,720 And we're assuming random access and that we can have arrays. 610 00:38:08,720 --> 00:38:10,720 That's what we need in order to store a table. 611 00:38:10,720 --> 00:38:12,930 And same thing in [INAUDIBLE], we needed to assume we 612 00:38:12,930 --> 00:38:13,430 had arrays. 613 00:38:16,772 --> 00:38:18,400 And I think this operation is actually 614 00:38:18,400 --> 00:38:22,540 pretty-- exists in Intel architectures in some form. 615 00:38:22,540 --> 00:38:25,117 But it's certainly not a normal operation. 616 00:38:25,117 --> 00:38:26,700 If you're going to do this explicitly, 617 00:38:26,700 --> 00:38:28,340 adding up and multiplying things, this 618 00:38:28,340 --> 00:38:34,900 would take r terms, where r is the log base m of u, so it's kind of logish time. 619 00:38:34,900 --> 00:38:39,970 Maybe I'll mention another hash family that's 620 00:38:39,970 --> 00:38:41,985 more obviously computable. 621 00:38:45,499 --> 00:38:46,540 But I won't analyze it here. 622 00:38:46,540 --> 00:38:48,000 It's analyzed in the textbook. 623 00:38:48,000 --> 00:38:52,450 So if you're curious you can check it out there.
624 00:38:52,450 --> 00:38:56,600 Let's call this just another. 625 00:39:15,620 --> 00:39:17,520 It's a bit weird because it has two mods. 626 00:39:17,520 --> 00:39:19,100 You take mod p and then mod m. 627 00:39:19,100 --> 00:39:22,010 But the main computation is very simple. 628 00:39:22,010 --> 00:39:24,390 You choose a uniformly random value a. 629 00:39:24,390 --> 00:39:29,640 You multiply it by your key in usual binary multiplication 630 00:39:29,640 --> 00:39:30,900 instead of dot product. 631 00:39:30,900 --> 00:39:34,070 And then you add another uniformly random key. 632 00:39:34,070 --> 00:39:36,360 This is also universal. 633 00:39:36,360 --> 00:39:44,660 So H is hab for all a and b that are keys. 634 00:39:48,629 --> 00:39:50,420 So if you're not happy with this assumption 635 00:39:50,420 --> 00:39:52,272 that you can compute this in constant time, 636 00:39:52,272 --> 00:39:53,980 you should be happy with this assumption. 637 00:39:53,980 --> 00:39:56,396 If you believe in addition and multiplication and division 638 00:39:56,396 --> 00:39:58,690 being constant time, then this will be constant time. 639 00:40:01,860 --> 00:40:03,640 So both of these families are universal. 640 00:40:03,640 --> 00:40:06,150 I'm going to prove that this one is universal because it's 641 00:40:06,150 --> 00:40:06,790 a little bit easier. 642 00:40:06,790 --> 00:40:07,290 Yeah? 643 00:40:07,290 --> 00:40:09,640 STUDENT: Is this p a choice that you made? 644 00:40:09,640 --> 00:40:10,640 ERIK DEMAINE: OK, right. 645 00:40:10,640 --> 00:40:11,704 What is p? 646 00:40:11,704 --> 00:40:19,030 p just has to be bigger than u, and it should be prime. 647 00:40:19,030 --> 00:40:20,940 It's not random. 648 00:40:20,940 --> 00:40:24,550 You can just choose one prime that's bigger than your universe 649 00:40:24,550 --> 00:40:26,120 size, and this will work.
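For reference, this "other" family is the one in CLRS: h_ab(k) = ((a*k + b) mod p) mod m, with p a fixed prime at least as large as the universe, a drawn from {1, ..., p-1}, and b from {0, ..., p-1}. Here's a minimal sketch of mine, not from the lecture; the naive trial-division prime search is just for illustration (the real algorithmic number theory mentioned above is faster):

```python
import random

def make_mult_hash(m, u, seed=None):
    """Draw h_ab(k) = ((a*k + b) % p) % m from the CLRS universal family.
    p is a fixed prime >= u; only a and b are random."""
    def is_prime(x):
        if x < 2:
            return False
        d = 2
        while d * d <= x:
            if x % d == 0:
                return False
            d += 1
        return True

    p = max(u, 2)
    while not is_prime(p):      # naive search for a prime >= u (sketch only)
        p += 1
    rng = random.Random(seed)
    a = rng.randrange(1, p)     # a in {1, ..., p-1} (nonzero, per CLRS)
    b = rng.randrange(p)        # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m

h = make_mult_hash(m=8, u=1000, seed=0)   # here p works out to 1009
print(all(0 <= h(k) < 8 for k in range(1000)))
```

Note the small difference from the spoken description: in the CLRS version a is restricted to be nonzero, since a = 0 would hash every key to the same slot.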
650 00:40:26,120 --> 00:40:29,340 STUDENT: [INAUDIBLE] 651 00:40:32,390 --> 00:40:33,860 ERIK DEMAINE: I forget whether you 652 00:40:33,860 --> 00:40:35,160 have to assume that m is prime. 653 00:40:35,160 --> 00:40:37,630 I'd have to check. 654 00:40:37,630 --> 00:40:43,250 I'm guessing not, but don't quote me on that. 655 00:40:43,250 --> 00:40:46,760 Check the section in the textbook. 656 00:40:46,760 --> 00:40:47,920 So good. 657 00:40:47,920 --> 00:40:50,112 Easy to compute. 658 00:40:50,112 --> 00:40:52,570 The analysis is similar, but it's a little bit easier here. 659 00:40:52,570 --> 00:40:56,070 Essentially this is very much like a product, 660 00:40:56,070 --> 00:40:59,720 but there are no carries here. 661 00:40:59,720 --> 00:41:02,140 When we do a dot product instead of just multiplying-- 662 00:41:02,140 --> 00:41:04,500 multiplying in base m 663 00:41:04,500 --> 00:41:07,150 would give the same thing as multiplying in base 2, 664 00:41:07,150 --> 00:41:10,330 but with carries from one m-sized digit to the next one. 665 00:41:10,330 --> 00:41:12,280 And that's just more annoying to think about. 666 00:41:12,280 --> 00:41:14,535 So here we're essentially getting rid of carries. 667 00:41:14,535 --> 00:41:17,170 So it's in some sense even easier to compute. 668 00:41:17,170 --> 00:41:20,305 And in both cases, it's universal. 669 00:41:24,370 --> 00:41:34,140 So we want to prove this property. 670 00:41:34,140 --> 00:41:39,200 That if we choose a random a then the probability 671 00:41:39,200 --> 00:41:42,170 of two keys, k and k' which are distinct mapping 672 00:41:42,170 --> 00:41:49,450 via h to the same value is at most 1/m. So let's prove that. 673 00:42:06,450 --> 00:42:12,422 So we're given two keys. 674 00:42:12,422 --> 00:42:14,130 We have no control over them because this 675 00:42:14,130 --> 00:42:16,645 has to work for all keys that are distinct.
676 00:42:22,430 --> 00:42:24,550 The only thing we know is that they're distinct. 677 00:42:24,550 --> 00:42:27,267 Now if two keys are distinct, then their vectors 678 00:42:27,267 --> 00:42:27,975 must be distinct. 679 00:42:27,975 --> 00:42:29,360 If two vectors are distinct, that 680 00:42:29,360 --> 00:42:32,269 means at least one item must be different. 681 00:42:32,269 --> 00:42:33,185 Should sound familiar. 682 00:42:39,870 --> 00:42:43,240 So this was like in the matrix multiplication verification 683 00:42:43,240 --> 00:42:46,420 algorithm that [INAUDIBLE] taught. 684 00:42:46,420 --> 00:42:54,855 So k and k' differ in some digit. 685 00:42:58,190 --> 00:42:59,440 Let's call that digit d. 686 00:43:02,902 --> 00:43:06,830 So k sub d is different from k sub d'. 687 00:43:09,370 --> 00:43:14,590 And I want to compute this probability. 688 00:43:14,590 --> 00:43:15,530 We'll rewrite it. 689 00:43:33,970 --> 00:43:36,450 The probability is over a. 690 00:43:36,450 --> 00:43:38,400 I'm choosing a uniformly at random. 691 00:43:38,400 --> 00:43:39,900 I want another probability that that 692 00:43:39,900 --> 00:43:43,520 maps k and k' to the same slot. 693 00:43:43,520 --> 00:43:47,210 So let me just write out the definition. 694 00:43:47,210 --> 00:43:58,750 It's probability over a that the dot product of a and k 695 00:43:58,750 --> 00:44:12,180 is the same thing as when I do the dot product with k' mod m. 696 00:44:12,180 --> 00:44:15,620 These two, that sum should come out the same, mod m. 697 00:44:19,570 --> 00:44:25,210 So let me move this part over to this side because in both cases 698 00:44:25,210 --> 00:44:26,490 we have the same ai. 699 00:44:26,490 --> 00:44:28,920 So I can group terms and say this 700 00:44:28,920 --> 00:44:45,640 is the probability-- probability sum over i 701 00:44:45,640 --> 00:44:50,900 equals 0 to r minus 1 of ai times ki minus 702 00:44:50,900 --> 00:44:54,420 ki prime equals 0. 703 00:44:57,660 --> 00:44:58,160 Mod m. 
704 00:45:12,380 --> 00:45:14,750 OK, no pun intended. 705 00:45:14,750 --> 00:45:19,430 Now we care about this digit d. 706 00:45:19,430 --> 00:45:22,210 d is a place where we know that this is non-zero. 707 00:45:22,210 --> 00:45:28,270 So let me separate out the terms for d and everything but d. 708 00:45:28,270 --> 00:45:34,630 So this is the same as the probability of-- let's do the d term first, 709 00:45:34,630 --> 00:45:41,920 so we have ad times kd minus kd prime. 710 00:45:41,920 --> 00:45:43,240 That's one term. 711 00:45:43,240 --> 00:45:46,860 I'm going to write the summation of i 712 00:45:46,860 --> 00:45:56,485 not equal to d of ai ki minus ki prime. 713 00:45:56,485 --> 00:45:58,110 These ones, some of them might be zero. 714 00:45:58,110 --> 00:45:58,990 Some are not. 715 00:45:58,990 --> 00:46:00,850 We're not going to worry about it. 716 00:46:00,850 --> 00:46:03,105 It's enough to just isolate one term that is non-zero. 717 00:46:08,550 --> 00:46:11,520 So this thing we know does not equal zero. 718 00:46:14,370 --> 00:46:15,350 Cool. 719 00:46:15,350 --> 00:46:17,850 Here's where I'm going to use a little bit of number theory. 720 00:46:17,850 --> 00:46:20,360 I haven't yet used that m is prime. 721 00:46:20,360 --> 00:46:27,120 I required m is prime because when you're working modulo m, 722 00:46:27,120 --> 00:46:30,470 you have multiplicative inverses. 723 00:46:30,470 --> 00:46:32,430 Because this is not zero, there is 724 00:46:32,430 --> 00:46:35,190 something I can multiply on both sides 725 00:46:35,190 --> 00:46:40,800 and get this to cancel out and become one. 726 00:46:40,800 --> 00:46:43,650 For every value x there is a value y 727 00:46:43,650 --> 00:46:46,170 so that x times y equals 1 modulo m. 728 00:46:46,170 --> 00:46:48,070 And you can even compute it in constant time 729 00:46:48,070 --> 00:46:50,410 in a reasonable model.
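The multiplicative inverse mod a prime m that this step relies on is easy to compute in practice. One standard way (a sketch of mine, not something from the board) is Fermat's little theorem: for x not divisible by m, x^(m-1) = 1 (mod m), so x^(m-2) mod m is the inverse of x.

```python
def inverse_mod_prime(x, m):
    """Multiplicative inverse of x modulo a prime m, for x not = 0 mod m.
    Fermat: x^(m-1) = 1 (mod m), so x^(m-2) * x = 1 (mod m)."""
    return pow(x, m - 2, m)        # fast built-in modular exponentiation

m = 7
print(inverse_mod_prime(3, m))     # 3 * 5 = 15 = 1 (mod 7), so this prints 5
assert all((x * inverse_mod_prime(x, m)) % m == 1 for x in range(1, m))
```

The extended Euclidean algorithm is the other standard route and works for any modulus coprime to x, not just primes.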
730 00:46:50,410 --> 00:47:08,290 So then I can say I want the probability that ad is minus 731 00:47:08,290 --> 00:47:12,030 kd minus kd prime inverse. 732 00:47:12,030 --> 00:47:14,520 This is the multiplicative inverse I was talking about. 733 00:47:14,520 --> 00:47:20,680 And then the sum i not equal to d whatever, I don't actually 734 00:47:20,680 --> 00:47:27,264 care what this is too much, I've already done the equals part. 735 00:47:27,264 --> 00:47:28,430 I still need to write mod m. 736 00:47:31,240 --> 00:47:36,520 The point is this is all about ad. 737 00:47:36,520 --> 00:47:38,850 Remember we're choosing a uniformly at random. 738 00:47:38,850 --> 00:47:40,700 That's the same thing as choosing 739 00:47:40,700 --> 00:47:45,896 each of the ai's independently uniformly at random. 740 00:47:45,896 --> 00:47:47,372 Yeah? 741 00:47:47,372 --> 00:47:53,276 STUDENT: Is the second line over there isolating d [INAUDIBLE]? 742 00:47:53,276 --> 00:47:54,667 Second from the top. 743 00:47:54,667 --> 00:47:55,500 ERIK DEMAINE: Which? 744 00:47:55,500 --> 00:47:56,020 This one? 745 00:47:56,020 --> 00:47:56,520 STUDENT: No up. 746 00:47:56,520 --> 00:47:57,311 ERIK DEMAINE: This? 747 00:47:57,311 --> 00:47:58,398 STUDENT: Down. 748 00:47:58,398 --> 00:47:58,898 That one. 749 00:47:58,898 --> 00:47:59,380 No. 750 00:47:59,380 --> 00:48:00,171 The one below that. 751 00:48:00,171 --> 00:48:01,790 ERIK DEMAINE: Yes. 752 00:48:01,790 --> 00:48:03,730 STUDENT: Is that line isolating d or is that-- 753 00:48:03,730 --> 00:48:04,438 ERIK DEMAINE: No. 754 00:48:04,438 --> 00:48:05,490 I haven't isolated d yet. 755 00:48:05,490 --> 00:48:06,970 This is all the terms. 756 00:48:06,970 --> 00:48:08,970 And then going from this line to this one, 757 00:48:08,970 --> 00:48:12,910 I'm just pulling out the i equals d term. 758 00:48:12,910 --> 00:48:13,650 That's this term. 759 00:48:13,650 --> 00:48:16,324 And then separating out the i not equal to d. 
760 00:48:16,324 --> 00:48:17,090 STUDENT: I get it. 761 00:48:17,090 --> 00:48:17,380 ERIK DEMAINE: Right? 762 00:48:17,380 --> 00:48:18,980 This sum is just the same as that sum. 763 00:48:18,980 --> 00:48:20,480 But I've done the d term explicitly. 764 00:48:20,480 --> 00:48:21,063 STUDENT: Sure. 765 00:48:21,063 --> 00:48:21,780 I get it. 766 00:48:24,310 --> 00:48:27,150 ERIK DEMAINE: So I've done all this rewriting 767 00:48:27,150 --> 00:48:29,780 because I know that ad is chosen uniformly at random. 768 00:48:29,780 --> 00:48:34,340 Here we have this thing, this monstrosity, 769 00:48:34,340 --> 00:48:36,890 but it does not depend on ad. 770 00:48:36,890 --> 00:48:39,300 In fact it is independent of ad. 771 00:48:39,300 --> 00:48:44,570 I'm going to write this as a function of k and k' 772 00:48:44,570 --> 00:48:46,850 because those are given to us and fixed. 773 00:48:46,850 --> 00:48:50,310 And then it's also a function of a0 and a1. 774 00:48:50,310 --> 00:48:53,520 Everything except d. 775 00:48:53,520 --> 00:49:01,920 So ad minus 1, ad plus 1, and so on up to ar minus 1. 776 00:49:01,920 --> 00:49:03,700 This is awkward to write. 777 00:49:03,700 --> 00:49:06,230 But everything except ad appears here 778 00:49:06,230 --> 00:49:09,230 because we have i not equal to d. 779 00:49:09,230 --> 00:49:13,040 And these ai's are random variables. 780 00:49:13,040 --> 00:49:16,460 But we're assuming that they're all chosen independently 781 00:49:16,460 --> 00:49:17,910 from each other. 782 00:49:17,910 --> 00:49:21,720 So I don't really care what's going on in this function. 783 00:49:21,720 --> 00:49:22,660 It's something. 784 00:49:22,660 --> 00:49:24,390 And if I rewrite this probability, 785 00:49:24,390 --> 00:49:27,636 it's the probability over the choice of a. 786 00:49:27,636 --> 00:49:31,720 I can separate out the choice of all these things 787 00:49:31,720 --> 00:49:35,320 from the choice of ad. 
788 00:49:35,320 --> 00:49:39,560 And this is just a useful formula. 789 00:49:39,560 --> 00:49:43,500 I'm going to write a not equal to d. 790 00:49:43,500 --> 00:49:48,400 All the other-- maybe I'll write a sub i not equal to d. 791 00:49:48,400 --> 00:49:51,080 All the choices of those guys separately 792 00:49:51,080 --> 00:49:59,700 from the probability, over the choice of ad, of ad 793 00:49:59,700 --> 00:50:00,895 equaling this function. 794 00:50:05,090 --> 00:50:08,200 If you just think about the definition of expectation, 795 00:50:08,200 --> 00:50:09,560 this is doing the same thing. 796 00:50:09,560 --> 00:50:12,780 We're thinking of first choosing the ai's where 797 00:50:12,780 --> 00:50:14,370 i is not equal to d. 798 00:50:14,370 --> 00:50:15,970 And then we choose ad. 799 00:50:15,970 --> 00:50:19,470 And this computation will come out the same as that. 800 00:50:25,110 --> 00:50:28,720 But this is the probability of a uniformly random number 801 00:50:28,720 --> 00:50:31,680 equaling something. 802 00:50:31,680 --> 00:50:35,950 So we just need to think about-- sorry. 803 00:50:35,950 --> 00:50:37,470 Important. 804 00:50:37,470 --> 00:50:39,810 That would be pretty unlikely-- that would be 1/u-- 805 00:50:39,810 --> 00:50:42,970 but this is all working modulo m. 806 00:50:42,970 --> 00:50:45,760 So if I just take a uniformly random integer, 807 00:50:45,760 --> 00:50:49,530 the chance of it hitting any particular value mod m is 1/m. 808 00:50:53,011 --> 00:50:54,010 And that's universality. 809 00:50:57,430 --> 00:51:02,500 So in this case, you get exactly 1/m, no less than or equal to. 810 00:51:02,500 --> 00:51:06,440 Sorry, I should have written it's the expectation of 1/m, 811 00:51:06,440 --> 00:51:12,540 but that's 1/m because 1/m has no random parts in it. 812 00:51:12,540 --> 00:51:13,412 Yeah?
813 00:51:13,412 --> 00:51:15,220 STUDENT: How do we know that the, 814 00:51:15,220 --> 00:51:19,735 that this expression doesn't have any biases in the sense 815 00:51:19,735 --> 00:51:23,832 that it doesn't give more, more, like if you give it 816 00:51:23,832 --> 00:51:26,718 the uniform distribution of numbers, 817 00:51:26,718 --> 00:51:28,642 it doesn't spit out more numbers than others 818 00:51:28,642 --> 00:51:30,514 and that could potentially-- 819 00:51:30,514 --> 00:51:31,930 ERIK DEMAINE: Oh, so you're asking 820 00:51:31,930 --> 00:51:35,360 how do we know that this hash family doesn't 821 00:51:35,360 --> 00:51:38,085 prefer some slots over others, I guess. 822 00:51:38,085 --> 00:51:41,378 STUDENT: Of course like after the equals sign, 823 00:51:41,378 --> 00:51:46,219 like in this middle line in the middle. 824 00:51:46,219 --> 00:51:46,760 Middle board. 825 00:51:46,760 --> 00:51:47,718 ERIK DEMAINE: This one? 826 00:51:47,718 --> 00:51:49,304 Oh, this one. 827 00:51:49,304 --> 00:51:50,220 STUDENT: Middle board. 828 00:51:50,220 --> 00:51:51,531 ERIK DEMAINE: Middle board. 829 00:51:51,531 --> 00:51:52,030 Here. 830 00:51:52,030 --> 00:51:53,056 STUDENT: Yes. 831 00:51:53,056 --> 00:51:54,680 So how do we know that if you give it-- 832 00:51:54,680 --> 00:51:55,846 ERIK DEMAINE: This function. 833 00:51:55,846 --> 00:51:59,760 STUDENT: --random variables, it won't prefer certain numbers 834 00:51:59,760 --> 00:52:00,480 over others? 835 00:52:00,480 --> 00:52:03,560 ERIK DEMAINE: So this function may prefer some numbers 836 00:52:03,560 --> 00:52:04,940 over others. 837 00:52:04,940 --> 00:52:06,300 But it doesn't matter. 838 00:52:06,300 --> 00:52:08,310 All we need is that this function 839 00:52:08,310 --> 00:52:10,285 is independent of our choice of ad. 
840 00:52:10,285 --> 00:52:12,500 So you can think of this function, 841 00:52:12,500 --> 00:52:15,010 you choose all of these random-- actually k and k' 842 00:52:15,010 --> 00:52:18,179 are not random-- but you choose all these random numbers. 843 00:52:18,179 --> 00:52:19,220 Then you evaluate your f. 844 00:52:19,220 --> 00:52:20,970 Maybe it always comes out to 5. 845 00:52:20,970 --> 00:52:21,470 Who knows. 846 00:52:21,470 --> 00:52:23,090 It could be super biased. 847 00:52:23,090 --> 00:52:26,430 But then you choose ad uniformly at random. 848 00:52:26,430 --> 00:52:29,350 So the chance of ad equaling 5 is the same 849 00:52:29,350 --> 00:52:31,730 as the chance of ad equaling 3. 850 00:52:31,730 --> 00:52:34,410 So in all cases, you get the probability is 1/m. 851 00:52:34,410 --> 00:52:36,020 What we need is independence. 852 00:52:36,020 --> 00:52:39,220 We need that the ad is chosen independently from the other 853 00:52:39,220 --> 00:52:39,810 ai's. 854 00:52:39,810 --> 00:52:42,210 But we don't need to know anything about f other 855 00:52:42,210 --> 00:52:44,640 than it doesn't depend on ad. 856 00:52:44,640 --> 00:52:48,690 And we made it not depend on ad because I isolated ad 857 00:52:48,690 --> 00:52:50,600 by pulling it out of that summation. 858 00:52:50,600 --> 00:52:53,370 So we know there's no ad's over here. 859 00:52:53,370 --> 00:52:56,110 Good question. 860 00:52:56,110 --> 00:52:58,825 You get a bonus Frisbee for your question. 861 00:53:01,500 --> 00:53:02,890 All right. 862 00:53:02,890 --> 00:53:06,520 That ends universal hashing. 863 00:53:06,520 --> 00:53:08,010 Any more questions? 864 00:53:08,010 --> 00:53:10,390 So at this point we have at least one 865 00:53:10,390 --> 00:53:12,350 universal hash family. 866 00:53:12,350 --> 00:53:15,790 So we're just choosing, in this case, a uniformly at random. 867 00:53:15,790 --> 00:53:19,400 In the other method, we choose a and b uniformly at random.
868 00:53:19,400 --> 00:53:23,040 And then we build our hash table. 869 00:53:23,040 --> 00:53:25,667 And the hash function depends on m. 870 00:53:25,667 --> 00:53:27,500 So also every time we double our table size, 871 00:53:27,500 --> 00:53:29,570 we're going to have to choose a new hash function 872 00:53:29,570 --> 00:53:32,340 for the new value of m. 873 00:53:32,340 --> 00:53:34,440 And that's about it. 874 00:53:34,440 --> 00:53:38,870 So this will give us constant expected time-- or in general 1 875 00:53:38,870 --> 00:53:42,480 plus alpha if you're not doing table doubling-- for insert, 876 00:53:42,480 --> 00:53:45,230 delete, and exact search. 877 00:53:45,230 --> 00:53:49,230 Just building on the hashing with chaining. 878 00:53:49,230 --> 00:53:50,760 And so this is a good method. 879 00:53:50,760 --> 00:53:51,551 Question? 880 00:53:51,551 --> 00:53:54,497 STUDENT: Why do you say expected value of the probability? 881 00:53:54,497 --> 00:53:58,430 Isn't it sufficient to just say the probability of [INAUDIBLE]? 882 00:53:58,430 --> 00:54:02,210 ERIK DEMAINE: Uh, yeah, I wanted to isolate-- 883 00:54:02,210 --> 00:54:05,400 it is the overall probability of this happening. 884 00:54:05,400 --> 00:54:07,140 I rewrote it this way because I wanted 885 00:54:07,140 --> 00:54:09,640 to think about first choosing the ai's where i does not 886 00:54:09,640 --> 00:54:12,225 equal d and then choosing ad. 887 00:54:12,225 --> 00:54:14,180 So this probability was supposed to be only 888 00:54:14,180 --> 00:54:15,626 over the choice of ad. 889 00:54:15,626 --> 00:54:17,700 And you have to do something with the other ai's 890 00:54:17,700 --> 00:54:18,470 because they're random. 891 00:54:18,470 --> 00:54:20,345 You can't just say, what's the probability ad 892 00:54:20,345 --> 00:54:21,800 equaling a random variable? 893 00:54:21,800 --> 00:54:23,300 That's a little sketchy. 894 00:54:23,300 --> 00:54:25,255 I wanted to have no random variables over all. 
895 00:54:25,255 --> 00:54:28,460 So I have to kind of bind those variables with something. 896 00:54:28,460 --> 00:54:32,480 And I just want to see what the-- This doesn't really 897 00:54:32,480 --> 00:54:35,860 affect very much, but to make this algebraically 898 00:54:35,860 --> 00:54:38,610 correct I need to say what the a_i's, i not 899 00:54:38,610 --> 00:54:41,490 equal to d, are doing. 900 00:54:41,490 --> 00:54:43,214 Other questions? 901 00:54:43,214 --> 00:54:43,714 Yeah. 902 00:54:43,714 --> 00:54:45,922 STUDENT: Um, I'm a bit confused about your definition 903 00:54:45,922 --> 00:54:50,546 of the collision in the lower left board. 904 00:54:50,546 --> 00:54:53,357 Why are you adding i's [INAUDIBLE]? 905 00:54:53,357 --> 00:54:54,440 ERIK DEMAINE: Yeah, sorry. 906 00:54:54,440 --> 00:54:56,320 This is a funny notion of colliding. 907 00:54:56,320 --> 00:54:58,560 I just mean I want to count the number of keys that 908 00:54:58,560 --> 00:55:00,350 hash to the same slot as ki. 909 00:55:00,350 --> 00:55:04,106 STUDENT: So it's not necessarily like a collision [INAUDIBLE]. 910 00:55:04,106 --> 00:55:05,480 ERIK DEMAINE: You may not call it 911 00:55:05,480 --> 00:55:08,250 a collision when it collides with itself, yeah. 912 00:55:08,250 --> 00:55:11,050 Whatever you want to call it. 913 00:55:11,050 --> 00:55:14,920 But I just mean hashing to the same slot as ki. 914 00:55:14,920 --> 00:55:15,470 Yeah. 915 00:55:15,470 --> 00:55:17,960 Just because I want to count the total length of the chain. 916 00:55:17,960 --> 00:55:21,210 I don't want to count the number of collisions in the chain. 917 00:55:21,210 --> 00:55:21,710 Sorry. 918 00:55:21,710 --> 00:55:23,220 Probably a poor choice of word. 
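For concreteness, the kind of family we have been choosing from can be sketched in a few lines of Python. This is a minimal sketch assuming the multiplication-style construction where a and b are chosen uniformly at random, h(k) = ((a*k + b) mod p) mod m with p prime; the particular prime and the function names are illustrative, not the lecture's exact choices.

```python
import random

# One random member of a universal hash family:
# h(k) = ((a*k + b) mod p) mod m, with p a prime larger than any key.
# p = 2^61 - 1 is a Mersenne prime, so it covers keys up to 61 bits.
def make_universal_hash(m, p=(1 << 61) - 1):
    a = random.randrange(1, p)   # a must be nonzero
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

h = make_universal_hash(8)
assert all(0 <= h(k) < 8 for k in range(1000))
# Because h depends on m, doubling the table size means
# drawing a fresh hash function for the new m.
```

Note that the random draw happens once, when the function is built; after that h is a fixed, deterministic function, which is exactly why it must be redrawn whenever m changes.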
919 00:55:27,720 --> 00:55:31,179 We're hashing because we're taking our key, 920 00:55:31,179 --> 00:55:32,720 we're cutting it up into little bits, 921 00:55:32,720 --> 00:55:35,480 and then we're mixing them up just like a good corned beef 922 00:55:35,480 --> 00:55:38,550 hash or something. 923 00:55:38,550 --> 00:55:41,140 All right let's move on to perfect hashing. 924 00:55:41,140 --> 00:55:44,950 This is more exciting I would say. 925 00:55:44,950 --> 00:55:48,690 Even cooler-- this was cool from a probability perspective, 926 00:55:48,690 --> 00:55:50,590 depending on your notion of cool. 927 00:55:50,590 --> 00:55:53,280 This method will be cool from a data structures perspective 928 00:55:53,280 --> 00:55:54,610 and a probability perspective. 929 00:55:57,630 --> 00:56:01,600 But so far data structures are what we know from 006. 930 00:56:01,600 --> 00:56:06,080 Now we're going to go up a level, literally. 931 00:56:06,080 --> 00:56:08,930 We're going to have two levels. 932 00:56:08,930 --> 00:56:12,410 So here we're solving-- you can actually make this data 933 00:56:12,410 --> 00:56:13,420 structure dynamic. 934 00:56:13,420 --> 00:56:15,800 But we're going to solve the static dictionary 935 00:56:15,800 --> 00:56:24,530 problem which is when you have no inserts and deletes. 936 00:56:24,530 --> 00:56:26,195 You're given the keys up front. 937 00:56:29,760 --> 00:56:30,900 You're given n keys. 938 00:56:30,900 --> 00:56:34,560 You want to build a table that supports search. 939 00:56:38,424 --> 00:56:39,590 And that's it. 940 00:56:39,590 --> 00:56:42,550 You want search to be constant time 941 00:56:42,550 --> 00:56:51,820 and perfect hashing, also known as FKS hashing 942 00:56:51,820 --> 00:56:55,390 because it was invented by Fredman, Komlos, and Szemeredi 943 00:56:55,390 --> 00:56:59,270 in 1984. 944 00:56:59,270 --> 00:57:09,520 What we will achieve is constant time worst case for search. 
945 00:57:16,670 --> 00:57:19,010 So that's a little better because here we're 946 00:57:19,010 --> 00:57:22,000 just doing constant expected time for search. 947 00:57:22,000 --> 00:57:26,260 But it's worse in that we have to know the keys in advance. 948 00:57:26,260 --> 00:57:30,620 We're going to get linear space in the worst case. 949 00:57:40,250 --> 00:57:41,750 And then the remaining question is 950 00:57:41,750 --> 00:57:44,870 how long does it take you to build this data structure? 951 00:57:44,870 --> 00:57:47,570 And for now I'll just say it's polynomial time. 952 00:57:47,570 --> 00:57:49,820 It's actually going to be nearly linear. 953 00:57:56,750 --> 00:58:00,150 And this is also an expected bound. 954 00:58:00,150 --> 00:58:05,111 Actually, with high probability would be a little stronger 955 00:58:05,111 --> 00:58:05,610 here. 956 00:58:07,745 --> 00:58:09,620 So it's going to take us a little bit of time 957 00:58:09,620 --> 00:58:11,536 to build this structure, but once you have it, 958 00:58:11,536 --> 00:58:12,925 you have the perfect scenario. 959 00:58:12,925 --> 00:58:14,300 There's going to be in some sense 960 00:58:14,300 --> 00:58:16,591 no collisions in our hash table, so it will be constant 961 00:58:16,591 --> 00:58:19,936 time for search and linear space. 962 00:58:19,936 --> 00:58:20,810 So that part's great. 963 00:58:20,810 --> 00:58:24,710 The only catch is it's static. 964 00:58:24,710 --> 00:58:30,500 But beggars can't be choosers I guess. 965 00:58:30,500 --> 00:58:31,040 All right. 966 00:58:34,102 --> 00:58:36,060 I'm not sure who's begging in that analogy but. 967 00:58:40,370 --> 00:58:41,900 The keys who want to be stored. 968 00:58:41,900 --> 00:58:43,580 I don't know. 969 00:58:43,580 --> 00:58:48,170 All right, so the big idea for perfect hashing 970 00:58:48,170 --> 00:58:49,350 is to use two levels. 971 00:58:55,710 --> 00:58:57,980 So let me draw a picture. 
972 00:58:57,980 --> 00:59:04,240 We have our universe, and we're mapping that via hash function 973 00:59:04,240 --> 00:59:07,180 h1 into a table. 974 00:59:07,180 --> 00:59:08,640 Look familiar? 975 00:59:08,640 --> 00:59:11,540 Exactly the diagram I drew before. 976 00:59:11,540 --> 00:59:14,970 It's going to have some table size m. 977 00:59:14,970 --> 00:59:22,090 And we're going to set m to be within a constant factor of n. 978 00:59:22,090 --> 00:59:25,410 So right now it looks exactly like regular-- 979 00:59:25,410 --> 00:59:27,380 and it's going to be a universal, 980 00:59:27,380 --> 00:59:30,160 h1 is chosen from a universal hash family, 981 00:59:30,160 --> 00:59:34,080 so universal hashing applies. 982 00:59:34,080 --> 00:59:38,760 The trouble is we're going to get some lists here. 983 00:59:38,760 --> 00:59:44,755 And we don't want to store the set of colliding elements, 984 00:59:44,755 --> 00:59:47,380 the set of elements that hash to that place, with a linked list 985 00:59:47,380 --> 00:59:50,550 because linked lists are slow. 986 00:59:50,550 --> 00:59:53,692 Instead we're going to store them using a hash table. 987 00:59:53,692 --> 00:59:56,000 It sounds crazy. 988 00:59:56,000 --> 01:00:01,340 But we're going to have-- so this is position 1. 989 01:00:01,340 --> 01:00:04,470 This is going to be h2,1. 990 01:00:04,470 --> 01:00:10,500 There's going to be another hash function h2,0 that maps to some 991 01:00:10,500 --> 01:00:11,230 other hash table. 992 01:00:11,230 --> 01:00:14,180 These hash tables are going to be of varying sizes. 993 01:00:14,180 --> 01:00:19,300 Some of them will be of size 0 because nothing hashes there. 994 01:00:19,300 --> 01:00:21,420 But in general each of these slots 995 01:00:21,420 --> 01:00:25,570 is going to map instead of to a linked list to a hash table. 996 01:00:25,570 --> 01:00:31,260 So this would be h2, m minus 1. 
997 01:00:31,260 --> 01:00:33,840 I'm going to guarantee in the second level of hashing 998 01:00:33,840 --> 01:00:34,960 there are zero collisions. 999 01:00:50,590 --> 01:00:53,130 Let that sink in a little bit. 1000 01:00:53,130 --> 01:00:56,100 Let me write down a little more carefully what I'm doing. 1001 01:01:09,050 --> 01:01:12,330 So h1 is picked from a universal hash family, 1002 01:01:20,220 --> 01:01:25,420 where m is theta n. 1003 01:01:25,420 --> 01:01:27,680 I want to put a theta-- I mean I could say m equals n, 1004 01:01:27,680 --> 01:01:29,810 but sometimes we require m to be a prime. 1005 01:01:29,810 --> 01:01:32,164 So I'm going to give you some slop in how you choose m. 1006 01:01:32,164 --> 01:01:33,830 So it can be prime or whatever you want. 1007 01:01:36,370 --> 01:01:37,880 And then at the first level we're 1008 01:01:37,880 --> 01:01:40,810 basically doing hashing with chaining. 1009 01:01:40,810 --> 01:01:50,580 And now I want to look at each slot in that hash table. 1010 01:01:50,580 --> 01:01:51,650 So between 0 and m-1. 1011 01:01:55,520 --> 01:02:02,150 I'm going to let lj be the number of keys that hash there-- 1012 01:02:02,150 --> 01:02:05,520 it's the length of the list that would go there. 1013 01:02:05,520 --> 01:02:07,130 It's going to be the number of keys, 1014 01:02:07,130 --> 01:02:23,730 among just the n keys, hashing to slot j. 1015 01:02:26,760 --> 01:02:30,200 So now the big question is, if I have lj keys here, 1016 01:02:30,200 --> 01:02:31,830 how big do I make that table? 1017 01:02:31,830 --> 01:02:33,750 You might say, well, I make it theta lj. 1018 01:02:33,750 --> 01:02:34,750 That's what I always do. 1019 01:02:34,750 --> 01:02:36,640 But that's not what I'm going to do. 1020 01:02:36,640 --> 01:02:38,380 That wouldn't help. 1021 01:02:38,380 --> 01:02:40,490 We'd get exactly, I think, the same number 1022 01:02:40,490 --> 01:02:44,450 of collisions if we did that, more or less, in expectation. 
1023 01:02:44,450 --> 01:02:49,340 So we're going to do something else. 1024 01:02:49,340 --> 01:02:54,150 We're going to pick a hash function from a universal 1025 01:02:54,150 --> 01:02:55,300 family, h2,j. 1026 01:02:58,720 --> 01:03:00,170 It again maps from the same universe. 1027 01:03:05,510 --> 01:03:08,200 The key thing is the size of the hash table 1028 01:03:08,200 --> 01:03:13,227 I'm going to choose, which is lj squared. 1029 01:03:32,510 --> 01:03:37,890 So if there are 3 elements that happen to hash to this slot, 1030 01:03:37,890 --> 01:03:42,730 this table will have size 9. 1031 01:03:42,730 --> 01:03:43,920 So it's mostly empty. 1032 01:03:43,920 --> 01:03:47,180 Only a square root fraction-- if that's a word, if that's 1033 01:03:47,180 --> 01:03:48,930 a phrase-- will be full. 1034 01:03:48,930 --> 01:03:50,050 Most of it's empty. 1035 01:03:50,050 --> 01:03:50,735 Why squared? 1036 01:03:53,370 --> 01:03:55,820 Any ideas? 1037 01:03:55,820 --> 01:03:59,460 I claim this will guarantee zero collisions with decent chance. 1038 01:03:59,460 --> 01:03:59,960 Yeah. 1039 01:03:59,960 --> 01:04:01,860 STUDENT: With 1/2 probability you're 1040 01:04:01,860 --> 01:04:03,474 going to end up with no collisions. 1041 01:04:03,474 --> 01:04:04,890 ERIK DEMAINE: With 1/2 probability 1042 01:04:04,890 --> 01:04:05,880 I'm going to end up with no collisions. 1043 01:04:05,880 --> 01:04:06,290 Why? 1044 01:04:06,290 --> 01:04:06,998 What's it called? 1045 01:04:09,516 --> 01:04:11,219 STUDENT: Markov [INAUDIBLE] 1046 01:04:11,219 --> 01:04:13,260 ERIK DEMAINE: Markov's inequality would prove it. 1047 01:04:13,260 --> 01:04:17,970 But it's more commonly known as the, whoa, 1048 01:04:17,970 --> 01:04:21,020 as the birthday paradox. 1049 01:04:21,020 --> 01:04:25,280 So the whole name of the game here is the birthday paradox. 
1050 01:04:25,280 --> 01:04:29,315 If I have, how's it go, if I have n 1051 01:04:29,315 --> 01:04:33,450 squared people with n possible birthdays then-- 1052 01:04:33,450 --> 01:04:35,430 is that the right way? 1053 01:04:35,430 --> 01:04:36,240 No, less. 1054 01:04:36,240 --> 01:04:40,280 If I have n people and n squared possible birthdays, 1055 01:04:40,280 --> 01:04:42,700 the probability of getting a collision, a shared birthday, 1056 01:04:42,700 --> 01:04:44,390 is 1/2. 1057 01:04:44,390 --> 01:04:46,740 Normally we think of that as a funny thing. 1058 01:04:46,740 --> 01:04:48,860 You know, if I choose a fair number of people, 1059 01:04:48,860 --> 01:04:51,330 then I get immediately a collision. 1060 01:04:51,330 --> 01:04:52,830 I'm going to do it the opposite way. 1061 01:04:52,830 --> 01:04:56,130 I'm going to guarantee that there's so many birthdays 1062 01:04:56,130 --> 01:04:59,560 that no 2 of them will collide with probability 1/2. 1063 01:04:59,560 --> 01:05:00,430 No, 1/2 is not great. 1064 01:05:00,430 --> 01:05:01,430 We're going to fix that. 1065 01:05:08,230 --> 01:05:11,880 So actually I haven't given you the whole algorithm yet. 1066 01:05:11,880 --> 01:05:14,050 There are two steps, 1 and 2. 1067 01:05:14,050 --> 01:05:19,920 But there are also two other steps 1.5 and 2.5. 1068 01:05:19,920 --> 01:05:22,290 But this is the right idea and this will make 1069 01:05:22,290 --> 01:05:23,660 things work in expectation. 1070 01:05:23,660 --> 01:05:26,020 But I'm going to tweak it a little bit. 1071 01:05:28,810 --> 01:05:30,650 So first let me tell you step 1.5. 1072 01:05:30,650 --> 01:05:33,170 It fits in between the two. 1073 01:05:33,170 --> 01:05:38,100 I want that the space of this data structure is linear. 1074 01:05:38,100 --> 01:05:40,050 So I need to make sure it is. 
1075 01:05:40,050 --> 01:05:48,840 If the sum j equals 0 to m minus 1 of lj squared 1076 01:05:48,840 --> 01:05:50,570 is bigger than some constant times 1077 01:05:50,570 --> 01:05:55,250 n-- we'll figure out what the constant is later-- then redo 1078 01:05:55,250 --> 01:05:58,020 step 1. 1079 01:05:58,020 --> 01:06:01,360 So after I do step 1, I know how big all these tables 1080 01:06:01,360 --> 01:06:02,240 are going to be. 1081 01:06:02,240 --> 01:06:07,140 If the sum of those squares is bigger than linear, start over. 1082 01:06:07,140 --> 01:06:09,180 I need to prove that this will only 1083 01:06:09,180 --> 01:06:12,180 have to take-- this will happen an expected 1084 01:06:12,180 --> 01:06:13,690 constant number of times. 1085 01:06:13,690 --> 01:06:16,120 log n times with high probability. 1086 01:06:16,120 --> 01:06:21,290 In fact why don't we-- yeah, let's worry about that later. 1087 01:06:24,150 --> 01:06:27,690 Let me first tell you step 2.5 which 1088 01:06:27,690 --> 01:06:30,650 is I want there to be zero collisions in each 1089 01:06:30,650 --> 01:06:31,720 of these tables. 1090 01:06:31,720 --> 01:06:34,170 It's only going to happen with probability of 1/2 1091 01:06:34,170 --> 01:06:37,900 So if it doesn't happen, just try again. 1092 01:06:37,900 --> 01:06:50,160 So 2.5 is while there's some hash function h2,j that maps 2 1093 01:06:50,160 --> 01:07:02,310 keys that we're given to the same slot at the second level, 1094 01:07:02,310 --> 01:07:17,290 this is for some j and let's say ki different from ki prime. 1095 01:07:17,290 --> 01:07:20,310 But they map to the same place by the first hash function. 1096 01:07:26,350 --> 01:07:29,680 So if two keys map to the same secondary table 1097 01:07:29,680 --> 01:07:32,100 and there's a conflict, then I'm just 1098 01:07:32,100 --> 01:07:36,020 going to redo that construction. 1099 01:07:36,020 --> 01:07:40,420 So I'm going to repick h2,j. 1100 01:07:40,420 --> 01:07:42,020 h2,j was a random choice. 
1101 01:07:42,020 --> 01:07:47,230 So if I get a bad choice, I'll just try another one. 1102 01:07:47,230 --> 01:07:50,045 Just keep randomly choosing the a 1103 01:07:50,045 --> 01:07:51,910 or randomly choosing this hash function 1104 01:07:51,910 --> 01:07:55,780 until there are zero collisions in that secondary table. 1105 01:07:55,780 --> 01:07:57,920 And I'm going to do this for each table. 1106 01:07:57,920 --> 01:08:00,600 So we'll worry about how long these will take, 1107 01:08:00,600 --> 01:08:02,745 but I claim expected constant number of trials. 1108 01:08:05,560 --> 01:08:07,250 So let's do the second one first. 1109 01:08:13,040 --> 01:08:16,870 After we do this while loop there are no collisions 1110 01:08:16,870 --> 01:08:19,050 with the proper notion of the word collisions, which 1111 01:08:19,050 --> 01:08:21,750 is two different keys mapping to the same value. 1112 01:08:35,970 --> 01:08:41,470 So at this point we have guaranteed 1113 01:08:41,470 --> 01:08:43,220 that searches are constant time worst 1114 01:08:43,220 --> 01:08:48,740 case after we do all these 4 steps because we apply h1, 1115 01:08:48,740 --> 01:08:51,029 we figure out which slot we fit in. 1116 01:08:51,029 --> 01:08:53,930 Say it's slot j, then we apply h2,j 1117 01:08:53,930 --> 01:08:56,689 and if your item's in the overall table, 1118 01:08:56,689 --> 01:08:58,410 it should be in that secondary table. 1119 01:08:58,410 --> 01:09:00,243 Because there are no collisions, you can just check: 1120 01:09:00,243 --> 01:09:02,130 is that one item the one I'm looking for? 1121 01:09:02,130 --> 01:09:02,920 If so, return it. 1122 01:09:02,920 --> 01:09:04,699 If not, it's not anywhere. 1123 01:09:04,699 --> 01:09:07,829 If there are no collisions then I 1124 01:09:07,829 --> 01:09:10,120 don't need chains coming out of here because it is just 1125 01:09:10,120 --> 01:09:10,800 a single item. 
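The whole construction-- steps 1, 1.5, 2, and 2.5, plus search-- can be sketched in Python. This is a minimal sketch under assumptions, not the lecture's exact construction: it assumes distinct integer keys, stands in a multiplication-style universal family, and picks the space constant c = 4 arbitrarily.

```python
import random

def _universal(m, p=(1 << 61) - 1):
    # One random member of a universal family (illustrative choice:
    # h(k) = ((a*k + b) mod p) mod m, p prime and larger than any key).
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda k: ((a * k + b) % p) % m

def build_perfect(keys, c=4):
    """Static perfect hash table for distinct integer keys."""
    n = len(keys)
    m = max(1, n)                        # first-level size, m = Theta(n)
    # Steps 1 and 1.5: pick h1; redo until the sum of lj^2 is linear.
    while True:
        h1 = _universal(m)
        buckets = [[] for _ in range(m)]
        for k in keys:
            buckets[h1(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= c * n:
            break
    # Steps 2 and 2.5: per slot, a table of size lj^2 with no collisions.
    tables = []
    for bucket in buckets:
        size = max(1, len(bucket) ** 2)
        while True:
            h2 = _universal(size)
            slots = [None] * size
            ok = True
            for k in bucket:
                s = h2(k)
                if slots[s] is not None:   # collision: repick h2,j
                    ok = False
                    break
                slots[s] = k
            if ok:
                break
        tables.append((h2, slots))

    def search(k):
        # Worst-case constant time: two hash evaluations, one comparison.
        h2, slots = tables[h1(k)]
        return slots[h2(k)] == k
    return search

keys = list(range(0, 4000, 7))   # 572 distinct keys
search = build_perfect(keys)
assert search(7) and not search(8)
```

With c = 4 the step 1.5 retry succeeds with probability at least 1/2 per attempt by Markov's inequality, so both retry loops terminate quickly, matching the coin-flipping analysis in the lecture.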
1126 01:09:13,750 --> 01:09:16,760 The big question-- so constant worst case space 1127 01:09:16,760 --> 01:09:19,130 because 1.5 guarantees that. 1128 01:09:19,130 --> 01:09:20,964 Constant worst case time for search. 1129 01:09:20,964 --> 01:09:23,130 The big question is, how long does it take to build? 1130 01:09:23,130 --> 01:09:25,330 How many times do we have to redo 1131 01:09:25,330 --> 01:09:28,890 steps 1 and 2 before we get a decent-- before we 1132 01:09:28,890 --> 01:09:30,130 get a perfect hash table. 1133 01:09:32,979 --> 01:09:35,880 So let me remind you of the birthday 1134 01:09:35,880 --> 01:09:37,750 paradox, why it works here. 1135 01:09:54,530 --> 01:09:59,847 As mentioned earlier this is going to be a union bound. 1136 01:09:59,847 --> 01:10:01,680 We want to know the probability of collision 1137 01:10:01,680 --> 01:10:02,930 at that second level. 1138 01:10:02,930 --> 01:10:06,754 Well that's at most the sum of all possible collisions, 1139 01:10:06,754 --> 01:10:07,920 probabilities of collisions. 1140 01:10:07,920 --> 01:10:09,910 So I'm going to say the sum over all i 1141 01:10:09,910 --> 01:10:14,340 not equal to i prime of the probability. 1142 01:10:14,340 --> 01:10:16,800 Now this is over our choice of the hash function h2,j. 1143 01:10:19,848 --> 01:10:29,120 Of h2,j of ki equaling h2,j of ki prime. 1144 01:10:29,120 --> 01:10:30,970 So the union bound says, of course. 1145 01:10:30,970 --> 01:10:33,080 The probability of any of them happening-- 1146 01:10:33,080 --> 01:10:35,380 we don't know about interdependence or whatnot-- 1147 01:10:35,380 --> 01:10:39,730 but it's certainly at most the sum of each of these possible events. 1148 01:10:39,730 --> 01:10:42,150 There are a lot of possible events. 1149 01:10:42,150 --> 01:10:43,620 If there are li things, there 1150 01:10:43,620 --> 01:10:47,462 are going to be li choose 2 possible collisions 1151 01:10:47,462 --> 01:10:48,420 we have to worry about. 
1152 01:10:48,420 --> 01:10:49,836 We know i is not equal to i prime. 1153 01:10:53,360 --> 01:10:57,710 So the number of terms here is li choose 2. 1154 01:11:00,890 --> 01:11:02,115 And what's this probability? 1155 01:11:06,120 --> 01:11:07,990 STUDENT: [INAUDIBLE] 1156 01:11:07,990 --> 01:11:14,420 ERIK DEMAINE: 1/li at most because we're assuming h2,j is 1157 01:11:14,420 --> 01:11:17,880 a universal hash function so the probability of choosing-- 1158 01:11:17,880 --> 01:11:18,780 sorry? 1159 01:11:18,780 --> 01:11:19,530 li squared. 1160 01:11:19,530 --> 01:11:20,640 Thank you. 1161 01:11:20,640 --> 01:11:23,020 The size of the table. 1162 01:11:23,020 --> 01:11:27,230 1/m but m in this case, the size of our table, is li squared. 1163 01:11:27,230 --> 01:11:30,455 So the probability that these two 1164 01:11:30,455 --> 01:11:32,790 particular keys hit the same slot 1165 01:11:32,790 --> 01:11:34,740 is at most 1/li squared. 1166 01:11:34,740 --> 01:11:37,740 This is basically li squared / 2. 1167 01:11:37,740 --> 01:11:40,690 And so this is at most 1/2. 1168 01:11:40,690 --> 01:11:43,030 It's actually slightly less than li squared / 2. 1169 01:11:43,030 --> 01:11:45,274 So this is at most 1/2. 1170 01:11:45,274 --> 01:11:46,940 And this is basically a birthday paradox 1171 01:11:46,940 --> 01:11:48,375 in this particular case. 1172 01:11:48,375 --> 01:11:49,750 That means there is a probability 1173 01:11:49,750 --> 01:11:53,610 of at least a half that there are zero collisions in one 1174 01:11:53,610 --> 01:11:54,620 of these tables. 1175 01:11:54,620 --> 01:11:57,160 So that means I'm basically flipping a fair coin. 1176 01:11:57,160 --> 01:11:58,922 If I ever get a heads I'm happy. 1177 01:11:58,922 --> 01:12:00,630 Each time I get a tails I have to reflip. 1178 01:12:00,630 --> 01:12:03,060 This should sound familiar from last time. 1179 01:12:03,060 --> 01:12:14,045 So this is 2 expected trials or log n with high probability. 
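That union-bound arithmetic is easy to check directly, and a quick simulation shows the no-collision probability really is above 1/2. The simulation uses uniform random slots as a stand-in for a universal family-- an assumption for illustration, not the lecture's exact family.

```python
import random
from math import comb

# Union bound: with l keys in a table of size l^2,
# Pr[some pair collides] <= C(l, 2) * (1 / l^2) < 1/2.
for l in range(2, 100):
    assert comb(l, 2) / (l * l) < 0.5

def no_collision_rate(l, trials=3000):
    # Throw l keys into l^2 slots uniformly at random and
    # measure how often all the chosen slots are distinct.
    ok = 0
    for _ in range(trials):
        slots = [random.randrange(l * l) for _ in range(l)]
        ok += (len(set(slots)) == l)
    return ok / trials

rate = no_collision_rate(6)
assert rate > 0.5   # heads with probability at least 1/2
```

For l = 6 the exact no-collision probability works out to about 0.64, comfortably above the 1/2 the union bound promises.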
1180 01:12:20,710 --> 01:12:23,600 We've proved log n with high probability. 1181 01:12:23,600 --> 01:12:26,360 That's the same as saying the number of levels in a skip list 1182 01:12:26,360 --> 01:12:28,330 is log n with high probability. 1183 01:12:28,330 --> 01:12:30,959 How many times do I have to flip a coin before I get a heads? 1184 01:12:30,959 --> 01:12:32,000 Definitely at most log n. 1185 01:12:35,620 --> 01:12:38,530 Now we have to do this for each secondary table. 1186 01:12:38,530 --> 01:12:41,700 There are m = theta(n) secondary tables. 1187 01:12:50,110 --> 01:12:53,490 There's a slight question of how big are the secondary tables. 1188 01:12:53,490 --> 01:12:56,770 If one of these tables is like linear size, 1189 01:12:56,770 --> 01:12:59,600 then I have to spend linear time for a trial. 1190 01:12:59,600 --> 01:13:02,450 And then I multiply that by the number of trials 1191 01:13:02,450 --> 01:13:05,050 and also the number of different tables-- that would be like n 1192 01:13:05,050 --> 01:13:06,670 squared log n. 1193 01:13:06,670 --> 01:13:11,460 But you know a secondary table better not have linear size. 1194 01:13:11,460 --> 01:13:14,450 I mean an li equal to n. 1195 01:13:14,450 --> 01:13:16,850 That would be bad because then li squared is n squared 1196 01:13:16,850 --> 01:13:20,540 and we guaranteed that we had linear space. 1197 01:13:20,540 --> 01:13:25,790 So in fact you can prove with another Chernoff bound. 1198 01:13:25,790 --> 01:13:27,450 Let me put this over here. 1199 01:13:34,330 --> 01:13:36,940 That all the li's are pretty small. 1200 01:13:36,940 --> 01:13:42,730 Not constant but logarithmic. 1201 01:13:42,730 --> 01:13:50,400 So li is order log n with high probability for each i 1202 01:13:50,400 --> 01:13:51,850 and therefore for all i. 1203 01:13:51,850 --> 01:13:56,550 So I can just increase the alpha by 1 in the n to the minus alpha 1204 01:13:56,550 --> 01:14:00,654 and get that for all i this happens. 
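This is the classic balls-in-bins bound, and it's easy to see empirically. Again the simulation uses uniform random slots as a stand-in for a universal family, and the cap of 30 below is just a loose illustrative threshold, far above the true maximum load for this n.

```python
import random

def max_bucket(n):
    # Hash n keys into n slots uniformly; return the largest lj.
    counts = [0] * n
    for _ in range(n):
        counts[random.randrange(n)] += 1
    return max(counts)

n = 10000
load = max(max_bucket(n) for _ in range(5))
# Typical values are in the single digits for n = 10000,
# consistent with the Theta(log n / log log n) truth.
assert load < 30
```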
1205 01:14:00,654 --> 01:14:02,570 In fact, the right answer is log over log log, 1206 01:14:02,570 --> 01:14:04,620 if you want to do some really messy analysis. 1207 01:14:04,620 --> 01:14:08,430 But logarithmic is fine for us. 1208 01:14:08,430 --> 01:14:10,920 So what this means is we're doing 1209 01:14:10,920 --> 01:14:14,010 n different things; for each of them, 1210 01:14:14,010 --> 01:14:17,960 with high probability, li is of size log n. 1211 01:14:17,960 --> 01:14:20,470 And then maybe we'll have to do like log n trials 1212 01:14:20,470 --> 01:14:23,200 repeating until we get a good hash function there. 1213 01:14:23,200 --> 01:14:29,320 And so the total build time for steps 2 and 2.5 1214 01:14:29,320 --> 01:14:34,240 is going to be at most n times log squared n. 1215 01:14:34,240 --> 01:14:37,420 You can prove a tighter bound but it's polynomial. 1216 01:14:37,420 --> 01:14:41,200 That's all I wanted to go for and it's almost linear. 1217 01:14:41,200 --> 01:14:46,855 So I'm left with one thing to analyze which is step 1.5. 1218 01:14:46,855 --> 01:14:48,730 This to me is maybe the most surprising thing 1219 01:14:48,730 --> 01:14:50,120 that it works out. 1220 01:14:50,120 --> 01:14:53,490 I mean here we designed-- we did this li to li 1221 01:14:53,490 --> 01:14:55,480 squared thing so the birthday paradox would happen. 1222 01:14:55,480 --> 01:14:56,854 This is not surprising. 1223 01:14:56,854 --> 01:14:59,020 I mean it's a cool idea, but once you have the idea, 1224 01:14:59,020 --> 01:15:01,370 it's not surprising that it works. 1225 01:15:01,370 --> 01:15:03,370 What's a little more surprising is that squaring 1226 01:15:03,370 --> 01:15:05,670 is OK from a space perspective. 1227 01:15:05,670 --> 01:15:07,310 1.5 says we're going to have to rebuild 1228 01:15:07,310 --> 01:15:10,380 that first table until the sum of these squared lengths 1229 01:15:10,380 --> 01:15:11,470 is at most linear. 
1230 01:15:11,470 --> 01:15:13,540 I can guarantee that each of these 1231 01:15:13,540 --> 01:15:16,840 is logarithmic so the sum of the squares is at most like n log 1232 01:15:16,840 --> 01:15:17,950 squared n. 1233 01:15:17,950 --> 01:15:19,315 But I claim I can get linear. 1234 01:15:22,360 --> 01:15:25,580 Let's do that. 1235 01:15:25,580 --> 01:15:29,880 So for step 1.5 we're looking at what 1236 01:15:29,880 --> 01:15:35,410 is the probability of the sum of the lj squareds being 1237 01:15:35,410 --> 01:15:37,640 more than linear. 1238 01:15:37,640 --> 01:15:38,750 Sorry. 1239 01:15:38,750 --> 01:15:39,852 Expectation. 1240 01:15:39,852 --> 01:15:41,310 Let's first compute the expectation 1241 01:15:41,310 --> 01:15:43,710 and then we'll talk about a tail bound 1242 01:15:43,710 --> 01:15:45,420 which is the probability that we're much 1243 01:15:45,420 --> 01:15:46,980 bigger than the expectation. 1244 01:15:46,980 --> 01:15:50,280 First thing is I claim the expectation is linear. 1245 01:15:50,280 --> 01:15:56,000 So again whenever we're counting something-- 1246 01:15:56,000 --> 01:15:59,370 I mean this is basically the total number of pairs 1247 01:15:59,370 --> 01:16:02,580 of items that collide at the first level 1248 01:16:02,580 --> 01:16:05,060 with double counting. 1249 01:16:05,060 --> 01:16:08,940 So I mean if you think of lj and then I make a complete graph 1250 01:16:08,940 --> 01:16:11,910 on those lj items, that's going to have 1251 01:16:11,910 --> 01:16:14,400 like the squared number of edges, 1252 01:16:14,400 --> 01:16:16,700 if I also multiply by 2. 1253 01:16:16,700 --> 01:16:19,060 So this is the same thing as counting 1254 01:16:19,060 --> 01:16:25,180 how many pairs of items map to the same spot, the same slot. 
1255 01:16:25,180 --> 01:16:28,890 So this is going to-- and that I can write as an indicator 1256 01:16:28,890 --> 01:16:30,820 random variable which lets me use linearity 1257 01:16:30,820 --> 01:16:33,830 of expectation which makes me happy 1258 01:16:33,830 --> 01:16:36,940 because then everything is simple. 1259 01:16:36,940 --> 01:16:38,090 So I'm going to write Ii,j. 1260 01:16:41,210 --> 01:16:54,070 This is going to be 1 if h1 of ki equals h1 of kj 1261 01:16:54,070 --> 01:16:59,055 and it's going to be 0 otherwise. 1262 01:17:05,080 --> 01:17:07,080 This is the total number of pairwise colliding 1263 01:17:07,080 --> 01:17:10,520 items including i versus i. 1264 01:17:10,520 --> 01:17:14,210 And so like if li equals 1, li squared is also 1. 1265 01:17:14,210 --> 01:17:15,820 There's 1 item colliding with itself. 1266 01:17:15,820 --> 01:17:19,490 So this actually works exactly. 1267 01:17:19,490 --> 01:17:21,840 All right, with the wrong definition of colliding. 1268 01:17:21,840 --> 01:17:24,140 If you bear with me. 1269 01:17:24,140 --> 01:17:26,735 So now we can use linearity of expectation 1270 01:17:26,735 --> 01:17:28,890 and put the E in here. 1271 01:17:28,890 --> 01:17:35,400 So this is sum i equals 1 to n sum j equals 1 to n 1272 01:17:35,400 --> 01:17:39,590 of the expectation of Ii,j. 1273 01:17:39,590 --> 01:17:42,982 But we know the expectation of the Ii,j is the probability 1274 01:17:42,982 --> 01:17:45,440 of it equaling 1 because it's an indicator random variable. 1275 01:17:45,440 --> 01:17:48,582 The probability of this happening over our choice of h1 1276 01:17:48,582 --> 01:17:51,170 is at most 1/m by universality. 1277 01:17:51,170 --> 01:17:53,580 Here it actually is 1/m because we're at the first level. 1278 01:17:53,580 --> 01:17:59,520 So this is at most 1/m, where m is theta n. 1279 01:18:02,420 --> 01:18:10,930 That's when i does not equal j. So it's a little bit annoying. 
1280 01:18:10,930 --> 01:18:16,570 I do have to separate out the i equals j terms from the different 1281 01:18:16,570 --> 01:18:17,994 i not equal to j terms. 1282 01:18:17,994 --> 01:18:19,660 But there's only-- I mean it's basically 1283 01:18:19,660 --> 01:18:21,170 the diagonal of this matrix. 1284 01:18:21,170 --> 01:18:24,510 There's n things that will always collide with themselves. 1285 01:18:24,510 --> 01:18:30,690 So we're going to get like n plus the number of i 1286 01:18:30,690 --> 01:18:32,430 not equal to j pairs, double counted. 1287 01:18:32,430 --> 01:18:35,520 So it's like 2 times n choose 2. 1288 01:18:35,520 --> 01:18:38,180 But we get to divide by m. 1289 01:18:38,180 --> 01:18:40,930 So this is like n squared / n. 1290 01:18:40,930 --> 01:18:46,070 So we get order n. 1291 01:18:46,070 --> 01:18:49,130 So that's not-- well, that's cool. 1292 01:18:49,130 --> 01:18:51,073 Expected space is linear. 1293 01:18:51,073 --> 01:18:52,531 This is what makes everything work. 1294 01:18:59,410 --> 01:19:01,840 Last class was about getting with high probability bounds 1295 01:19:01,840 --> 01:19:03,070 when we're working with logs. 1296 01:19:05,610 --> 01:19:07,570 When you want to get that something 1297 01:19:07,570 --> 01:19:09,280 is log with high probability, you 1298 01:19:09,280 --> 01:19:11,690 have to use, with respect to n, you 1299 01:19:11,690 --> 01:19:13,670 have to use a Chernoff bound. 1300 01:19:13,670 --> 01:19:17,290 But this is about-- now I want to show that the space is 1301 01:19:17,290 --> 01:19:19,090 linear with high probability. 1302 01:19:19,090 --> 01:19:20,630 Linear is actually really easy. 1303 01:19:20,630 --> 01:19:24,560 You can use a much weaker bound called Markov's inequality. 
1304 01:19:24,560 --> 01:19:36,200 So I want to claim that the probability, over the choice of h1, of this thing-- 1305 01:19:36,200 --> 01:19:39,980 the sum of the lj squareds-- being bigger than some constant times 1306 01:19:39,980 --> 01:19:49,624 n is at most the expectation of that thing divided by cn. 1307 01:19:49,624 --> 01:19:50,790 This is Markov's inequality. 1308 01:19:50,790 --> 01:19:52,710 It holds for anything here. 1309 01:19:52,710 --> 01:19:54,320 So I'm just repeating it over here. 1310 01:19:58,550 --> 01:20:05,170 So this is nice because we know that this expectation is 1311 01:20:05,170 --> 01:20:06,970 linear. 1312 01:20:06,970 --> 01:20:12,000 So we're getting like a linear function divided by cn. 1313 01:20:12,000 --> 01:20:14,100 Remember we get to choose c. 1314 01:20:14,100 --> 01:20:16,500 The step said if it's bigger than some constant times n 1315 01:20:16,500 --> 01:20:18,050 then we're redoing the thing. 1316 01:20:18,050 --> 01:20:20,270 So I can choose c to be 100, whatever. 1317 01:20:20,270 --> 01:20:23,870 I'm going to choose it to be twice this constant. 1318 01:20:23,870 --> 01:20:27,460 And then this is at most half. 1319 01:20:27,460 --> 01:20:29,600 So the probability of my space being too big 1320 01:20:29,600 --> 01:20:30,670 is at most a half. 1321 01:20:30,670 --> 01:20:31,980 We're back to coin flipping. 1322 01:20:31,980 --> 01:20:33,590 Every time I flip a coin, if I get 1323 01:20:33,590 --> 01:20:40,550 heads I have the right amount of space at less than c times n 1324 01:20:40,550 --> 01:20:42,100 space. 1325 01:20:42,100 --> 01:20:43,650 If I get a tails I try again. 1326 01:20:43,650 --> 01:20:50,000 So the expected number of trials is at most 2-- 1327 01:20:50,000 --> 01:20:53,710 not trails, trials. 1328 01:20:53,710 --> 01:20:57,605 And it's also log n trials with high probability. 1329 01:21:01,510 --> 01:21:03,480 How much time do I spend for each trial? 1330 01:21:03,480 --> 01:21:04,186 Linear time. 
1331 01:21:04,186 --> 01:21:05,310 I choose one hash function. 1332 01:21:05,310 --> 01:21:06,850 I hash all the items. 1333 01:21:06,850 --> 01:21:10,120 I count the collisions, or the sum of the lj 1334 01:21:10,120 --> 01:21:10,620 squareds. 1335 01:21:10,620 --> 01:21:12,420 That takes linear time to do. 1336 01:21:12,420 --> 01:21:16,150 And so the total work I'm doing for these steps is n log n. 1337 01:21:20,920 --> 01:21:23,710 So n log n to do steps 1 and 1.5, 1338 01:21:23,710 --> 01:21:27,000 and n log squared n to do steps 2 and 2.5. 1339 01:21:27,000 --> 01:21:30,940 Overall n polylog, or polynomial time. 1340 01:21:30,940 --> 01:21:35,190 And we get guaranteed no collisions for static data. 1341 01:21:35,190 --> 01:21:39,714 Constant worst case search and linear worst case space. 1342 01:21:39,714 --> 01:21:41,630 This is kind of surprising that this works out 1343 01:21:41,630 --> 01:21:43,049 but everything's nice. 1344 01:21:47,780 --> 01:21:49,910 Now you know hashing.
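The expected-space calculation that makes step 1.5 work can be sanity checked numerically. The formula below, n + n(n-1)/m, is exactly the n diagonal terms plus the 2 times (n choose 2) off-diagonal terms, each contributing 1/m; the simulation uses uniform hashing as a stand-in for a universal family, which is an assumption for illustration only.

```python
import random

def sum_sq(n, m):
    # Sum over slots j of lj^2, for n keys hashed uniformly into m slots.
    counts = [0] * m
    for _ in range(n):
        counts[random.randrange(m)] += 1
    return sum(c * c for c in counts)

n = 2000
trials = 50
avg = sum(sum_sq(n, n) for _ in range(trials)) / trials
expected = n + n * (n - 1) / n   # equals 2n - 1 when m = n
# The empirical average lands close to 2n, i.e. order n, never n^2.
assert abs(avg - expected) < 0.25 * n
```

So with m = Theta(n) the squared bucket sizes sum to about 2n in expectation, which is exactly why Markov's inequality with c around 4 makes the step 1.5 retry succeed with constant probability.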