1
00:00:09,000 --> 00:00:10,000
Hashing.

2
00:00:15,000 --> 00:00:19,000
Today we're going to do some
amazing stuff with hashing.

3
00:00:19,000 --> 00:00:21,000
And, really,
this is such neat stuff,

4
00:00:21,000 --> 00:00:24,000
it's amazing.
We're going to start by

5
00:00:24,000 --> 00:00:28,000
addressing a fundamental
weakness of hashing.

6
00:00:34,000 --> 00:00:37,000
And that is that for any choice
of hash function

7
00:00:49,000 --> 00:01:04,000
There exists a bad set of keys
that all hash to the same slot.

8
00:01:09,000 --> 00:01:11,000
OK.
So you pick a hash function.

9
00:01:11,000 --> 00:01:15,000
We looked at some that seem to
work well in practice,

10
00:01:15,000 --> 00:01:18,000
that are easy to put into your
code.

11
00:01:18,000 --> 00:01:23,000
But whichever one you pick,
there's always some bad set of

12
00:01:23,000 --> 00:01:25,000
keys.
So you can imagine,

13
00:01:25,000 --> 00:01:30,000
just to drive this point home a
little bit.

14
00:01:30,000 --> 00:01:35,000
Imagine that you're building a
compiler for a customer and you

15
00:01:35,000 --> 00:01:40,000
have a symbol table in your
compiler and one of the things

16
00:01:40,000 --> 00:01:46,000
that the customer is demanding
is that compilations go fast.

17
00:01:46,000 --> 00:01:50,000
They don't want to sit around
waiting for compilations.

18
00:01:50,000 --> 00:01:56,000
And you have a competitor who's
also building a compiler and

19
00:01:56,000 --> 00:02:01,000
they're going to test the
compiler, both of your compilers

20
00:02:01,000 --> 00:02:07,000
and sort of have a run-off.
And one of the things in the

21
00:02:07,000 --> 00:02:12,000
test that they're going to allow
you to do is not only will the

22
00:02:12,000 --> 00:02:16,000
customer run his own benchmarks,
but he'll let you make up

23
00:02:16,000 --> 00:02:20,000
benchmarks for the other
program, for your competitor.

24
00:02:20,000 --> 00:02:24,000
And your competitor gets to
make up benchmarks for you.

25
00:02:24,000 --> 00:02:28,000
So and not only that,
but you're actually sharing

26
00:02:28,000 --> 00:02:32,000
code.
So you get to look at what the

27
00:02:32,000 --> 00:02:37,000
competitor is actually doing and
what hash function they're

28
00:02:37,000 --> 00:02:40,000
actually using.
So it's pretty clear that in

29
00:02:40,000 --> 00:02:44,000
this circumstance,
you have an adversary who is

30
00:02:44,000 --> 00:02:49,000
going to look at whatever hash
function you have and figure out

31
00:02:49,000 --> 00:02:53,000
OK, what's a set of variable
names and so forth that are

32
00:02:53,000 --> 00:02:58,000
going to all hash to the same
slot so that essentially you're

33
00:02:58,000 --> 00:03:03,000
just chasing through a linked
list whenever it comes to

34
00:03:03,000 --> 00:03:07,000
looking something up.
Slowing down your program

35
00:03:07,000 --> 00:03:12,000
enormously compared to if in
fact they got distributed nicely

36
00:03:12,000 --> 00:03:15,000
across the hash table which is,
after all, what

37
00:03:15,000 --> 00:03:19,000
you have a hash table in there
to do in the first place.

38
00:03:19,000 --> 00:03:22,000
And so the question is,
how do you defeat this

39
00:03:22,000 --> 00:03:26,000
adversary?
And the answer is one word.

40
00:03:31,000 --> 00:03:33,000
One word.
How do you achieve?

41
00:03:33,000 --> 00:03:37,000
How do you defeat any adversary
in this class?

42
00:03:37,000 --> 00:03:38,000
Randomness.
OK.

43
00:03:38,000 --> 00:03:39,000
Randomness.
OK.

44
00:03:39,000 --> 00:03:42,000
You make it so that he can't
guess.

45
00:03:42,000 --> 00:03:47,000
And the idea is that you choose
a hash function at random.

46
00:03:47,000 --> 00:03:50,000
Independent.
So he can look at the code,

47
00:03:50,000 --> 00:03:55,000
but when it actually runs,
it's going to use a random hash

48
00:03:55,000 --> 00:04:00,000
function that he has no way of
predicting what the hash

49
00:04:00,000 --> 00:04:05,000
function is that will actually
be used.

50
00:04:05,000 --> 00:04:07,000
OK.
So that's the game and that way

51
00:04:07,000 --> 00:04:11,000
he can provide an input,
but he can't provide an input

52
00:04:11,000 --> 00:04:15,000
that's guaranteed to force you
to run slowly.

53
00:04:15,000 --> 00:04:19,000
You might get unlucky in your
choice of hash function,

54
00:04:19,000 --> 00:04:23,000
but it's not going to be
because of the adversary.

55
00:04:23,000 --> 00:04:28,000
So the idea is to choose a hash
function --

56
00:04:34,000 --> 00:04:38,000
-- at random,
independently from the keys

57
00:04:38,000 --> 00:04:42,000
that you're, that are going to
be fed to it.

58
00:04:42,000 --> 00:04:47,000
So even if your adversary can
see your code,

59
00:04:47,000 --> 00:04:53,000
he can't tell which hash
function is going to be actually

60
00:04:53,000 --> 00:04:58,000
used at run time.
Doesn't get to see the output

61
00:04:58,000 --> 00:05:04,000
of the random numbers.
And so it turns out you can

62
00:05:04,000 --> 00:05:11,000
make this scheme work and the
name of the scheme is universal

63
00:05:11,000 --> 00:05:17,000
hashing, OK, is one way of
making this scheme work.

64
00:05:22,000 --> 00:05:34,000
So let's do some math.
So let U be a universe of keys.

65
00:05:34,000 --> 00:05:41,000
And let H be a finite
collection --

66
00:05:48,000 --> 00:05:49,000
-- of hash functions --

67
00:05:56,000 --> 00:06:04,000
-- mapping U to what are going
to be the slots in our hash

68
00:06:04,000 --> 00:06:06,000
table.
OK.

69
00:06:06,000 --> 00:06:11,000
So we just have H as some
finite collection.

70
00:06:11,000 --> 00:06:15,000
We say that H is universal --

71
00:06:22,000 --> 00:06:30,000
-- if for all pairs of the
keys, distinct keys --

72
00:06:36,000 --> 00:06:41,000
-- so the keys are distinct,
the following is true.

73
00:07:03,000 --> 00:07:08,000
So if the set of keys,
if for any pair of keys I pick,

74
00:07:08,000 --> 00:07:15,000
the number of hash functions
that hash those two keys to the

75
00:07:15,000 --> 00:07:21,000
same place is a one over m
fraction of the total set of

76
00:07:21,000 --> 00:07:23,000
hash functions.
So let m just,

77
00:07:23,000 --> 00:07:28,000
so to view that,
another way of viewing that is

78
00:07:28,000 --> 00:07:33,000
if h is chosen randomly --

79
00:07:39,000 --> 00:07:51,000
-- from the set of hash functions H,
the probability of collision

80
00:07:51,000 --> 00:07:58,000
between x and y is what?

81
00:08:12,000 --> 00:08:17,000
What's the probability if the
fraction of hash functions,

82
00:08:17,000 --> 00:08:22,000
OK, if the number of such hash
functions is the size of H over m,

83
00:08:22,000 --> 00:08:27,000
what's the probability of a
collision between x and y?

84
00:08:27,000 --> 00:08:32,000
If I pick a hash function at
random.

85
00:08:32,000 --> 00:08:39,000
So I pick a hash function at
random, what's the odds they

86
00:08:39,000 --> 00:08:42,000
collide?
One over m.

87
00:08:42,000 --> 00:08:49,000
Now let's draw a picture for
that, help people see that

88
00:08:49,000 --> 00:08:56,000
that's in fact the case.
So imagine this is our set of

89
00:08:56,000 --> 00:09:00,000
all hash functions.
OK.

90
00:09:00,000 --> 00:09:08,000
And then if I pick a particular
x and y, let's say that this is

91
00:09:08,000 --> 00:09:16,000
the set of hash functions such
that h of x is equal to h of y.

92
00:09:16,000 --> 00:09:23,000
And so what we're saying is
that the cardinality of that set

93
00:09:23,000 --> 00:09:30,000
is one over m times the
cardinality of H.
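Written out, the definition being described here is presumably the following. This is a reconstruction from the spoken description, with x and y any two distinct keys in U; the exact board notation is an assumption.

```latex
\[
\bigl|\{\, h \in H : h(x) = h(y) \,\}\bigr| = \frac{|H|}{m}
\quad\Longrightarrow\quad
\Pr_{h \in H}\bigl[\, h(x) = h(y) \,\bigr] = \frac{1}{m}.
\]
```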

94
00:09:30,000 --> 00:09:33,000
So if I throw a dart and pick
one hash function at random,

95
00:09:33,000 --> 00:09:37,000
the odds are one in m that the
hash function falls into this

96
00:09:37,000 --> 00:09:39,000
particular set.
And of course,

97
00:09:39,000 --> 00:09:43,000
this has to be true of every x
and y that I can pick.

98
00:09:43,000 --> 00:09:45,000
Of course, it will be a
different set,

99
00:09:45,000 --> 00:09:49,000
a different x and y will
somehow map the hash functions

100
00:09:49,000 --> 00:09:52,000
differently, but the odds that
for any x and y that I pick,

101
00:09:52,000 --> 00:09:55,000
the odds that if I have a
random hash function,

102
00:09:55,000 --> 00:10:00,000
it hashes it to the same place,
is one over m.

103
00:10:00,000 --> 00:10:03,000
Now this is a little bit hard
sometimes for people to get

104
00:10:03,000 --> 00:10:07,000
their head around because we're
used to thinking of perhaps

105
00:10:07,000 --> 00:10:09,000
picking keys at random or
something.

106
00:10:09,000 --> 00:10:11,000
OK, that's not what's going on
here.

107
00:10:11,000 --> 00:10:14,000
We're picking hash functions at
random.

108
00:10:14,000 --> 00:10:18,000
So our probability space is
defined over the hash functions,

109
00:10:18,000 --> 00:10:21,000
not over the keys.
And this has to be true now for

110
00:10:21,000 --> 00:10:24,000
any particular two keys that I
pick that are distinct.

111
00:10:24,000 --> 00:10:28,000
That the places that they hash,
this set of hash functions,

112
00:10:28,000 --> 00:10:34,000
I mean this is like a marvelous
property if you think about it.

113
00:10:34,000 --> 00:10:39,000
OK, that you can actually find
ones where no matter what two

114
00:10:39,000 --> 00:10:43,000
elements I pick,
the odds are exactly one in m

115
00:10:43,000 --> 00:10:48,000
that a random hash function from
this set is going to hash them

116
00:10:48,000 --> 00:10:51,000
to the same place.
So very neat.

117
00:10:51,000 --> 00:10:56,000
Very, very neat property and
we'll see the mathematics

118
00:10:56,000 --> 00:11:00,000
associated with this is very
cool.

119
00:11:00,000 --> 00:11:14,000
So our theorem is that if we
choose h randomly from the set

120
00:11:14,000 --> 00:11:25,000
of hash functions H,
and then we suppose we're

121
00:11:25,000 --> 00:11:37,000
hashing n keys into m slots in
Table T --

122
00:11:44,000 --> 00:11:46,000
-- then for given key x --

123
00:11:52,000 --> 00:11:56,000
-- the expected number of
collisions with x --

124
00:12:03,000 --> 00:12:12,000
-- is less than n over m.
And who remembers what we call

125
00:12:12,000 --> 00:12:16,000
n over m?
Alpha, which is the,

126
00:12:16,000 --> 00:12:22,000
what's the term that we use
there?

127
00:12:22,000 --> 00:12:30,000
Load factor.
The load factor of the table.

128
00:12:30,000 --> 00:12:36,000
OK, load factor alpha.
So the average number of keys

129
00:12:36,000 --> 00:12:42,000
per slot is the load factor of
the table.

130
00:12:42,000 --> 00:12:48,000
So we're saying,
so what is this theorem saying?

131
00:12:48,000 --> 00:12:55,000
It's saying that in fact,
if we have one of these

132
00:12:55,000 --> 00:13:02,000
universal sets of hash
functions, then things perform

133
00:13:02,000 --> 00:13:10,000
exactly the way we want them to.
Things get distributed evenly.

134
00:13:10,000 --> 00:13:15,000
The number of things that are
going to collide with any

135
00:13:15,000 --> 00:13:19,000
particular key that I pick is
going to be n over m.

136
00:13:19,000 --> 00:13:22,000
So that's a really good
property to have.

137
00:13:22,000 --> 00:13:27,000
Now I haven't shown you,
the construction of U is going,

138
00:13:27,000 --> 00:13:31,000
sorry, of the set of hash
functions H, that that

139
00:13:31,000 --> 00:13:36,000
construction will take us a
little bit of effort.

140
00:13:36,000 --> 00:13:39,000
But first I want to show you
why this is such a great

141
00:13:39,000 --> 00:13:42,000
property.
Basically it's this theorem.

142
00:13:42,000 --> 00:13:46,000
So let's prove this theorem.
So any questions about what the

143
00:13:46,000 --> 00:13:50,000
statement of the theorem is?
So we're going to go actually

144
00:13:50,000 --> 00:13:54,000
kind of fast today.
We've got a lot of good stuff

145
00:13:54,000 --> 00:13:57,000
today.
So I want to make sure people

146
00:13:57,000 --> 00:14:03,000
are onboard as we go through.
So if there are any questions,

147
00:14:03,000 --> 00:14:07,000
make sure, you know,
statement of the theorem or

148
00:14:07,000 --> 00:14:13,000
whatever, best to get them out
early so that you're not

149
00:14:13,000 --> 00:14:19,000
confused later on when the going
gets a little more exciting.

150
00:14:19,000 --> 00:14:21,000
OK?
OK, good.

151
00:14:21,000 --> 00:14:26,000
So to prove this,
let's let C sub x be the random

152
00:14:26,000 --> 00:14:33,000
variable denoting the total
number of collisions --

153
00:14:38,000 --> 00:14:44,000
-- of keys in T with x.
So this is a total number and

154
00:14:44,000 --> 00:14:51,000
one of the techniques that you
use a lot in probabilistic

155
00:14:51,000 --> 00:14:57,000
analysis of randomized
algorithms is recognizing that C

156
00:14:57,000 --> 00:15:05,000
sub x is in fact a sum of
indicator random variables.

157
00:15:05,000 --> 00:15:11,000
If you can decompose things
into indicator random variables,

158
00:15:11,000 --> 00:15:17,000
the analysis goes much more
easily than if you're left with

159
00:15:17,000 --> 00:15:22,000
aggregate variables.
So here we're going to let our

160
00:15:22,000 --> 00:15:27,000
indicator random variable be
little c_xy,

161
00:15:27,000 --> 00:15:32,000
which is going to be one if h
of x equals h of y and 0

162
00:15:32,000 --> 00:15:35,000
otherwise.

163
00:15:40,000 --> 00:15:49,000
And so we can note two things.
First, what is the expectation

164
00:15:49,000 --> 00:15:52,000
of c_xy?

165
00:15:57,000 --> 00:16:00,000
OK, if I have a process which
is picking a hash function at

166
00:16:00,000 --> 00:16:04,000
random, what's the expectation
of c_xy?

167
00:16:04,000 --> 00:16:07,000
One over m.
Because that's basically this

168
00:16:07,000 --> 00:16:11,000
definition here.
Now in other words I pick a

169
00:16:11,000 --> 00:16:16,000
hash function at random,
what's the odds that the hash

170
00:16:16,000 --> 00:16:19,000
is the same?
It's one over m.

171
00:16:19,000 --> 00:16:24,000
And then the other thing is,
and the reason we pick this

172
00:16:24,000 --> 00:16:28,000
thing is that I can express
capital C sub x,

173
00:16:28,000 --> 00:16:33,000
the random variable denoting
the total number of collisions

174
00:16:33,000 --> 00:16:39,000
as being just the sum over all
the keys in the table except x

175
00:16:39,000 --> 00:16:46,000
of c_xy.
So for each one that would

176
00:16:46,000 --> 00:16:53,000
cause me a collision,
with x, I add one and if it

177
00:16:53,000 --> 00:17:00,000
wouldn't cause me a collision,
I add 0.

178
00:17:00,000 --> 00:17:06,000
And that adds up all of the
collisions that I would have in

179
00:17:06,000 --> 00:17:09,000
the table with x.

180
00:17:17,000 --> 00:17:20,000
Are there any questions so far?
Because this is the set-up.

181
00:17:20,000 --> 00:17:24,000
The set-up in most of these
things, the set-up is where most

182
00:17:24,000 --> 00:17:27,000
students make mistakes and most
practicing researchers make

183
00:17:27,000 --> 00:17:30,000
mistakes as well,
let me tell you.

184
00:17:30,000 --> 00:17:32,000
And then once you get the
set-up right,

185
00:17:32,000 --> 00:17:36,000
then working out the math is
fine, but it's often that set-up

186
00:17:36,000 --> 00:17:40,000
of how do you actually translate
the situation into the math.

187
00:17:40,000 --> 00:17:43,000
That's the hard part.
Once you get that right,

188
00:17:43,000 --> 00:17:46,000
well, then, algebra,
we can all do algebra.

189
00:17:46,000 --> 00:17:49,000
Of course, we can also all make
mistakes doing algebra,

190
00:17:49,000 --> 00:17:53,000
but at least those mistakes are
much more easy to check than the

191
00:17:53,000 --> 00:17:57,000
one that does the translation.
So I want to make sure people

192
00:17:57,000 --> 00:18:00,000
are sort of understanding of how
that's set up.

193
00:18:00,000 --> 00:18:05,000
So now we just have to use our
math skills.

194
00:18:05,000 --> 00:18:12,000
So the expectation then of the
number of collisions is the

195
00:18:12,000 --> 00:18:18,000
expectation of C sub x and
that's just the expectation of

196
00:18:18,000 --> 00:18:26,000
just plugging in the sum over y in T
minus the element x of c_xy.

197
00:18:26,000 --> 00:18:33,000
So that's just definition.
And that's equal to the sum of

198
00:18:33,000 --> 00:18:39,000
y in T minus x of the expectation
of c_xy.

199
00:18:39,000 --> 00:18:44,000
So why is that?
Yeah, that's linearity.

200
00:18:52,000 --> 00:18:56,000
Linearity of expectation,
doesn't require independence.

201
00:18:56,000 --> 00:19:00,000
It's true of all random
variables.

202
00:19:00,000 --> 00:19:07,000
And that's equal to,
and now the math gets easier.

203
00:19:07,000 --> 00:19:10,000
So what is that?
One over m.

204
00:19:10,000 --> 00:19:16,000
That makes the summation easy
to evaluate.

205
00:19:16,000 --> 00:19:22,000
That's just n minus one over m.
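As a sanity check on this bound, here is a minimal sketch that computes the expectation exactly for a tiny case. It assumes the family of all functions from a 4-key universe to 3 slots, which is universal but is not the construction built later in the lecture; the names `all_functions` and `expected_collisions` are made up for illustration.

```python
from fractions import Fraction
from itertools import product


def all_functions(universe, m):
    """Return every function from `universe` to slots 0..m-1, each as a dict."""
    return [dict(zip(universe, slots))
            for slots in product(range(m), repeat=len(universe))]


def expected_collisions(family, x, table_keys):
    """Average over h in `family` of the number of keys y != x with h(y) == h(x)."""
    total = sum(
        sum(1 for y in table_keys if y != x and h[y] == h[x])
        for h in family
    )
    return Fraction(total, len(family))


universe = ["a", "b", "c", "d"]   # n = 4 keys, all stored in the table
m = 3                             # number of slots
family = all_functions(universe, m)
result = expected_collisions(family, "a", universe)
print(result)  # should equal (n - 1) / m
```

Because the average is taken over the hash functions and not over the keys, the same value comes out whichever key plays the role of x.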

206
00:19:30,000 --> 00:19:35,000
So fairly simple analysis and
shows you why we would love to

207
00:19:35,000 --> 00:19:41,000
have one of these sets of
universal hash functions because

208
00:19:41,000 --> 00:19:45,000
if you have them,
then they behave exactly the

209
00:19:45,000 --> 00:19:51,000
way you would want it to behave.
And you defeat your adversary

210
00:19:51,000 --> 00:19:55,000
by just picking the hash
function at random.

211
00:19:55,000 --> 00:20:00,000
There's nothing he can do.
Or she.

212
00:20:00,000 --> 00:20:02,000
OK, any questions about that
proof?

213
00:20:02,000 --> 00:20:04,000
OK, now we get into the fun
math.

214
00:20:04,000 --> 00:20:07,000
Constructing one of these
babies.

215
00:20:07,000 --> 00:20:08,000
OK.

216
00:20:20,000 --> 00:20:23,000
This is not the only
construction.

217
00:20:23,000 --> 00:20:31,000
This is a construction of a
classic universal hash function.

218
00:20:31,000 --> 00:20:37,000
And there are other
constructions in the literature

219
00:20:37,000 --> 00:20:42,000
and I think there's one on the
practice quiz.

220
00:20:42,000 --> 00:20:47,000
So let's see.
So this one works when m is

221
00:20:47,000 --> 00:20:51,000
prime.
So it works when the set of

222
00:20:51,000 --> 00:20:57,000
slots is a prime number.
Number of slots is a prime

223
00:20:57,000 --> 00:21:05,000
number.
So the idea here is we're going

224
00:21:05,000 --> 00:21:16,000
to decompose any key k in our
universe into r plus 1 digits.

225
00:21:16,000 --> 00:21:25,000
So k, we're going to look at as
being k_0, k_1, up to

226
00:21:25,000 --> 00:21:33,000
k_r where 0 is less than or
equal to k sub i,

227
00:21:33,000 --> 00:21:41,000
is less than or equal to m
minus one.

228
00:21:41,000 --> 00:21:47,000
So the idea is in some sense
we're looking at what the

229
00:21:47,000 --> 00:21:52,000
representation would be of k
base m.

230
00:21:52,000 --> 00:21:58,000
So if it were base two,
it would be just one bit at a

231
00:21:58,000 --> 00:22:01,000
time.
These would just be the bits.

232
00:22:01,000 --> 00:22:05,000
I'm not going to do base two.
We're going to do base m in

233
00:22:05,000 --> 00:22:09,000
general and so each of these
represents one of the digits.

234
00:22:09,000 --> 00:22:13,000
And the way I've done it is
I've done low order digit first.

235
00:22:13,000 --> 00:22:16,000
It actually doesn't matter.
We're not actually going to

236
00:22:16,000 --> 00:22:20,000
care really about what the order
is, but basically we're just

237
00:22:20,000 --> 00:22:24,000
looking at busting it into a
tuple represented by each of

238
00:22:24,000 --> 00:22:27,000
those digits.
So one algorithm for computing

239
00:22:27,000 --> 00:22:31,000
this out of k is take the
remainder mod m.

240
00:22:31,000 --> 00:22:34,000
That's the low order one.
OK, take what's left.

241
00:22:34,000 --> 00:22:37,000
Take the remainder of that mod
m.

242
00:22:37,000 --> 00:22:39,000
Take whatever's left,
etc.

243
00:22:39,000 --> 00:22:42,000
So you're familiar with the
conversion to a base

244
00:22:42,000 --> 00:22:46,000
representation.
That's exactly how we're

245
00:22:46,000 --> 00:22:49,000
getting this representation.
So we treat,

246
00:22:49,000 --> 00:22:53,000
this is just a question of
taking the data that we've got

247
00:22:53,000 --> 00:22:57,000
and treating it as an r plus one
digit base m number.

248
00:22:57,000 --> 00:23:02,000
And now we invoke our
randomized strategy.

249
00:23:02,000 --> 00:23:05,000
The randomized strategy is
going to be able to have a class

250
00:23:05,000 --> 00:23:09,000
of hash functions that's
dependent essentially on random

251
00:23:09,000 --> 00:23:11,000
numbers.
And the random numbers we're

252
00:23:11,000 --> 00:23:15,000
going to pick is we're going to
pick an a at random --

253
00:23:28,000 --> 00:23:33,000
-- which we're also going to
look at as a base m number.

254
00:23:33,000 --> 00:23:38,000
Each a_i is chosen randomly
--

255
00:23:49,000 --> 00:23:50,000
-- from --

256
00:23:55,000 --> 00:23:58,000
-- 0 to m minus one.
So one of our,

257
00:23:58,000 --> 00:24:03,000
it's a random if you will,
it's a random base m digit.

258
00:24:03,000 --> 00:24:06,000
Random base m digit.
So each one of these is picked

259
00:24:06,000 --> 00:24:09,000
at random.
And for each one we,

260
00:24:09,000 --> 00:24:13,000
possible value of a,
we're going to get a different

261
00:24:13,000 --> 00:24:16,000
hash function.
So we're going to index our

262
00:24:16,000 --> 00:24:19,000
hash functions by this random
number.

263
00:24:19,000 --> 00:24:23,000
So this is where the randomness
is going to come in.

264
00:24:23,000 --> 00:24:28,000
Everybody with me?
And here's the hash function.

265
00:24:56,000 --> 00:25:06,000
So what we do is we dot product
this vector with this vector and

266
00:25:06,000 --> 00:25:11,000
take the result,
mod m.

267
00:25:11,000 --> 00:25:18,000
So each digit of k of our key
gets multiplied by a random

268
00:25:18,000 --> 00:25:25,000
other digit.
We add all those up and we take

269
00:25:25,000 --> 00:25:29,000
that mod m.
So that's a dot product

270
00:25:29,000 --> 00:25:34,000
operator.
And this is what we're going to

271
00:25:34,000 --> 00:25:37,000
show is universal,
that this set of h sub a,

272
00:25:37,000 --> 00:25:39,000
where I look over that whole
set.
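A minimal sketch of the hash function just described, assuming non-negative integer keys that fit in r + 1 base-m digits. The helper names `digits_base_m`, `make_hash`, and `random_hash` are hypothetical, not from the lecture.

```python
import random


def digits_base_m(k, m, r):
    """Return the r + 1 base-m digits of k, low-order digit first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)   # remainder mod m is the next digit
        k //= m                # continue with what's left
    if k != 0:
        raise ValueError("key does not fit in r + 1 base-m digits")
    return digits


def make_hash(a, m):
    """Build h_a: dot product of a with the key's digits, taken mod m."""
    r = len(a) - 1

    def h(k):
        return sum(ai * ki for ai, ki in zip(a, digits_base_m(k, m, r))) % m

    return h


def random_hash(m, r, rng=random):
    """Pick each a_i uniformly from 0..m-1 and return the resulting h_a."""
    a = [rng.randrange(m) for _ in range(r + 1)]
    return make_hash(a, m)


h = make_hash([2, 3], 7)
print(h(10))  # 10 is (3, 1) in base 7, so (2*3 + 3*1) mod 7 = 2
```

In use, `random_hash(m, r)` would be called once when the table is created, so the choice of a is made at run time and is independent of the keys.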

273
00:25:39,000 --> 00:25:44,000
So one of the things we need to
know is how big is the set of

274
00:25:44,000 --> 00:25:46,000
hash functions here.

275
00:25:59,000 --> 00:26:01,000
So how big is this set of hash
functions?

276
00:26:01,000 --> 00:26:07,000
How many different hash
functions do I have in this set?

277
00:26:24,000 --> 00:26:31,000
It's basic 6.042 material.
It's basically how many vectors

278
00:26:31,000 --> 00:26:38,000
of length r plus one where each
element of the vector is a

279
00:26:38,000 --> 00:26:45,000
number from 0 to m minus one,
has m different values.

280
00:26:45,000 --> 00:26:50,000
So how many?
m minus one to the r.

281
00:26:50,000 --> 00:26:51,000
No.
Close.

282
00:26:51,000 --> 00:26:56,000
It's up there.
It's a big number.

283
00:26:56,000 --> 00:27:01,000
m to the r plus one.
Good.

284
00:27:01,000 --> 00:27:06,000
It's m, so the size of H is
equal to m to the r plus one.

285
00:27:06,000 --> 00:27:10,000
So we're going to want to
remember that.

286
00:27:10,000 --> 00:27:13,000
OK, so let's just understand
why that is.

287
00:27:13,000 --> 00:27:17,000
I have m choices for the first
value of a.

288
00:27:17,000 --> 00:27:19,000
m for the second,
etc.

289
00:27:19,000 --> 00:27:23,000
m for the r th.
And since there are r plus one

290
00:27:23,000 --> 00:27:28,000
things here, for each choice
here, I have this many same

291
00:27:28,000 --> 00:27:34,000
number of choices here,
so it's a product.

292
00:27:34,000 --> 00:27:39,000
OK, so this is the product rule
in counting.

293
00:27:39,000 --> 00:27:45,000
So if you haven't reviewed your
6.042 notes for counting,

294
00:27:45,000 --> 00:27:52,000
this is going to be a good idea
to go back and review that

295
00:27:52,000 --> 00:27:57,000
because we're doing stuff of
that nature.

296
00:27:57,000 --> 00:28:01,000
This is just the product rule.
Good.

297
00:28:01,000 --> 00:28:10,000
So then the theorem we want to
prove is that H is universal.
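The claim can be checked by brute force for small parameters. This is a sketch only, assuming m = 5 and r = 1 and using hypothetical helper names; it counts, for every pair of distinct keys, how many vectors a make them collide.

```python
from itertools import combinations, product


def digits_base_m(k, m, r):
    """Return the r + 1 base-m digits of k, low-order digit first."""
    digits = []
    for _ in range(r + 1):
        digits.append(k % m)
        k //= m
    return digits


def colliding_vectors(x, y, m, r):
    """Count the vectors a in {0..m-1}^(r+1) for which h_a(x) == h_a(y)."""
    xs, ys = digits_base_m(x, m, r), digits_base_m(y, m, r)
    return sum(
        1
        for a in product(range(m), repeat=r + 1)
        if sum(ai * (xi - yi) for ai, xi, yi in zip(a, xs, ys)) % m == 0
    )


m, r = 5, 1                      # m must be prime for this construction
family_size = m ** (r + 1)       # product rule: m choices for each of r + 1 digits
universe = range(m ** (r + 1))   # every key with r + 1 base-m digits
counts = {colliding_vectors(x, y, m, r) for x, y in combinations(universe, 2)}
print(counts, family_size // m)  # every pair should give |H| / m
```

Rerunning with a composite m such as 4 gives more than one count, which illustrates why the construction asks for m to be prime.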

298
00:28:10,000 --> 00:28:14,000
And this is going to involve a
little bit of number theory,

299
00:28:14,000 --> 00:28:19,000
so it gets kind of interesting.
And it's a non-trivial proof,

300
00:28:19,000 --> 00:28:23,000
so this is where if there's any
questions as I'm going along,

301
00:28:23,000 --> 00:28:28,000
please ask because the argument
is not as simple as other

302
00:28:28,000 --> 00:28:33,000
arguments we've seen so far.
OK, not that the ones we've seen so

303
00:28:33,000 --> 00:28:38,000
far have been simple,
but this is definitely a more

304
00:28:38,000 --> 00:28:43,000
involved mathematical argument.
So here's a proof.

305
00:28:43,000 --> 00:28:46,000
So let's let,
so we have two keys.

306
00:28:46,000 --> 00:28:50,000
What are we trying to show if
it's universal,

307
00:28:50,000 --> 00:28:55,000
that if I pick any two keys,
the number of hash functions

308
00:28:55,000 --> 00:29:01,000
for which they hash to the same
thing is the size of set of hash

309
00:29:01,000 --> 00:29:08,000
functions divided by m.
OK, so I'm going to look at two

310
00:29:08,000 --> 00:29:11,000
keys.
So let's pick two keys

311
00:29:11,000 --> 00:29:16,000
arbitrarily.
So x, and we'll decompose it

312
00:29:16,000 --> 00:29:23,000
into our base m representation
and y, y_0, y_1 --

313
00:29:33,000 --> 00:29:39,000
So these are two distinct keys.
So if these are two distinct

314
00:29:39,000 --> 00:29:45,000
keys, so they're different,
then this base representation

315
00:29:45,000 --> 00:29:50,000
has the property that they've
got to differ somewhere.

316
00:29:50,000 --> 00:29:54,000
Right?
OK, they differ in at least one

317
00:29:54,000 --> 00:29:56,000
digit.

318
00:30:08,000 --> 00:30:12,000
OK, and this is where most
people get lost because I'm

319
00:30:12,000 --> 00:30:16,000
going to make a simplification.
They could differ in any one of

320
00:30:16,000 --> 00:30:20,000
these digits.
I'm going to say they differ in

321
00:30:20,000 --> 00:30:24,000
position 0 because it doesn't
matter which one I do,

322
00:30:24,000 --> 00:30:28,000
the math is the same,
but it'll make it so that if I

323
00:30:28,000 --> 00:30:31,000
instead said they differ in
some position i,

324
00:30:31,000 --> 00:30:35,000
I would have to be taking
summations as you'll see over

325
00:30:35,000 --> 00:30:41,000
the elements that are not i,
and that's complicated.

326
00:30:41,000 --> 00:30:44,000
If I do it in position 0,
then I can just sum for the

327
00:30:44,000 --> 00:30:46,000
rest of them.
So the math is going to be

328
00:30:46,000 --> 00:30:50,000
identical if I were to do it for
any position because it's

329
00:30:50,000 --> 00:30:52,000
symmetric.
All the digits are symmetric.

330
00:30:52,000 --> 00:30:56,000
So let's say they differ in
position 0, but the same

331
00:30:56,000 --> 00:30:59,000
argument is going to be true if
they differed in some other

332
00:30:59,000 --> 00:31:02,000
position.
So let's say,

333
00:31:02,000 --> 00:31:05,000
so we're saying without loss of
generality.

334
00:31:05,000 --> 00:31:08,000
So that's without loss of
generality.

335
00:31:08,000 --> 00:31:12,000
Position 0.
Because all the positions are

336
00:31:12,000 --> 00:31:16,000
symmetric here.
And so, now we need to ask the

337
00:31:16,000 --> 00:31:19,000
question for how many --

338
00:31:24,000 --> 00:31:30,000
-- hash functions in our
universal, purportedly universal

339
00:31:30,000 --> 00:31:34,000
set do x and y collide?

340
00:31:39,000 --> 00:31:42,000
OK, we've got to count them up.
So how often do they collide?

341
00:31:42,000 --> 00:31:46,000
This is where we're going to
pull out some heavy duty number

342
00:31:46,000 --> 00:31:48,000
theory.
So we must have,

343
00:31:48,000 --> 00:31:50,000
if they collide --

344
00:31:56,000 --> 00:32:03,000
-- that h sub a of x is equal
to h sub a of y.

345
00:32:03,000 --> 00:32:09,000
That's what it means for them
to collide.

346
00:32:09,000 --> 00:32:20,000
So that implies that the sum of
i equal 0 to r of a sub i x sub

347
00:32:20,000 --> 00:32:30,000
i is equal to the sum of i
equals 0 to r of a sub i y sub i

348
00:32:30,000 --> 00:32:35,000
mod m.
Actually this is congruent mod

349
00:32:35,000 --> 00:32:38,000
m.
So congruence for those people

350
00:32:38,000 --> 00:32:43,000
who haven't seen much number
theory, is basically the way of

351
00:32:43,000 --> 00:32:48,000
essentially, rather than having
to say mod everywhere in here

352
00:32:48,000 --> 00:32:52,000
and mod everywhere in here,
we just at the end say OK,

353
00:32:52,000 --> 00:32:56,000
do a mod at the end.
Everything is being done mod,

354
00:32:56,000 --> 00:32:59,000
modulo m.
And then typically we use a

355
00:32:59,000 --> 00:33:06,000
congruence sign.
OK, there's a more mathematical

356
00:33:06,000 --> 00:33:13,000
definition but this will work
for us engineers.

357
00:33:13,000 --> 00:33:18,000
OK, so everybody with me so
far?

358
00:33:18,000 --> 00:33:23,000
This is just applying the
definition.

359
00:33:23,000 --> 00:33:32,000
So that implies that the sum of
i equals 0 to r of a_i times the quantity x_i

360
00:33:32,000 --> 00:33:41,000
minus y_i is congruent to zero mod m.
OK, just threw it on the other

361
00:33:41,000 --> 00:33:45,000
side and applied the
distributive law.

362
00:33:45,000 --> 00:33:49,000
Now what I'm going to do is
pull out the 0-th position

363
00:33:49,000 --> 00:33:53,000
because that's the one that I
care about.

364
00:33:53,000 --> 00:33:58,000
And this is where it saves me
on the math, compared to if I

365
00:33:58,000 --> 00:34:03,000
didn't say that it was 0.
I'd have to pull out x_i.

366
00:34:03,000 --> 00:34:05,000
It wouldn't matter,
but it just would make the math

367
00:34:05,000 --> 00:34:06,000
a little bit cruftier.

368
00:34:23,000 --> 00:34:30,000
OK, so now we've just pulled
out one term.
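For reference, the step being described can be written as follows. This is a reconstruction from the spoken derivation and may not match the board notation exactly.

```latex
\[
\sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}
\;\Longrightarrow\;
a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}
\;\Longrightarrow\;
a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m}.
\]
```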

369
00:34:30,000 --> 00:34:41,000
That implies that a_0 x_0 minus
y_0 is congruent to minus --

370
00:34:54,000 --> 00:34:58,000
-- mod m.
Now remember that when I have a

371
00:34:58,000 --> 00:35:02,000
minus number mod m,
I just map it into whatever,

372
00:35:02,000 --> 00:35:07,000
into that range from 0 to m
minus one.

373
00:35:07,000 --> 00:35:12,000
So for example,
minus five mod seven is two.

374
00:35:12,000 --> 00:35:19,000
So if any of these things are
negative, we simply translate

375
00:35:19,000 --> 00:35:27,000
them into that range by adding multiples of
m because adding multiples of m

376
00:35:27,000 --> 00:35:32,000
doesn't affect the congruence.
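A small illustration of that normalization, as a sketch in Python, whose `%` operator returns a result in the range 0 to m minus one when m is positive. In languages such as C or Java the remainder of a negative number can be negative, so there a multiple of m has to be added by hand.

```python
m = 7

# -5 is congruent to 2 mod 7, since -5 + 7 = 2.
normalized = -5 % m
print(normalized)

# Adding any multiple of m leaves the residue unchanged.
shifted = (-5 + 3 * m) % m
print(shifted)
```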

377
00:35:39,000 --> 00:35:41,000
OK.
And now for the next step,

378
00:35:41,000 --> 00:35:44,000
we need to use a number theory
fact.

379
00:35:44,000 --> 00:35:48,000
So let's pull out our number
theory --

380
00:35:57,000 --> 00:36:05,000
-- textbook and take a little
digression.

381
00:36:10,000 --> 00:36:14,000
So this comes from the theory
of finite fields.

382
00:36:14,000 --> 00:36:17,000
So for people who are
knowledgeable,

383
00:36:17,000 --> 00:36:21,000
that's where you're plugging
your knowledge in.

384
00:36:21,000 --> 00:36:26,000
If you're not knowledgeable,
this is a great area of math to

385
00:36:26,000 --> 00:36:30,000
learn about.
So here's the fact.

386
00:36:30,000 --> 00:36:34,000
So let m be prime.
Then for any z,

387
00:36:34,000 --> 00:36:41,000
little z element of Z sub m,
and Z sub m is the integers mod

388
00:36:41,000 --> 00:36:46,000
m.
So this is essentially numbers

389
00:36:46,000 --> 00:36:51,000
from 0 to m minus one with all
the operations,

390
00:36:51,000 --> 00:36:57,000
times, minus,
plus, etc., defined on that

391
00:36:57,000 --> 00:37:04,000
such that if you end up outside
of the range of 0 to m minus

392
00:37:04,000 --> 00:37:11,000
one, you re-normalize by
subtracting or adding multiples

393
00:37:11,000 --> 00:37:21,000
of m to get back within the
range from 0 to m minus one.

394
00:37:21,000 --> 00:37:30,000
So it's the standard thing of
just doing things modulo m.

395
00:37:30,000 --> 00:37:38,000
So for any z such that z is not
congruent to 0,

396
00:37:38,000 --> 00:37:47,000
there exists a unique z inverse
in Z sub m, such that if I

397
00:37:47,000 --> 00:37:57,000
multiply z times the inverse,
it produces something congruent

398
00:37:57,000 --> 00:38:04,000
to one mod m.
So for any number it says,

399
00:38:04,000 --> 00:38:11,000
I can find another number that
when multiplied by it gives me

400
00:38:11,000 --> 00:38:15,000
one.
So let's just do an example for

401
00:38:15,000 --> 00:38:18,000
m equals seven.
So here we have,

402
00:38:18,000 --> 00:38:24,000
we'll make a little table.
So z is not equal to 0,

403
00:38:24,000 --> 00:38:29,000
so I just write down the other
numbers.

404
00:38:29,000 --> 00:38:35,000
And let's figure out what z
inverse is.

405
00:38:35,000 --> 00:38:41,000
So what's the inverse of one?
What number when multiplied by

406
00:38:41,000 --> 00:38:43,000
one gives me one?
One.

407
00:38:43,000 --> 00:38:45,000
Good.
How about two?

408
00:38:45,000 --> 00:38:51,000
What number when I multiply it
by two gives me one?

409
00:38:51,000 --> 00:38:55,000
Four.
Because two times four is eight

410
00:38:55,000 --> 00:39:01,000
and eight is congruent to one
mod seven.

411
00:39:01,000 --> 00:39:04,000
So I've re-normalized it.
What about three?

412
00:39:12,000 --> 00:39:13,000
Five.
Good.

413
00:39:13,000 --> 00:39:16,000
Five.
Three times five is 15.

414
00:39:16,000 --> 00:39:22,000
That's congruent to one mod
seven because 15 divided by

415
00:39:22,000 --> 00:39:28,000
seven is two remainder of one.
So that's the key thing.

416
00:39:28,000 --> 00:39:32,000
What about four?
Two.

417
00:39:32,000 --> 00:39:36,000
Five? Three. And six.

418
00:39:43,000 --> 00:39:43,000
Yeah.
Six.
Yeah, six it turns out.
OK, six times six is 36.

419
00:39:48,000 --> 00:39:52,000
OK, mod seven.
Basically subtract off the 35,

420
00:39:52,000 --> 00:39:56,000
gives me one.
So people have observed some

421
00:39:56,000 --> 00:40:02,000
interesting facts that if one
number's an inverse of another,

422
00:40:02,000 --> 00:40:08,000
then that other is an inverse
of the one.
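
The table for m equals seven can be reproduced with a short Python sketch. It assumes Python 3.8 or later, where the built-in `pow` accepts an exponent of -1 together with a modulus; the name `inverse` is just an illustrative choice.

```python
m = 7  # a prime modulus, as in the example

# pow(z, -1, m) computes the modular inverse (Python 3.8 or later).
inverse = {z: pow(z, -1, m) for z in range(1, m)}

for z, z_inv in inverse.items():
    print(z, z_inv, (z * z_inv) % m)  # last column is always 1
```

Looking up `inverse[inverse[z]]` gives back `z`, which is the symmetry observed in the table.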

423
00:40:08,000 --> 00:40:12,000
So that's actually one of these
things that you prove when you

424
00:40:12,000 --> 00:40:16,000
do group theory and field theory
and so forth.

425
00:40:16,000 --> 00:40:21,000
There are all sorts of other
great properties of this kind of

426
00:40:21,000 --> 00:40:23,000
math.
But the main thing is,

427
00:40:23,000 --> 00:40:27,000
and this turns out not to be
true if m is not a prime.

428
00:40:27,000 --> 00:40:31,000
So can somebody think of,
imagine we're doing something

429
00:40:31,000 --> 00:40:36,000
mod 10.
Can somebody think of a number

430
00:40:36,000 --> 00:40:39,000
that doesn't have an inverse mod
10?

431
00:40:39,000 --> 00:40:40,000
Yeah.
Two.

432
00:40:40,000 --> 00:40:45,000
Another one is five.
OK, it turns out the divisors

433
00:40:45,000 --> 00:40:49,000
in fact actually,
more generally,

434
00:40:49,000 --> 00:40:53,000
something that is not
relatively prime,

435
00:40:53,000 --> 00:40:58,000
meaning that it has a common
factor, the GCD is not one

436
00:40:58,000 --> 00:41:04,000
between that number and the
modulus.

437
00:41:04,000 --> 00:41:08,000
OK, those numbers do not have
an inverse mod m.
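
A brute-force check of the mod 10 case, as a sketch. The helper `has_inverse` is a hypothetical name; it simply tries every candidate, and the result is compared against the GCD condition.

```python
from math import gcd

def has_inverse(z, m):
    """True if some w in 1..m-1 satisfies z * w = 1 (mod m)."""
    return any((z * w) % m == 1 for w in range(1, m))

m = 10
no_inverse = [z for z in range(1, m) if not has_inverse(z, m)]
shares_factor = [z for z in range(1, m) if gcd(z, m) != 1]
print(no_inverse)     # [2, 4, 5, 6, 8]
print(shares_factor)  # [2, 4, 5, 6, 8]
```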

438
00:41:08,000 --> 00:41:13,000
OK, but if it's prime,
every nonzero number is relatively

439
00:41:13,000 --> 00:41:17,000
prime to the modulus.
And that's the property that

440
00:41:17,000 --> 00:41:22,000
we're taking advantage of.
So this is our fact and so,

441
00:41:22,000 --> 00:41:28,000
in this case what I'm after is
I want to divide by x_0 minus

442
00:41:28,000 --> 00:41:31,000
y_0.
That's what I want to do at

443
00:41:31,000 --> 00:41:34,000
this point.
But I can't do that if x_0,

444
00:41:34,000 --> 00:41:36,000
first of all,
if m isn't prime,

445
00:41:36,000 --> 00:41:40,000
I can't necessarily do that.
I might be able to,

446
00:41:40,000 --> 00:41:43,000
but I can't necessarily.
But if m is prime,

447
00:41:43,000 --> 00:41:46,000
I can definitely divide by x_0
minus y_0.

448
00:41:46,000 --> 00:41:49,000
I can find that inverse.
And the other thing I have to

449
00:41:49,000 --> 00:41:52,000
do is make sure x_0 minus y_0 is
not 0.

450
00:41:52,000 --> 00:41:57,000
OK, it would be 0 if these two
were equal, but our supposition

451
00:41:57,000 --> 00:42:01,000
was they weren't equal.
And once again,

452
00:42:01,000 --> 00:42:05,000
just bringing it back to the
without loss of generality,

453
00:42:05,000 --> 00:42:08,000
if it were some other position
that we were off,

454
00:42:08,000 --> 00:42:13,000
I would be doing exactly the
same thing with that position.

455
00:42:13,000 --> 00:42:16,000
So now we're going to be able
to divide.

456
00:42:16,000 --> 00:42:19,000
So we continue with our --

457
00:42:24,000 --> 00:42:33,000
-- continue with our proof.
So since x_0 is not equal to

458
00:42:33,000 --> 00:42:42,000
y_0, there exists an inverse for
x_0 minus y_0.

459
00:42:42,000 --> 00:42:48,000
And that implies,
just continue on from over

460
00:42:48,000 --> 00:42:56,000
there, that a_0 is congruent
therefore to minus the sum of i

461
00:42:56,000 --> 00:43:04,000
equal one to r of a_i times
(x_i minus y_i), times (x_0 minus

462
00:43:04,000 --> 00:43:10,000
y_0) inverse.
So let's just go back to the

463
00:43:10,000 --> 00:43:15,000
beginning of our proof and see
what we've derived.

464
00:43:15,000 --> 00:43:19,000
If we're saying we have two
distinct keys,

465
00:43:19,000 --> 00:43:24,000
and we've picked all of these
a_i randomly,

466
00:43:24,000 --> 00:43:30,000
and we're saying that these two
distinct keys hash to the same

467
00:43:30,000 --> 00:43:34,000
place.
If they hash to the same place,

468
00:43:34,000 --> 00:43:41,000
it says that a_0 essentially
had to have a particular value

469
00:43:41,000 --> 00:43:47,000
as a function of the other a_i.
Because in other words,

470
00:43:47,000 --> 00:43:51,000
once I've picked each of these
a_i from one to r,

471
00:43:51,000 --> 00:43:54,000
if I did them in that order,
for example,

472
00:43:54,000 --> 00:43:58,000
then I don't have a choice for
how I pick a_0 to make it

473
00:43:58,000 --> 00:44:00,000
collide.
Exactly one value allows it to

474
00:44:00,000 --> 00:44:05,000
collide, namely the value of a_0
given by this.

475
00:44:05,000 --> 00:44:10,000
If I picked a different value
of a_0, they wouldn't collide.

476
00:44:10,000 --> 00:44:16,000
So let me write that down.
Thus, while you think about it

477
00:45:12,000 --> 00:45:18,000
So for any choice of these a_i,
there's exactly one of the

478
00:45:18,000 --> 00:45:24,000
m possible choices of a_0 that
causes a collision.

479
00:45:24,000 --> 00:45:29,000
And for all the other choices I
might make of a_0,

480
00:45:29,000 --> 00:45:36,000
there's no collision.
So essentially I don't have,

481
00:45:36,000 --> 00:45:42,000
if they're going to collide,
I've reduced essentially the

482
00:45:42,000 --> 00:45:49,000
number of degrees of freedom of
my randomness by a factor of m.

483
00:45:49,000 --> 00:45:55,000
So if I count up the number of
h_a's that cause x and y to

484
00:45:55,000 --> 00:46:01,000
collide, that's equal to,
well, there's m choices,

485
00:46:01,000 --> 00:46:06,000
just using the product rule
again.

486
00:46:06,000 --> 00:46:13,000
There's m choices for a_1 times
m choices for a_2,

487
00:46:13,000 --> 00:46:21,000
up to m choices for a_r and
then only one choice for a_0.

488
00:46:21,000 --> 00:46:28,000
So this is choices for a_1,
a_2, a_r and only one choice

489
00:46:28,000 --> 00:46:35,000
for a_0 if they're going to
collide.

490
00:46:35,000 --> 00:46:40,000
If they're not going to
collide, I've got more choices

491
00:46:40,000 --> 00:46:43,000
for a_0.
But if I want them to collide,

492
00:46:43,000 --> 00:46:48,000
there's only one value I can
pick, namely this value.

493
00:46:48,000 --> 00:46:53,000
That's the only value for which
they will collide.

494
00:46:53,000 --> 00:46:58,000
And that's equal to m to the r,
which is just the size of H

495
00:46:58,000 --> 00:47:03,000
divided by m.
And that completes the proof.
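
The counting argument can be checked by brute force on a tiny instance. This is a sketch under assumed sizes (m equals 5, r equals 2) with two arbitrarily chosen keys that differ in position 0; the names `h`, `functions`, and `colliding` are illustrative.

```python
from itertools import product

m, r = 5, 2   # small prime modulus; keys have r + 1 digits (hypothetical sizes)

def h(a, x):
    """Dot-product hash: sum of a_i * x_i, reduced mod m."""
    return sum(ai * xi for ai, xi in zip(a, x)) % m

x = (1, 3, 4)
y = (2, 3, 0)   # differs from x in position 0

functions = list(product(range(m), repeat=r + 1))   # all m**(r+1) choices of a
colliding = [a for a in functions if h(a, x) == h(a, y)]
print(len(colliding), m ** r)    # both are 25

# Each colliding a has a_0 forced by the other a_i, as in the proof.
inv = pow((x[0] - y[0]) % m, -1, m)
for a in colliding:
    rest = sum(a[i] * (x[i] - y[i]) for i in range(1, r + 1))
    assert a[0] == (-rest * inv) % m
```

Out of the 125 functions, exactly 25 collide on this pair, which is the size of H divided by m.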

496
00:47:11,000 --> 00:47:14,000
So there are other universal
constructions,

497
00:47:14,000 --> 00:47:18,000
but this is a particularly
elegant one.

498
00:47:18,000 --> 00:47:22,000
So the point is that I have m
plus one, sorry,

499
00:47:22,000 --> 00:47:27,000
r plus one degrees of freedom
where each degree of freedom I

500
00:47:27,000 --> 00:47:33,000
have m choices.
But if I want them to collide,

501
00:47:33,000 --> 00:47:40,000
once I've picked any of the,
once I've picked r of those

502
00:47:40,000 --> 00:47:45,000
possible choices,
the last one is forced if I

503
00:47:45,000 --> 00:47:48,000
want it to collide.
So therefore,

504
00:47:48,000 --> 00:47:55,000
the set of functions for which
it collides is only one in m.

505
00:47:55,000 --> 00:48:01,000
A very slick construction.
Very slick.

506
00:48:01,000 --> 00:48:03,000
OK.
Everybody with me here?

507
00:48:03,000 --> 00:48:07,000
Didn't lose too many people?
Yeah, question.

508
00:48:07,000 --> 00:48:12,000
Well, part of it is,
actually this is a quite common

509
00:48:12,000 --> 00:48:15,000
type of thing to be doing
actually.

510
00:48:15,000 --> 00:48:19,000
If you take a class,
so we have follow on classes in

511
00:48:19,000 --> 00:48:24,000
cryptography and so forth,
and this kind of thing of

512
00:48:24,000 --> 00:48:29,000
taking dot products,
modulo m and also Galois fields

513
00:48:29,000 --> 00:48:34,000
which are particularly simple
finite fields and things like

514
00:48:34,000 --> 00:48:40,000
that, people play with these all
the time.

515
00:48:40,000 --> 00:48:43,000
So Galois fields are like using
XORs as your,

516
00:48:43,000 --> 00:48:46,000
same sort of thing as this
except base two.

517
00:48:46,000 --> 00:48:49,000
And so there's a lot of study
of this sort of thing.

518
00:48:49,000 --> 00:48:53,000
So people understand these kind
of properties.

519
00:48:53,000 --> 00:48:57,000
But yeah, it's like what's the
algorithm for having a brilliant

520
00:48:57,000 --> 00:49:01,000
insight into algorithms?
It's like OK.

521
00:49:01,000 --> 00:49:05,000
Wish I knew.
Then I'd just turn the crank.

522
00:49:05,000 --> 00:49:11,000
[LAUGHTER] But if it were that
easy, I wouldn't be standing up

523
00:49:11,000 --> 00:49:13,000
here today.
[LAUGHTER] Good.

524
00:49:13,000 --> 00:49:19,000
OK, so now I want to take on
another topic which is also I

525
00:49:19,000 --> 00:49:22,000
find, I think this is
astounding.

526
00:49:22,000 --> 00:49:27,000
It's just beautiful,
beautiful mathematics and a big

527
00:49:27,000 --> 00:49:34,000
impact on your ability to build
good hash functions.

528
00:49:34,000 --> 00:49:37,000
Now I want to talk about
another topic,

529
00:49:37,000 --> 00:49:41,000
which is related,
which is the topic of perfect

530
00:49:41,000 --> 00:49:42,000
hashing.

531
00:49:54,000 --> 00:49:59,000
So everything we've done so far
does expected time performance.

532
00:49:59,000 --> 00:50:03,000
Hashing is good in the expected
sense.

533
00:50:03,000 --> 00:50:08,000
Perfect hashing addresses the
following question.

534
00:50:08,000 --> 00:50:14,000
Suppose that I gave you a set
of keys, and I said just build

535
00:50:14,000 --> 00:50:20,000
me a static table so I can look
up whether the key is in the

536
00:50:20,000 --> 00:50:25,000
table with worst case time.
Good worst case time.

537
00:50:25,000 --> 00:50:31,000
So I have a fixed set of keys.
They might be something like

538
00:50:31,000 --> 00:50:37,000
for example, the hundred most
common or thousand most common

539
00:50:37,000 --> 00:50:42,000
words in English.
And when I get a word I want to

540
00:50:42,000 --> 00:50:47,000
check quickly in this table,
is the word that I've got one

541
00:50:47,000 --> 00:50:49,000
of the most common words in
English.

542
00:50:49,000 --> 00:50:54,000
I would like to do that not
with expected performance,

543
00:50:54,000 --> 00:50:57,000
but guaranteed worst case
performance.

544
00:50:57,000 --> 00:51:03,000
Is there a way of building it
so that I can find this quickly?

545
00:51:03,000 --> 00:51:06,000
So the problem is given n keys
--

546
00:51:12,000 --> 00:51:14,000
-- construct a static hash
table.

547
00:51:14,000 --> 00:51:17,000
In other words,
no insertion and deletion.

548
00:51:17,000 --> 00:51:20,000
We're just going to put the
elements in there.

549
00:51:20,000 --> 00:51:22,000
Of size --

550
00:51:30,000 --> 00:51:37,000
-- m equals Order n.
So I don't want it to be a huge

551
00:51:37,000 --> 00:51:42,000
table.
I want it to be a table that is

552
00:51:42,000 --> 00:51:50,000
the size of my keys.
Table of size m equals Order n,

553
00:51:50,000 --> 00:51:59,000
such that search takes O(1)
time in the worst case.

554
00:52:06,000 --> 00:52:10,000
So there's no place in the
table where I'm going to have,

555
00:52:10,000 --> 00:52:14,000
I know in the average case,
that's not hard to do.

556
00:52:14,000 --> 00:52:18,000
But in the worst case,
I want to make sure that

557
00:52:18,000 --> 00:52:22,000
there's no particular spot where
the number of keys piles up to

558
00:52:22,000 --> 00:52:26,000
be a large number.
OK, in no spot should that

559
00:52:26,000 --> 00:52:29,000
happen.
Every single search I do should

560
00:52:29,000 --> 00:52:33,000
take Order one time.
There shouldn't be any

561
00:52:33,000 --> 00:52:37,000
statistical variation in terms
of how long it takes me to get

562
00:52:37,000 --> 00:52:39,000
something.
Does everybody understand what

563
00:52:39,000 --> 00:52:42,000
the puzzle is?
So this is a great,

564
00:52:42,000 --> 00:52:45,000
because this actually ends up
having a lot of uses.

565
00:52:45,000 --> 00:52:49,000
You know, you want to build a
table for something and you know

566
00:52:49,000 --> 00:52:52,000
what the values are that you're
going to look up in it.

567
00:52:52,000 --> 00:52:56,000
But you don't want to spend a
lot of space on it and so forth.

568
00:52:56,000 --> 00:53:00,000
So the idea here is actually
going to be to use a two-level

569
00:53:00,000 --> 00:53:02,000
scheme.

570
00:53:09,000 --> 00:53:22,000
So the idea is we're going to
use a two-level scheme with

571
00:53:22,000 --> 00:53:31,000
universal hashing at both
levels.

572
00:53:31,000 --> 00:53:36,000
So the idea is we're going to
hash, we're going to have a hash

573
00:53:36,000 --> 00:53:41,000
table, we're going to hash into
slots, but rather than using

574
00:53:41,000 --> 00:53:46,000
chaining, we're going to have
another hash table there.

575
00:53:46,000 --> 00:53:51,000
We're going to do a second hash
into the second hash table.

576
00:53:51,000 --> 00:53:56,000
And the idea is that we're
going to do it in such a way

577
00:53:56,000 --> 00:54:01,000
that we have no collisions at
level two.

578
00:54:01,000 --> 00:54:03,000
So we may have collisions at
level one.

579
00:54:03,000 --> 00:54:08,000
We'll take anything that
collides at level one and put

580
00:54:08,000 --> 00:54:12,000
them into a hash table and then
our second level hash table,

581
00:54:12,000 --> 00:54:15,000
but that hash table,
no collisions.

582
00:54:15,000 --> 00:54:17,000
Boom.
We're just going to hash right

583
00:54:17,000 --> 00:54:20,000
in there.
And it'll just go boom to its

584
00:54:20,000 --> 00:54:23,000
thing.
So let's draw a picture of this

585
00:54:23,000 --> 00:54:28,000
to illustrate the scheme.
OK, so we have --

586
00:54:34,000 --> 00:54:37,000
-- 0 one, let's say six,
m minus one.

587
00:54:37,000 --> 00:54:42,000
So here's our hash table.
And what we're going to do is

588
00:54:42,000 --> 00:54:47,000
we're going to use universal
hashing at the first level,

589
00:54:47,000 --> 00:54:49,000
OK.
So we find a universal hash

590
00:54:49,000 --> 00:54:52,000
function.
We pick a hash function at

591
00:54:52,000 --> 00:54:56,000
random.
And what we'll do is we'll hash

592
00:54:56,000 --> 00:55:00,000
into that level.
And then what we'll do is we'll

593
00:55:00,000 --> 00:55:05,000
keep track of two things.
One is what the size of the

594
00:55:05,000 --> 00:55:09,000
hash table is at the next level.
So in this case,

595
00:55:09,000 --> 00:55:13,000
the size of the hash table, that is,
the number of slots,

596
00:55:13,000 --> 00:55:17,000
is going to be four.
And we're also going to keep a

597
00:55:17,000 --> 00:55:19,000
separate hash key for the second
level.

598
00:55:19,000 --> 00:55:23,000
So each slot will have its own
hash function for the second

599
00:55:23,000 --> 00:55:25,000
level.
So for example,

600
00:55:25,000 --> 00:55:30,000
this one might have a key of 31
that is a random number.

601
00:55:30,000 --> 00:55:32,000
The a's here.
a's up there.

602
00:55:32,000 --> 00:55:34,000
There we go,
a's up there.

603
00:55:34,000 --> 00:55:39,000
So that's going to be the basis
of my hash function,

604
00:55:39,000 --> 00:55:42,000
the key with which I'm going to
hash.

605
00:55:42,000 --> 00:55:46,000
This one say has 86.
And let's say that this,

606
00:55:46,000 --> 00:55:50,000
and then we have a pointer to
the hash table.

607
00:55:50,000 --> 00:55:55,000
This is say S_1.
And it's got four slots and we

608
00:55:55,000 --> 00:56:01,000
stored up 14 and 27.
And these two slots are empty.

609
00:56:01,000 --> 00:56:09,000
And this one for example,
had what?

610
00:56:09,000 --> 00:56:12,000
Two nines.

611
00:56:28,000 --> 00:56:34,000
So the idea here is that in
this case, if we look overall,

612
00:56:34,000 --> 00:56:40,000
our top level hash function,
which I'll just call h,

613
00:56:40,000 --> 00:56:47,000
has that h of 14 is equal to h
of 27 is equal to one.

614
00:56:47,000 --> 00:56:53,000
Because we're in slot one.
OK, so these two both hash to

615
00:56:53,000 --> 00:56:57,000
the same slot in the level one
hash table.

616
00:56:57,000 --> 00:57:02,000
This is level one.
And this is level two over

617
00:57:02,000 --> 00:57:06,000
here.
So level one hashing,

618
00:57:06,000 --> 00:57:11,000
14 and 27 collided.
They went into the same slot

619
00:57:11,000 --> 00:57:13,000
here.
But at level two,

620
00:57:13,000 --> 00:57:20,000
they got hashed to different
places and the hash function I

621
00:57:20,000 --> 00:57:26,000
use is going to be indexed by
whatever the random numbers are

622
00:57:26,000 --> 00:57:33,000
that I chose and found for those
and I'll show you how we find

623
00:57:33,000 --> 00:57:36,000
those.
We have then h sub 31 of 14 is

624
00:57:36,000 --> 00:57:43,000
equal to one and h sub 31 of 27 is
equal to two.

625
00:57:43,000 --> 00:57:46,000
For level two.
So I go, hash in here,

626
00:57:46,000 --> 00:57:51,000
find the, use this as the basis
of my hash function to hash into

627
00:57:51,000 --> 00:57:55,000
whatever table I've got here.
And so, if there are no,

628
00:57:55,000 --> 00:58:00,000
if I can guarantee that there
are no collisions at level two,

629
00:58:00,000 --> 00:58:05,000
this is going to cost me Order
one time in the worst case to

630
00:58:05,000 --> 00:58:09,000
look something up.
How do I look it up?

631
00:58:09,000 --> 00:58:12,000
Take the value.
I apply h to it.

632
00:58:12,000 --> 00:58:16,000
That takes me to some slot.
Then I look to see what the key

633
00:58:16,000 --> 00:58:21,000
is for this hash function.
I apply that hash function and

634
00:58:21,000 --> 00:58:24,000
that takes me to another slot.
Then I go there.

635
00:58:24,000 --> 00:58:29,000
And that took me basically two
applications of hash functions

636
00:58:29,000 --> 00:58:33,000
plus some look-up,
plus who knows what minor

637
00:58:33,000 --> 00:58:41,000
amount of bookkeeping.
So the reason we're going to

638
00:58:41,000 --> 00:58:50,000
have no collisions at this level
is the following.

639
00:58:50,000 --> 00:59:01,000
If there are n sub i items that
hash to a level one slot i,

640
00:59:01,000 --> 00:59:11,000
then we're going to use m sub
i, which is equal to n sub i

641
00:59:11,000 --> 00:59:21,000
squared slots in the level two
hash table.

642
00:59:29,000 --> 00:59:33,000
OK, so I should have mentioned
here this is going to be m sub

643
00:59:33,000 --> 00:59:37,000
i, the size of the hash table
and this is going to be my a sub

644
00:59:37,000 --> 00:59:39,000
i essentially.

645
00:59:45,000 --> 00:59:50,000
So I'm going to use,
so basically I'm going to hash

646
00:59:50,000 --> 00:59:55,000
n sub i things into n sub i
squared locations here.

647
00:59:55,000 --> 01:00:00,000
So this is going to be
incredibly sparse.

648
01:00:00,000 --> 01:00:02,480
OK, it's going to be quadratic
in size.

649
01:00:02,480 --> 01:00:05,612
And so what I'm going to show
is that under those

650
01:00:05,612 --> 01:00:08,418
circumstances,
it's easy for me to find hash

651
01:00:08,418 --> 01:00:11,159
functions such that there are no
collisions.

652
01:00:11,159 --> 01:00:15,010
That's the name of the game.
Figure out how can I make these

653
01:00:15,010 --> 01:00:18,012
hash functions so that there are
no collisions.

654
01:00:18,012 --> 01:00:21,341
So that's why I draw this with
so few elements here.

655
01:00:21,341 --> 01:00:24,604
So here for example,
I have two elements and I have

656
01:00:24,604 --> 01:00:27,867
a hash table size four here.
I have three elements.

657
01:00:27,867 --> 01:00:32,520
I need a hash table size nine.
OK, if there are a hundred

658
01:00:32,520 --> 01:00:34,918
elements, I need a hash table
size 10,000.

659
01:00:34,918 --> 01:00:38,485
I'm not going to pick something
such that it's likely that there's

660
01:00:38,485 --> 01:00:41,350
anything of that size.
And then the fact that this

661
01:00:41,350 --> 01:00:44,801
actually works and gives us all
the properties that we want,

662
01:00:44,801 --> 01:00:48,251
that's part of the analysis.
So does everybody see that this

663
01:00:48,251 --> 01:00:51,877
takes Order one worst case time
and what the basic structure of

664
01:00:51,877 --> 01:00:52,988
it is?
These things,

665
01:00:52,988 --> 01:00:55,210
by the way, are not in this
case prime.

666
01:00:55,210 --> 01:00:58,134
I could always pick primes that
were close to this.

667
01:00:58,134 --> 01:01:03,730
I didn't do that in this case.
Or I could use a universal hash

668
01:01:03,730 --> 01:01:09,103
function that in fact would work
for things other than primes.

669
01:01:09,103 --> 01:01:12,362
But I didn't do that for this
example.
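
The two-level structure can be sketched in Python as follows. This is not the lecture's dot-product family: because the level-two sizes n sub i squared are not prime, the sketch swaps in the family h(k) = ((a k + b) mod P) mod m, which is universal for any table size m. The prime `P`, the names `random_hash`, `build`, and `contains`, and the choice to pick the level-one function only once are all assumptions. Keys are assumed to be distinct nonnegative integers below `P`. Level-two functions are found by retrying random choices until a bucket has no collisions.

```python
import random

P = 2_147_483_647  # a prime larger than any key used here (assumption)

def random_hash(m, rng):
    """Pick h(k) = ((a*k + b) mod P) mod m at random from the family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m

def build(keys, rng):
    n = max(1, len(keys))
    h1 = random_hash(n, rng)                 # level one: m = n slots
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)
    level2 = []
    for bucket in buckets:
        m_i = len(bucket) ** 2               # n_i squared slots
        while True:                          # retry until no collisions
            h2 = random_hash(m_i, rng) if m_i else None
            slots = [None] * m_i
            ok = True
            for k in bucket:
                j = h2(k)
                if slots[j] is not None:     # collision: try another function
                    ok = False
                    break
                slots[j] = k
            if ok:
                break
        level2.append((h2, slots))
    return h1, level2

def contains(table, k):
    """Two hash evaluations and one comparison, with no chains to walk."""
    h1, level2 = table
    h2, slots = level2[h1(k)]
    return bool(slots) and slots[h2(k)] == k

rng = random.Random(0)
table = build([14, 27, 31, 86, 9, 40, 52], rng)
print(contains(table, 27), contains(table, 28))  # True False
```

A lookup never scans a list, so its cost does not depend on which key is asked for.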

670
01:01:12,362 --> 01:01:16,943
We all ready for analysis?
OK, let's do some analysis

671
01:01:16,943 --> 01:01:18,000
then.

672
01:01:29,000 --> 01:01:31,000
And this is really pretty
analysis.

673
01:01:31,000 --> 01:01:33,528
Partly as you'll see because
we've already done some of this

674
01:01:33,528 --> 01:01:34,000
analysis.

675
01:01:50,000 --> 01:01:53,238
So the trick is analyzing level
two.

676
01:01:53,238 --> 01:01:57,309
That's the main thing that I
want to analyze,

677
01:01:57,309 --> 01:02:02,583
to show that I can find hash
functions here that are going

678
01:02:02,583 --> 01:02:06,192
to, when I map them into,
very sparsely,

679
01:02:06,192 --> 01:02:09,523
into these arrays here,
that in fact,

680
01:02:09,523 --> 01:02:16,000
such hash functions exist and I
can compute them in advance.

681
01:02:16,000 --> 01:02:23,344
So that I have a good way of
storing those.

682
01:02:23,344 --> 01:02:30,338
So here's the theorem we're
going to use.

683
01:02:30,338 --> 01:02:40,830
If I hash n keys into m equals
n squared slots using a random

684
01:02:40,830 --> 01:02:48,000
hash function in a universal set
H.

685
01:02:48,000 --> 01:03:00,393
Then the expected number of
collisions is less than one

686
01:03:00,393 --> 01:03:02,502
half.
OK.

687
01:03:02,502 --> 01:03:11,372
The expected number of
collisions I don't expect there

688
01:03:11,372 --> 01:03:20,577
to be even one collision.
I expect there to be less than

689
01:03:20,577 --> 01:03:29,447
half a collision on average.
And so, let's prove this,

690
01:03:29,447 --> 01:03:39,154
so that the probability that
two given keys collide under h

691
01:03:39,154 --> 01:03:45,216
is what?
What's the probability that two

692
01:03:45,216 --> 01:03:51,443
given keys collide under h when
h is chosen randomly from the

693
01:03:51,443 --> 01:03:54,037
universal set?
One over m.

694
01:03:54,037 --> 01:03:56,943
Right?
That's the definition,

695
01:03:56,943 --> 01:04:02,235
right, of, which is in this
case equal to one over n

696
01:04:02,235 --> 01:04:06,210
squared.
So now how many keys,

697
01:04:06,210 --> 01:04:11,052
how many pairs of keys do I
have in this table?

698
01:04:11,052 --> 01:04:16,526
How many keys could possibly
collide with each other?

699
01:04:16,526 --> 01:04:19,368
OK.
So that's basically just

700
01:04:19,368 --> 01:04:25,157
looking at how many different
pairs of keys do I have to

701
01:04:25,157 --> 01:04:30,315
evaluate this for.
So that's n choose two pairs of

702
01:04:30,315 --> 01:04:36,654
keys.
n choose two pairs of keys.

703
01:04:36,654 --> 01:04:42,689
So therefore,
the expected number of

704
01:04:42,689 --> 01:04:52,172
collisions is, well, for each of
these n, not n over two.

705
01:04:52,172 --> 01:05:00,793
n choose two pairs of keys.
The probability that it

706
01:05:00,793 --> 01:05:08,923
collides is one in n squared.
So that's equal to n times n

707
01:05:08,923 --> 01:05:12,221
minus one over two,
if you remember your formula,

708
01:05:12,221 --> 01:05:16,000
times one in n squared.
And that's less than a half.
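
The arithmetic can be checked exactly with a small Python sketch. The function name `expected_collisions_bound` is illustrative, and the values of n are arbitrary.

```python
from fractions import Fraction
from math import comb

def expected_collisions_bound(n):
    """C(n, 2) pairs, each colliding with probability 1 / n**2."""
    return comb(n, 2) * Fraction(1, n * n)

for n in (2, 3, 10, 100):
    print(n, expected_collisions_bound(n))   # e.g. 2 -> 1/4, 100 -> 99/200
```

The product simplifies to (n minus one) over 2n, which stays below one half for every n.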

709
01:05:24,000 --> 01:05:28,183
So for every pair of keys,
so those of you who remember

710
01:05:28,183 --> 01:05:33,063
from 6.042 the birthday paradox,
this is related to the birthday

711
01:05:33,063 --> 01:05:36,800
paradox a little bit.
But here I basically have a

712
01:05:36,800 --> 01:05:40,333
large set, and I'm looking at
all pairs, but my set is

713
01:05:40,333 --> 01:05:44,000
sufficiently big that the odds
that I get a collision is

714
01:05:44,000 --> 01:05:47,199
relatively small.
If I start increasing it beyond

715
01:05:47,199 --> 01:05:50,400
the square root of m,
OK, the number of elements,

716
01:05:50,400 --> 01:05:54,466
it starts getting bigger than the
square root of m, then the odds

717
01:05:54,466 --> 01:05:57,733
of a collision go up
dramatically as you know from

718
01:05:57,733 --> 01:06:01,532
the birthday paradox.
But if I'm less than,

719
01:06:01,532 --> 01:06:05,401
if I'm really sparse in there,
I don't get collisions.

720
01:06:05,401 --> 01:06:09,197
Or at least I get a relatively
small number expected.

721
01:06:09,197 --> 01:06:13,430
Now I want to remind you of
something which actually in the

722
01:06:13,430 --> 01:06:17,080
past I have just assumed,
but I want to actually go

723
01:06:17,080 --> 01:06:20,291
through it briefly.
It's Markov's inequality.

724
01:06:20,291 --> 01:06:22,919
So who remembers Markov's
inequality?

725
01:06:22,919 --> 01:06:25,839
Don't everybody raise their
hand at once.

726
01:06:25,839 --> 01:06:30,000
So Markov's inequality says the
following.

727
01:06:30,000 --> 01:06:34,145
This is one of these great
probability facts.

728
01:06:34,145 --> 01:06:38,762
For random variable x which is
bounded below by 0,

729
01:06:38,762 --> 01:06:44,227
says the probability that x is
bigger than, greater than or

730
01:06:44,227 --> 01:06:49,316
equal to any given value T is
less than or equal to the

731
01:06:49,316 --> 01:06:53,838
expectation of x divided by T.
It's a great fact.

732
01:06:53,838 --> 01:06:57,796
Doesn't hold if x isn't bounded
below by 0.

733
01:06:57,796 --> 01:07:03,230
But it's a great fact.
It allows me to relate the

734
01:07:03,230 --> 01:07:06,833
probability of an event to its
expectation.

735
01:07:06,833 --> 01:07:12,066
And the idea is in general that
if the expectation is going to

736
01:07:12,066 --> 01:07:17,213
be small, then I can't have a
high probability that the value

737
01:07:17,213 --> 01:07:21,845
of the random variable is large.
It doesn't make sense.

738
01:07:21,845 --> 01:07:26,649
How could you have a high
probability that it's a million

739
01:07:26,649 --> 01:07:31,968
when my expectation is one or in
this case we're going to apply

740
01:07:31,968 --> 01:07:36,000
it when the expectation is a
half?

741
01:07:36,000 --> 01:07:39,676
Couldn't happen.
And the proof follows just

742
01:07:39,676 --> 01:07:44,666
directly from the definition of
expectation, and so I'm doing

743
01:07:44,666 --> 01:07:47,730
this for a discrete random
variable.

744
01:07:47,730 --> 01:07:52,282
So the expectation by
definition is just the sum from

745
01:07:52,282 --> 01:07:57,622
little x goes to 0 to infinity
of x times the probability that

746
01:07:57,622 --> 01:08:02,000
my random variable takes on the
value x.

747
01:08:02,000 --> 01:08:06,560
That's the definition.
And now it's just a question of

748
01:08:06,560 --> 01:08:11,120
doing like the coarsest
approximation you can imagine.

749
01:08:11,120 --> 01:08:14,734
First of all,
let me just simply throw away

750
01:08:14,734 --> 01:08:19,725
all the small terms. That's greater
than or equal to the sum from x equals

751
01:08:19,725 --> 01:08:24,716
T to infinity of x times the
probability that x is equal to

752
01:08:24,716 --> 01:08:28,072
little x.
So just throw away all the low

753
01:08:28,072 --> 01:08:31,426
order terms.
Now what I'm going to do is

754
01:08:31,426 --> 01:08:36,848
observe that every one of these terms
is lower bounded by the value at x

755
01:08:36,848 --> 01:08:42,875
equals T.
So that's just the summation of

756
01:08:42,875 --> 01:08:49,750
x equals T to infinity of T
times the probability that x

757
01:08:49,750 --> 01:08:51,250
equals x.
OK.

758
01:08:51,250 --> 01:08:58,250
Over x going from T and larger.
Because these are only bigger

759
01:08:58,250 --> 01:09:02,009
values.
And that's just equal then to

760
01:09:02,009 --> 01:09:06,306
T, because I can pull that out,
times the summation of x equals T

761
01:09:06,306 --> 01:09:10,256
to infinity of the probability
that x equals x, which is just the

762
01:09:10,256 --> 01:09:14,000
probability that x is greater
than or equal to T.

763
01:09:20,000 --> 01:09:26,000
And that's done because I just
divide by T.
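
Markov's inequality can be checked exactly on a small discrete example. The fair six-sided die used here is an illustrative choice, not something from the lecture, and `tail` is a hypothetical helper name.

```python
from fractions import Fraction

# A fair six-sided die as the nonnegative random variable (illustrative choice).
dist = {x: Fraction(1, 6) for x in range(1, 7)}
expectation = sum(x * p for x, p in dist.items())   # 7/2

def tail(t):
    """Probability that X >= t."""
    return sum(p for x, p in dist.items() if x >= t)

for t in range(1, 7):
    print(t, tail(t), expectation / t, tail(t) <= expectation / t)
```

The last column is True for every t, although the bound is loose: for t equal to six it gives 7/12 against a true probability of 1/6.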

764
01:09:31,000 --> 01:09:34,379
So that's Markov's inequality.
Really dumb.

765
01:09:34,379 --> 01:09:37,919
Really simple.
There are much stronger things

766
01:09:37,919 --> 01:09:42,264
like Chebyshev bounds and
Chernoff bounds and things of

767
01:09:42,264 --> 01:09:44,839
that nature.
But Markov's is like

768
01:09:44,839 --> 01:09:49,586
unbelievably simple and useful.
So we're going to just apply

769
01:09:49,586 --> 01:09:52,000
that as a corollary.

770
01:10:06,000 --> 01:10:13,059
So the probability now of no
collisions, when I hash n keys

771
01:10:13,059 --> 01:10:19,391
into n squared slots using a
universal hash function,

772
01:10:19,391 --> 01:10:26,817
I claim is the probability of
no collisions is greater than or

773
01:10:26,817 --> 01:10:32,173
equal to a half.
So I pick a hash function at

774
01:10:32,173 --> 01:10:36,409
random.
What are the odds that I got no

775
01:10:36,409 --> 01:10:40,917
collisions when I hashed those n
keys into n squared slots?

776
01:10:40,917 --> 01:10:43,326
Answer.
Probability is I have no

777
01:10:43,326 --> 01:10:47,834
collisions is at least a half.
Half the time I'm guaranteed

778
01:10:47,834 --> 01:10:51,409
that there won't be a collision.
And the proof,

779
01:10:51,409 --> 01:10:54,129
pretty simple.
The probability of no

780
01:10:54,129 --> 01:10:57,549
collisions is the same as the
probability as,

781
01:10:57,549 --> 01:11:01,746
sorry, is one minus the
probability that I have at least

782
01:11:01,746 --> 01:11:05,850
one collision.
So the odds that I have at

783
01:11:05,850 --> 01:11:09,337
least one collision,
the odds that I have at least

784
01:11:09,337 --> 01:11:12,254
one collision,
probability greater than or

785
01:11:12,254 --> 01:11:15,599
equal to one collision is less
than or equal to,

786
01:11:15,599 --> 01:11:18,872
now I just apply Markov's
inequality with this.

787
01:11:18,872 --> 01:11:23,000
So it's just the expected
number of collisions --

788
01:11:29,000 --> 01:11:33,090
-- divided by one.
And that is by Markov's

789
01:11:33,090 --> 01:11:36,272
inequality less than,
by definition,

790
01:11:36,272 --> 01:11:40,181
excuse me, of expected number
of collisions,

791
01:11:40,181 --> 01:11:44,363
which we've already shown,
is less than a half.

792
01:11:44,363 --> 01:11:49,636
So the probability of at least
one collision is less than a

793
01:11:49,636 --> 01:11:52,909
half.
The probability of 0 collisions

794
01:11:52,909 --> 01:11:56,363
is at least a half.
So we're done here.
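
[Editor's note: written out, the corollary's chain of inequalities looks roughly like this. It is a sketch that assumes X counts the colliding pairs among the n keys, and that universality gives each pair a collision probability of at most 1/n^2 when there are n^2 slots.]

```latex
\begin{align*}
\Pr\{X \ge 1\} &\le \frac{\mathbb{E}[X]}{1}
                 && \text{(Markov's inequality with } t = 1\text{)} \\
               &\le \binom{n}{2} \cdot \frac{1}{n^{2}}
                = \frac{n-1}{2n} < \frac{1}{2}, \\
\Pr\{X = 0\}   &= 1 - \Pr\{X \ge 1\} > \frac{1}{2}.
\end{align*}
```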

795
01:11:56,363 --> 01:12:02,000
So to find a good level-two hash
function is easy.

796
01:12:02,000 --> 01:12:06,562
I just test a few at random.
Most of them out there,

797
01:12:06,562 --> 01:12:10,856
OK, half of them,
at least half of them are going

798
01:12:10,856 --> 01:12:13,808
to work.
So this is in some sense,

799
01:12:13,808 --> 01:12:18,102
if you think about it,
a randomized construction,

800
01:12:18,102 --> 01:12:22,664
because I can't tell you which
one it's going to be.

801
01:12:22,664 --> 01:12:27,763
It's non-constructive in that
sense, but it's a randomized

802
01:12:27,763 --> 01:12:32,485
construction.
But they have to exist because

803
01:12:32,485 --> 01:12:36,297
most of them out there have this
good property.

804
01:12:36,297 --> 01:12:40,605
So I'm going to be able to find
for each one of these,

805
01:12:40,605 --> 01:12:44,168
I just test a few at random,
and I find one.

806
01:12:44,168 --> 01:12:47,068
Test a few at random,
find one, etc.

807
01:12:47,068 --> 01:12:50,548
Fill in my table there.
Because all that is

808
01:12:50,548 --> 01:12:53,945
pre-computation.
And I'm going to find them

809
01:12:53,945 --> 01:12:57,342
because the odds are good that
one exists.

810
01:12:57,342 --> 01:12:59,000
So --

811
01:13:13,000 --> 01:13:14,000
-- we just test a few at random.

812
01:13:24,000 --> 01:13:25,000
And we'll find one quickly --

813
01:13:32,000 --> 01:13:34,300
-- since at least half will
work.
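
[Editor's note: a minimal Python sketch of "test a few at random". It assumes distinct integer keys below a prime P and uses the family h(k) = ((a*k + b) mod P) mod m, which is one standard universal family and not necessarily the one used earlier in the lecture. The names random_hash and find_collision_free are made up for this illustration.]

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; keys are assumed to lie in [0, P)


def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random (assumed universal family)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m


def find_collision_free(keys, rng):
    """Retry random hash functions until the n keys occupy n distinct slots out of n^2."""
    m = len(keys) ** 2
    trials = 0
    while True:
        trials += 1
        h = random_hash(m, rng)
        # No collisions exactly when every key gets its own slot.
        if len({h(k) for k in keys}) == len(keys):
            return h, m, trials


rng = random.Random(6046)
keys = rng.sample(range(P), 50)  # 50 distinct keys
h, m, trials = find_collision_free(keys, rng)
print(f"collision-free into {m} slots after {trials} trial(s)")
```

Since each trial succeeds with probability at least one half, the expected number of trials in this loop is at most two.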

814
01:13:34,300 --> 01:13:37,679
I just want to show that there
exist good ones.

815
01:13:37,679 --> 01:13:41,777
All I have to prove is that at
least one works for each of

816
01:13:41,777 --> 01:13:44,366
these cases.
In fact, I've shown that

817
01:13:44,366 --> 01:13:46,954
there's a huge number that will
work.

818
01:13:46,954 --> 01:13:50,189
Half of them will work.
But to show it exists,

819
01:13:50,189 --> 01:13:54,647
I would just have to show that
the probability was greater than

820
01:13:54,647 --> 01:13:55,941
zero.
So to finish up,

821
01:13:55,941 --> 01:14:00,254
we need to still analyze the
storage because I promised in my

822
01:14:00,254 --> 01:14:05,000
theorem that the table would be
of size order n.

823
01:14:05,000 --> 01:14:12,702
And yet now I've said there's
all of these quadratic-sized

824
01:14:12,702 --> 01:14:18,378
slots here.
So I'm going to show that that's

825
01:14:18,378 --> 01:14:20,000
order n.

826
01:14:31,000 --> 01:14:35,605
So for level one,
that's easy.

827
01:14:35,605 --> 01:14:45,450
We'll just choose the number of
slots to be equal to the number

828
01:14:45,450 --> 01:14:51,008
of keys.
And that way the storage at

829
01:14:51,008 --> 01:14:59,583
level one is just order n.
And now let's let n sub i be

830
01:14:59,583 --> 01:15:08,000
the random variable for the
number of keys --

831
01:15:13,000 --> 01:15:21,712
-- that hash to slot i in T.
OK, so n sub i is just what

832
01:15:21,712 --> 01:15:28,683
we've called it.
Number of elements in that slot

833
01:15:28,683 --> 01:15:34,386
there.
And we're going to use m sub i

834
01:15:34,386 --> 01:15:45,000
equals n sub i squared slots in
each level two table S sub i.
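
[Editor's note: a rough Python sketch of the two-level layout just described, with n slots at level one and n_i^2 slots in each level-two table. It assumes at least one key, distinct integer keys below a prime P, and the family ((a*k + b) mod P) mod m as the universal family. The names build_two_level and lookup are hypothetical.]

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; keys assumed distinct integers in [0, P)


def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random (assumed universal family)."""
    a, b = rng.randrange(1, P), rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m


def build_two_level(keys, rng):
    n = len(keys)
    h1 = random_hash(n, rng)              # level one: as many slots as keys
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)
    level_two = []
    for bucket in buckets:
        m_i = len(bucket) ** 2            # m_i = n_i^2 slots for slot i
        while True:                       # test hash functions at random
            h2 = random_hash(m_i, rng) if m_i else None
            table = [None] * m_i
            ok = True
            for k in bucket:
                j = h2(k)
                if table[j] is not None:  # collision: discard this h2
                    ok = False
                    break
                table[j] = k
            if ok:
                break
        level_two.append((h2, table))
    return h1, level_two


def lookup(structure, k):
    """Two hash evaluations and one comparison, with no chaining."""
    h1, level_two = structure
    h2, table = level_two[h1(k)]
    return bool(table) and table[h2(k)] == k


rng = random.Random(6046)
keys = rng.sample(range(P), 200)
structure = build_two_level(keys, rng)
n = len(keys)
total_slots = n + sum(len(table) for _, table in structure[1])
print(f"n = {n}, total slots = {total_slots}")
```

The printed total is n plus the sum of the n_i squared, which is the quantity the storage analysis bounds in expectation.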

835
01:15:45,000 --> 01:15:47,000
So the expected total storage --

836
01:15:54,000 --> 01:16:01,085
-- is just n for level one,
order n if you want,

837
01:16:01,085 --> 01:16:09,979
but basically n slots for level
one plus the expected value,

838
01:16:09,979 --> 01:16:19,326
whatever I expect the sum of i
equals 0 to m minus one of theta

839
01:16:19,326 --> 01:16:24,000
of n sub i squared to be.

840
01:16:30,000 --> 01:16:36,048
Because I basically have to add
up the square for every element

841
01:16:36,048 --> 01:16:40,731
that applies here,
the square of what's in there.

842
01:16:40,731 --> 01:16:46,682
Who recognizes this summation?
Where have we seen that before?

843
01:16:46,682 --> 01:16:51,951
Who attends recitation?
Where have we seen this before?

844
01:16:51,951 --> 01:16:54,000
What's the --

845
01:17:03,000 --> 01:17:06,000
We're summing the expected
value of a bunch of --

846
01:17:11,000 --> 01:17:14,959
Yeah, what was that algorithm?
We did the sorting algorithm,

847
01:17:14,959 --> 01:17:17,375
right?
What was the sorting algorithm

848
01:17:17,375 --> 01:17:21,000
for which this was an important
thing to evaluate?

849
01:17:26,000 --> 01:17:29,272
Don't everybody shout it out at
once.

850
01:17:29,272 --> 01:17:33,000
What was that sorting algorithm
called?

851
01:17:33,000 --> 01:17:35,397
Bucket sort.
Good.

852
01:17:35,397 --> 01:17:37,794
Bucket sort.
Yeah.

853
01:17:37,794 --> 01:17:46,397
We showed that the sum of the
squares of random variables when

854
01:17:46,397 --> 01:17:53,025
they're falling randomly into n
bins is order n.

855
01:17:53,025 --> 01:17:55,000
Right?
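
[Editor's note: for readers without the bucket sort notes at hand, one direct way to see the order n bound is sketched below. It assumes the level-one function is drawn from a universal family with m = n slots, and it counts colliding pairs instead of repeating the bucket sort argument.]

```latex
\begin{align*}
\sum_{i=0}^{m-1} n_i^{2}
  &= \sum_{i=0}^{m-1} \left( n_i + 2\binom{n_i}{2} \right)
   = n + 2 \cdot (\text{number of colliding pairs at level one}), \\
\mathbb{E}\!\left[ \sum_{i=0}^{m-1} n_i^{2} \right]
  &\le n + 2 \binom{n}{2} \cdot \frac{1}{m}
   = n + (n - 1) = 2n - 1 = O(n) \qquad (m = n), \\
\Pr\!\left\{ \sum_{i=0}^{m-1} n_i^{2} \ge 4n \right\}
  &\le \frac{2n - 1}{4n} < \frac{1}{2}
   \qquad \text{(Markov's inequality)}.
\end{align*}
```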

856
01:18:16,000 --> 01:18:20,105
And you can also out of this
get a, as we did before,

857
01:18:20,105 --> 01:18:24,131
get a probability bound.
What's the probability that

858
01:18:24,131 --> 01:18:28,315
it's more than a certain amount
times n using Markov's

859
01:18:28,315 --> 01:18:31,394
inequality.
But this is the key thing is

860
01:18:31,394 --> 01:18:36,109
we've seen this analysis.
OK, we used it there in time,

861
01:18:36,109 --> 01:18:39,963
so there's a little bit,
but that's one of the reasons

862
01:18:39,963 --> 01:18:43,963
we study sorting at the
beginning of the term is because

863
01:18:43,963 --> 01:18:47,890
the techniques of sorting,
they just propagate into all

864
01:18:47,890 --> 01:18:52,327
these other areas of analysis.
You see a lot of the same kinds

865
01:18:52,327 --> 01:18:55,309
of things.
And so now that you know bucket

866
01:18:55,309 --> 01:18:59,018
sort clearly so well,
now you know that this without

867
01:18:59,018 --> 01:19:04,610
having to do any extra work.
So you might want to go back

868
01:19:04,610 --> 01:19:09,925
and review your bucket sort
analysis, because it's applied

869
01:19:09,925 --> 01:19:11,604
now.
Same analysis.

870
01:19:11,604 --> 01:19:12,909
Two places.
OK.

871
01:19:12,909 --> 01:19:18,411
Good. Recitation this Friday,
which will be a quiz review and

872
01:19:18,411 --> 01:19:22,794
we have a quiz next,
there's no class on Monday,

873
01:19:22,794 --> 01:19:26,151
but we have a quiz on next
Wednesday.

874
01:19:26,151 --> 01:19:31,000
OK, so good luck everybody on
the quiz.

875
01:19:31,000 --> 01:19:34,000
Make sure you get plenty of
sleep.