The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.

PROFESSOR: A trilogy, if you will, on hashing. We did a lot of cool hashing stuff. In some sense, we already have what we want with hashing. Hashing with chaining, we can do constant expected time-- I should say, constant as long as-- yeah. If we're doing insert, delete, and exact search. Is this key in there? If so, return the item. Otherwise, say no. And we do that with hashing with chaining. And the analysis we did was with simple uniform hashing. An alternative is to use universal hashing, which is not really in this class. But if you find this weird, then this is less weird.

And hashing with chaining, the idea was we had this giant universe of all keys-- could be actually all integers, so it's infinite. But then what we actually are storing in our structure is some finite set of n keys. Here, I'm labeling them k1 through k4; n is four. But in general, you don't know what they're going to be. We reduce that to a table of size m by this hash function h-- stuff drawn in red. And so here I have a three-way collision. These three keys all map to one, and so I store a linked list of k1, k4, and k2. They're in no particular order. That's the point of that picture. Here k3 happens to map to its own slot. And the other slots are empty, so they just have a null saying there's an empty linked list there.

Total size of this structure is n plus m. There's m to store the table. There's n because the sum of the lengths of all the lists is going to be n. And then we said the expected chain length-- if everything's uniform, then the probability of a particular key going to a particular slot is 1/m.
And if everything's nice and independent, or if you use universal hashing, you can show that the expected chain length is n/m-- n independent trials, each with probability 1/m of falling here. And we call that alpha, the load factor. And we concluded that the operation time to do an insert, delete, or search was order 1 plus alpha. So that's in expectation.

So that was hashing with chaining. This is good news. As long as alpha is a constant, we get constant time. And just for recollection, today we're not really going to be thinking too much about what the hash function is, but just remember two of them I talked about. This one we actually will use today, where you just take the key and take it modulo m. That's one easy way of mapping all integers into the space zero through m minus 1. That's called the division method. The multiplication method is more fancy. You multiply by a random integer, and then you look at the middle of that multiplication. And that's where lots of copies of the key k get mixed up together, and that's sort of where the name "hashing" comes from. And that's a better hash function in the real world.

So that's hashing with chaining. Cool? Now, it seemed like a complete picture, but there's one crucial thing that we're missing here. Any suggestions? If I went to implement this data structure, what don't I know how to do? And one answer could be the hash function, but we're going to ignore that. I know you know the answer. Does anyone else know the answer? Yeah.

AUDIENCE: Grow the table.

PROFESSOR: Grow the table. Yeah. The question is, what should m be? OK, we have to create a table of size m, and we put our keys into it. We know we'd like m to be about the same as n. But the trouble is we don't really know n, because insertions come along, and then we might have to grow the table.
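To make the recap concrete, here is a minimal sketch of hashing with chaining using the division method. The class name and details are illustrative, not the staff code, and the table size m is fixed at construction, which is exactly the gap the rest of the lecture fills in.

    class ChainedHashTable:
        """Minimal sketch of hashing with chaining using the division method.
        m is fixed here; choosing and resizing m is the topic that follows."""
        def __init__(self, m=8):
            self.m = m
            self.slots = [[] for _ in range(m)]    # one (possibly empty) chain per slot

        def _h(self, key):
            return key % self.m                    # division method: h(k) = k mod m

        def insert(self, key, value):
            chain = self.slots[self._h(key)]
            for i, (k, _) in enumerate(chain):     # replace if the key is already present
                if k == key:
                    chain[i] = (key, value)
                    return
            chain.append((key, value))

        def search(self, key):
            for k, v in self.slots[self._h(key)]:  # scan one chain: expected length alpha = n/m
                if k == key:
                    return v
            return None

        def delete(self, key):
            chain = self.slots[self._h(key)]
            self.slots[self._h(key)] = [(k, v) for (k, v) in chain if k != key]

For the multiplication method mentioned above, _h would instead look something like ((a * key) % 2**w) >> (w - r) for a fixed random odd w-bit integer a and a table size m = 2**r; that detail isn't needed for anything below.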
If n gets really big relative to m, we're in trouble, because this factor will go up and it will no longer be constant time. On the other hand, if we set m to be really big, we're also kind of wasteful. The whole point of this structure was to avoid having one slot for every possible key, because that was giant. We want to save space. So we want m to be big enough that our structure is fast, but small enough that it's not wasteful in space. And so that's the remaining question.

We want m to be Theta of n. We want it to be Omega of n-- so we want it to be at least some constant times n, in order to make alpha be a constant. And we want it to be big O of n in order to make the space linear. And the way we're going to do this, as was suggested, is to grow the table. We're going to start with m equals some constant. Pick your favorite constant. That's 20. My favorite constant's 7. Probably want it to be a power of two, but what the hell? And then we're going to grow and shrink as necessary.

This is a pretty obvious idea. The interesting part is to get it to work. And it's going to introduce a whole new concept, which is amortization. So it's going to be cool. Trust me. Not only are we going to solve this problem of how to choose m, we're also going to figure out how the Python data structure called list, also known as array, is implemented. It's exactly the same problem. I'll get to that in a moment.

So for example, let's say that we-- I said m should be Theta of n. Let's say we want m to be at least n at all times. So what happens? We start with m equals 8. And so, let's say we start with an empty hash table, an empty dictionary. And then I insert eight things. And then I go to insert the ninth thing. And I say, oh, now n is bigger than m. What should I do? So this would be like at the end of an insertion algorithm.
After I insert something, I say, oh, if n is greater than m, then I'm getting worried that n is getting too big relative to m. So I'd like to grow the table. OK? Let's take a little diversion to what "grow a table" means.

So maybe I have current size m and I'd like to go to a new size, m prime. This would actually work if you're growing or shrinking, so m could be bigger or smaller than m prime. What should I do-- what do I need to do in order to build a new table of this size? Easy warm up. Yeah?

AUDIENCE: Allocate the memory and then rehash [INAUDIBLE].

PROFESSOR: Yeah. Allocate the memory and rehash. So we have all these keys. They're stored with some hash function in here, in a table of size m. I need to build an entirely new table of size m prime, and then I need to rehash everything. One way to think of this is: for each item in the old table, insert into the new table, T prime. I think that's worth a cushion. You got one? You don't want to get hit. It's fine. We're not burning through these questions fast enough, so answer more questions.

OK. So how much time does this take? That's the main point of this exercise. Yeah?

AUDIENCE: Order n.

PROFESSOR: Order n. Yeah, I think as long as m and m prime are Theta of n, this is order n. In general, it's going to be n plus m plus m prime, but you're right-- most of the time, in the situation we're going to construct, this will be Theta of n. But in general, there's this issue that, for example, to iterate over every item in the table you have to look at every slot. And so you have to pay order m just to visit every slot, order n to visit all those lists, and order m prime just to build the new table of size m prime and initialize it all to nil. Good.
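As a sketch of that rebuild step, building on the chaining table above (the function name and the lambda for the new hash function are illustrative):

    def rebuild(old_slots, m_prime):
        """Grow or shrink a chaining hash table to m_prime slots.
        Cost: Theta(m) to scan the old slots, Theta(n) to rehash all items,
        Theta(m') to allocate and initialize the new table."""
        h_prime = lambda key: key % m_prime          # new hash function for the new size
        new_slots = [[] for _ in range(m_prime)]     # m' work: initialize every slot to empty
        for chain in old_slots:                      # m slots to visit
            for key, value in chain:                 # n items total across all chains
                new_slots[h_prime(key)].append((key, value))
        return new_slots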
I guess another main point here is that we have to build a new hash function. Why do we need to build a new hash function? Because the hash function-- why did I call it f prime? Calling it h prime. The hash function is all about mapping the universe of keys to a table of size m. So if m changes, we definitely need a new hash function. If you use the old hash function, you would just use the beginning of the table. If you add more slots down here, you're not going to use them. For every key you've got to rehash it, figure out where it goes. I think I've drilled that home enough times.

So the question becomes: when we see that our table is too small, we need to make it bigger. But how much bigger? Suggestions? Yeah?

AUDIENCE: 2x.

PROFESSOR: 2x. Twice m. Good suggestion. Any other suggestions? 3x? OK. m prime equals 2m is the correct answer. But for fun, or for pain I guess, let's think about the wrong answer, which would be: just make it one bigger. That'll make m equal to n again, so it's at least safe. It will maintain my invariant that m is at least n. I wrote this the wrong way-- sorry, that's the wrong way around. The trigger is n greater than m; I want m to be greater than or equal to n.

So if we just incremented our table size, then the question becomes, what is the cost of n insertions? So say we start with an empty table-- it has size eight or whatever, some constant-- and we insert n times. Then after eight insertions, when we insert we have to rebuild our entire table. That takes linear time. After we insert one more, we have to rebuild. That takes linear time. And so the cost is going to be something like, after you get to 8, it's going to be 1 plus 2 plus 3 plus 4, and so on-- a triangular number. Every time we insert, we have to rebuild everything. So this is quadratic. This is bad.
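Spelled out, the sum that triangular number comes from, starting from a constant-size table (say 8) and rebuilding on every insert after that:

    % Each insert past the initial constant size rebuilds the whole table:
    \Theta(8) + \Theta(9) + \Theta(10) + \cdots + \Theta(n)
      \;=\; \Theta\!\Big(\sum_{i=1}^{n} i\Big) \;=\; \Theta(n^2)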
Fortunately, if all we do is double m, we're golden. And this is sort of the point of why it's called table-- I called it table resizing there, to not give it away, but this is a technique called table doubling.

And let's just think of the cost of n insertions. There's also deletions, but if we just, again, start with an empty table and we repeatedly insert, then the cost we get-- if we double each time and we're inserting-- after we get to 8, we insert, we double to 16. Then we insert eight more times, then we double to 32. Then we insert 16 times, then we double to 64. All these numbers are roughly the same. They're within a factor of two of each other. Every time we're rebuilding in linear time, but we're only doing it like log n times. If we're going from 1 to n, there are log n doublings that we have to do. So you might think, oh, it's n log n. But we don't want n log n. That would be binary search trees. We want to do better than n log n.

If you think about the costs here, the cost to rebuild the first time is constant, like 8. And then the cost to rebuild the second time is 16, so twice that. The cost to rebuild the next time is 32, then 64. So these go up geometrically. You've got to get from 1 to n with log n steps. The natural way to do it is by doubling, and you can prove that indeed this is the case. So this is a geometric series-- didn't mean to cross it out there-- and so this is Theta of n.

Now, it's a little strange to be talking about Theta of n. This is a data structure that's supposed to be constant time per operation. This data structure is not constant time per operation. Even ignoring all the hashing business, all you're trying to do is grow a table, and it takes more than constant time for some operations. Near the end, when you have to rebuild the last time, you're restructuring the entire table. That takes linear time for one operation. You might say that's bad.
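To pin down the geometric series from a moment ago, for n inserts with doubling the rebuilds happen only at sizes 8, 16, 32, and so on up to about n:

    % Total rebuild cost under table doubling:
    \Theta(8 + 16 + 32 + \cdots + n) \;=\; \Theta(2n) \;=\; \Theta(n)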
But the comforting thing is that there are only a few operations-- log n of them-- that are really expensive. The rest are all constant time. You don't do anything; you just add into the table. So this is an idea we call amortization. Maybe I should write here-- we call this table doubling.

So the idea with amortization-- let me give you a definition. Actually, I'm going to be a little bit vague here and just say T of n. Let me see what it says in my notes. Yeah, I say T of n. So we're going to use a concept of-- usually we say the running time is T of n. And we started saying the expected running time is some T of n, plus alpha or whatever. Now, we're going to be able to say the amortized running time is T of n, or the running time is T of n amortized. That's what this is saying. And what that means is that it's not any statement about the individual running time of the operations. It's saying if you do a whole bunch of operations, k of them, then the total running time is, at most, k times T of n.

This is a way to amortize-- this is in the economic sense of amortize, I guess. You spread out the high costs so that it's cheap on average all the time. It's kind of like-- normally, we pay rent every month. But you could think of it instead as you're only paying $50 a day or something for your monthly rent. If you want to smooth things out, that would be a nice way to think about paying rent, or every second you're paying a penny or something. It's close, actually. Little bit off, factor of two. Anyway, so that's the idea.

So you can think of this as kind of like saying that the running time of an operation is T of n "on average"-- but put that in quotes; we don't usually use that terminology. Maybe put a tilde here. Where the average is taken over all the operations.
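Written as a formula (the t_i notation is just for this note, not from the board): an operation runs in T(n) amortized time when, for any sequence of k operations with actual costs t_1 through t_k,

    % total actual cost of any k operations is at most k * T(n)
    \sum_{i=1}^{k} t_i \;\le\; k \cdot T(n)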
So this is something that only makes sense for data structures. Data structures are things that have lots of operations on them over time. And instead of counting individual operation times and then adding them up, if you add them up and then divide by the number of operations, that's your amortized running time.

So the point is, in table doubling, the amortized running time is Theta of 1, because it's Theta of n in total-- at this point we've only analyzed insertions; we haven't talked about deletions. So k inserts, if we're just doing insertions, take Theta of k time in total. So this means constant amortized per insert.

OK, it's a simple idea, but a useful one, because typically-- unless you're in like a real-time system-- you typically only care about the overall running time of your algorithm, which might use a data structure as a subroutine. You don't care if individual operations are expensive as long as all the operations together are cheap. You're using hashing to solve some other problem, like counting duplicate words in doc dist. You just care about the running time of counting duplicate words. You don't care about how long each step of the for loop takes, just the aggregate. So this is good most of the time.

And we've proved it for insertions. It's also true when you have deletions. If you have k inserts and deletes, they certainly take order k time. Actually, this is easy to prove at this point because we haven't changed delete. What delete does is it just deletes something from the table and leaves the table the same size. And so it actually makes life better for us, because it decreases n, so in order to make n big again, you have to do more insertions than you would have had to before. And the only extra cost we're thinking about here is the growing-- the rebuild cost from n getting too big. And so this is still true. Deletions only help us. If you have k total inserts and deletes, it's still order k.
383 00:21:19,022 --> 00:21:20,355 So still get constant amortized. 384 00:21:23,956 --> 00:21:26,980 But this is not totally satisfying 385 00:21:26,980 --> 00:21:30,330 because of table might get big again. 386 00:21:30,330 --> 00:21:32,800 m might become much larger than n. 387 00:21:32,800 --> 00:21:35,710 For example, suppose I do n inserts 388 00:21:35,710 --> 00:21:37,820 and then I do n deletes. 389 00:21:37,820 --> 00:21:41,070 So now I have an empty table, n equals 0, 390 00:21:41,070 --> 00:21:44,990 but m is going to be around the original value of n, 391 00:21:44,990 --> 00:21:47,280 or the maximum value of n over time. 392 00:21:50,050 --> 00:21:54,710 So we can fix that. 393 00:21:54,710 --> 00:21:56,160 Suggestions on how to fix that? 394 00:22:00,860 --> 00:22:03,040 This is a little more subtle. 395 00:22:03,040 --> 00:22:04,550 There's two obvious answers. 396 00:22:04,550 --> 00:22:08,460 One is correct and the other is incorrect. 397 00:22:08,460 --> 00:22:09,025 Yeah? 398 00:22:09,025 --> 00:22:09,900 AUDIENCE: [INAUDIBLE] 399 00:22:14,220 --> 00:22:15,660 PROFESSOR: Good. 400 00:22:15,660 --> 00:22:23,440 So option one is if the table becomes half the size, 401 00:22:23,440 --> 00:22:30,980 then shrink-- to half the size? 402 00:22:30,980 --> 00:22:31,480 Sure. 403 00:22:37,390 --> 00:22:38,507 OK. 404 00:22:38,507 --> 00:22:39,590 That's on the right track. 405 00:22:39,590 --> 00:22:42,288 Anyone see a problem with that? 406 00:22:42,288 --> 00:22:43,240 Yeah? 407 00:22:43,240 --> 00:22:45,790 AUDIENCE: [INAUDIBLE] when you're going from like 8 to 9, 408 00:22:45,790 --> 00:22:47,623 you can go from 8 to 9, 9 to 8, [INAUDIBLE]. 409 00:22:47,623 --> 00:22:48,710 PROFESSOR: Good. 410 00:22:48,710 --> 00:22:57,150 So if you're sizing and say you have eight items in your table, 411 00:22:57,150 --> 00:23:01,390 you add a ninth item and so you double to 16. 412 00:23:01,390 --> 00:23:03,820 Then you delete that ninth item, you're back to eight. 413 00:23:03,820 --> 00:23:06,440 And then you say oh, now m equals n/2, 414 00:23:06,440 --> 00:23:08,620 so I'm going to shrink to half the size. 415 00:23:08,620 --> 00:23:10,960 And if I insert again-- delete, insert, delete, 416 00:23:10,960 --> 00:23:15,014 insert-- I spend linear time for every operation. 417 00:23:15,014 --> 00:23:15,930 So that's the problem. 418 00:23:18,810 --> 00:23:20,133 This is slow. 419 00:23:22,690 --> 00:23:27,830 If we go from 2 to the k to 2 to the k plus 1, 420 00:23:27,830 --> 00:23:31,950 we go this way via-- oh sorry, 2 to the k plus 1. 421 00:23:31,950 --> 00:23:36,010 Then, I said it right, insert to go to the right, 422 00:23:36,010 --> 00:23:37,560 delete to go to the left. 423 00:23:37,560 --> 00:23:39,630 Then we'll get linear time for operation. 424 00:23:44,550 --> 00:23:46,820 That is that. 425 00:23:46,820 --> 00:23:48,948 So, how do we fix this? 426 00:23:48,948 --> 00:23:50,310 Yeah. 427 00:23:50,310 --> 00:23:52,580 AUDIENCE: Maybe m equal m/3 or something? 428 00:23:52,580 --> 00:23:53,910 PROFESSOR: M equals n over 3. 429 00:23:53,910 --> 00:23:54,783 Yep. 430 00:23:54,783 --> 00:23:56,699 AUDIENCE: And then still leave it [INAUDIBLE]. 431 00:24:04,926 --> 00:24:05,819 PROFESSOR: Good. 432 00:24:05,819 --> 00:24:07,360 I'm going to do 4, if you don't mind. 433 00:24:07,360 --> 00:24:08,680 I'll keep it powers of 2. 434 00:24:08,680 --> 00:24:10,240 Any number bigger than 3 will work-- 435 00:24:10,240 --> 00:24:13,970 or any number bigger than 2 will work here. 
But it's kind of nice to stick to powers of two, just for fun. I mean, it doesn't really matter, because, as you say, we're still going to shrink to half the size, but we're only going to trigger it when we are 3/4 empty-- we're only using a quarter of the space. Then, it turns out, you can afford to shrink to half the size, because in order to need to grow again, you still have to insert another half-a-table's worth of items, because the new table is only half full. So when you're only a quarter full, you shrink so that you become half full, because then to grow again requires a lot of insertions.

I haven't proved anything here, but it turns out if you do this, the amortized time becomes constant. For k insertions and deletions, in an arbitrary combination, you'll maintain linear size because of these two rules-- because you're maintaining the invariant that m is between n and 4n. You maintain that invariant; that's easy to check. So you always have linear size. And the amortized running time becomes constant. We don't really have time to prove that in this class. It's a little bit tricky. Read the textbook if you want to know it.
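Here is a small sketch of the resulting policy, written as a resizable array rather than a hash table so the resizing logic stands on its own. The class name and the initial size 8 are illustrative: grow to 2m when full, shrink to m/2 only when down to m/4 items, which keeps m between n and 4n.

    class ResizableArray:
        """Sketch of table doubling with the shrink-at-a-quarter rule."""
        def __init__(self):
            self.m = 8                        # initial capacity: any constant works
            self.n = 0                        # number of items currently stored
            self.data = [None] * self.m

        def _resize(self, m_prime):
            # Theta(n + m + m') work: copy the items into a fresh table of size m'
            items = self.data[:self.n]
            self.m = m_prime
            self.data = items + [None] * (m_prime - self.n)

        def append(self, item):
            if self.n == self.m:              # full: double
                self._resize(2 * self.m)
            self.data[self.n] = item
            self.n += 1

        def pop(self):
            self.n -= 1
            item, self.data[self.n] = self.data[self.n], None
            if self.m > 8 and self.n <= self.m // 4:   # only a quarter full: halve
                self._resize(self.m // 2)
            return item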
That's table doubling. Questions? All right. Boring. No. It's cool, because not only can we solve the hashing problem of how we set m in order to keep alpha a constant, we can also solve Python lists. Python lists are also known as resizable arrays. You may have wondered how they work. Because they offer random access, we can go to the ith item in constant time and modify it or get the value. We can add a new item at the end in constant time-- that's append, list.append. And we can delete the last item in constant time. One version is list.pop. It's also del list[-1]. You should know that deleting the first item is not constant time.
That takes linear time, because what it does is copy all the values over. Python lists are implemented by arrays. But how do you support this dynamic-ness, where you can increase the length and decrease the length, and still keep linear space? Well, you do table doubling. And I don't know whether Python uses two or some other constant, but any constant will do, as long as the deletion constant is smaller than the insertion constant. And that's how they work. So in fact, list.append and list.pop are constant amortized. Before, we just said for simplicity they're constant time, and for the most part you can just think of them as constant time. But in reality, they are constant amortized.

Now for fun, just in case you're curious, you can do all of this stuff in constant worst-case time per operation. Maybe a fun exercise. Do you want to know how? Yeah? The rough idea is: when you realize that you're getting kind of full, you start building on the side a new table of twice the size. And every time you insert into the actual table, you move like five of the items over to the new table, or some constant-- it needs to be a big enough constant-- so that by the time you're full, you just switch over immediately to the other structure. It's kind of cool. It's very tricky to actually get that to work. But if you're in a real-time system, you might care to know that. For the most part, people don't implement those things because they're complicated, but it is possible to get rid of all these amortized bounds.

Cool. Let's move on to the next topic, which is more hashing related. This was sort of general data structures in order to implement hashing with chaining, but it didn't really care about hashing per se. We assumed here that we can evaluate the hash function in constant time and that we can do insertion in constant time, but that's the name of the game here.
But otherwise, we didn't really care-- as long as the rebuilding was linear time, this technique works.

Now we're going to look at a new problem that has lots of practical applications. I mentioned some of these problems in the last class: string matching. This is essentially the problem. How many people have used grep in their life? OK, most of you. How many people have used Find in a text editor? OK, the rest of you. And so these are the same sorts of problems. You want to search for a pattern, which is just going to be a substring, in some giant string which is your document, your file, if you will.

To state this formally: given two strings, s and t, you want to know, does s occur as a substring of t? So for example, maybe s is the string "6006" and t is your entire-- the mail that you've ever received in your life, or your inbox, or something. So t is big, typically, and s is small. It's what you type, usually. Maybe you're searching for all email from Piazza, so you put in the Piazza "From" string or whatever. You're searching for that in this giant thing and you'd like to do that quickly.

Another application: s is what you type into Google, and t is the entire web. That's what Google does. It searches for the string in the entire web. I'm not joking. OK? Fine. So we'd like to do that. What's the obvious way to search for a substring in a giant string? Yeah?

AUDIENCE: Check each substring of that length.

PROFESSOR: Just check each substring of the right length. So it's got to be the length of s. And there's only a linear number of them, so check each one. Let's analyze that. So, a simple algorithm-- actually, just for fun, I have pseudocode for it. I have Python code for it. Even more cool. OK. I don't know if you know all these Python features, but you should. They're super cool.
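The code being described is, up to variable names, something like this sketch (a reconstruction, not the slide itself):

    def naive_match(s, t):
        """Check every shift of s against t. Each comparison s == t[i:i+len(s)]
        walks the characters until a mismatch, so the total cost is O(|s| * |t|)."""
        return any(s == t[i:i + len(s)] for i in range(len(t) - len(s) + 1))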
This is string slicing. So we're looking in t-- let me draw the picture. Here we have s, here we have t. Think of it as a big string. We'd like to compare s like that, and then we'd like to compare s shifted over one, to see whether all of the characters match there. And then shifted over one more, and so on. And so we're looking at a substring of t from position i to position i plus the length of s, not including the last one. So that's of length exactly the length of s. This is s; this is t. And so each of these looks like that pattern. We compare s to t.

What this comparison operation does in Python is it checks the first characters to see if they're equal. If they are, keep going until they find a mismatch. If there's no mismatch, then you return true. Otherwise, you return false. And then we do this roughly length-of-t times, because that's how many shifts there are, except at the end we run out of room. We don't care if we shift beyond the right, because that's clearly not going to match. And so it's actually length of t minus length of s-- that's the number of iterations. Hopefully I got all the index arithmetic right, and there's no plus ones or minus ones. I think this is correct. We want to know whether any of these match. If so, the answer is yes, s occurs as a substring of t. Of course, in reality you want to know not just "do any match," but "show them to me," things like that. But you can change that. Same amount of time.

So what's the running time of this algorithm? My relevant things are the length of s and the length of t. What's the running time?

AUDIENCE: [INAUDIBLE]

PROFESSOR: Sorry?

AUDIENCE: [INAUDIBLE]

PROFESSOR: t multiplied by s, yeah. Exactly. Technically, it's length of s times, length of t minus length of s. But typically, this is just s times t.
And it's always at most s times t, and it's usually the same thing, because s is usually smaller than t by at least a constant factor. This is kind of slow. If you're searching for a big string, it's not so great. I mean, certainly you need s plus t-- you've got to look at the strings. But s times t is kind of-- it could be quadratic, if you're searching for a really long string in another string. So what we'd like to do today is use hashing to get this down to linear time. So, ideas? How could we do that? Using hashing. Subtle hint. Yeah?

AUDIENCE: If we take something into account [INAUDIBLE].

PROFESSOR: OK, so you want to decompose your string into words and use the fact that there are fewer words than characters. You could probably get something out of that, and old search engines used to do that. But it's not necessary, it turns out. And it would also depend on what your average word length is. In the end, today, we're not going to analyze it fully, but we are going to get an algorithm that runs in this time guaranteed-- in expectation, because of a randomized-- yeah?

AUDIENCE: If we were to hash [INAUDIBLE] size s, that would [INAUDIBLE] and then we would check the hash [INAUDIBLE].

PROFESSOR: Good. So the idea is-- what we're looking at is a rolling window of t, always of size equal to the length of s. And at each time we want to know, is it the same as s? Now, it's expensive to check whether a string is equal to a string. There's no way of getting around that. Well, there are ways, but there isn't a way for just given two strings. But if somehow, instead of checking the strings, we could check a hash function of the strings-- because strings are big, potentially. We don't know how big s is. And so the universe of strings of length s is potentially very big. It's expensive to compare things.
If we could just hash it down to some reasonable size, to something that fits in a word, then we can compare whether those two words are equal-- whether those two hash values are equal, whether there's a collision in the table. That would make things go faster. We could do that in constant time per operation. How could we do that? That's the tricky part, but that is exactly the right idea.

So-- make some space. I think I'm going to do things a little out of order from what I have in my notes, and tell you about something called rolling hashes. And then we'll see how they're used. So shelve that idea. We're going to come back to it. We need a data structure to help us do this. Because if we just compute the hash function of this thing, compare it to the hash function of this thing, and then compute the hash function of the shifted value of t and compare it-- we don't have to recompute the hash of s; that's going to be free once you do it once. But computing the hash function of this, and then the hash function of this, and the hash function of this-- usually computing each of those hash functions would take length-of-s time. And so we're not saving any time. Somehow, if we have the hash function of this, the first substring of length s, we'd like to very quickly compute the hash function of the next substring, in constant time. Yeah?

AUDIENCE: You already have, like, s minus 1 of the characters of the--

PROFESSOR: Yeah. If you look at this window of t and the next window of t, they share s minus 1 of the characters. Just one character different. The first one gets deleted, the last character gets added. So here's what we want. Given a hash value-- maybe I should call this r. It's not the hash function; call it a rolling hash. You might say, I'd like to be able to append a character. I should say, r maintains a string. There's some string, let's call it x.
And what r.append(c) does is add character c to the end of x. And then we also want an operation which is-- you might call it popleft in Python. I'm going to call it skip. Shorter. Delete the first character of x, assuming it's c. So we can do this, because over here, what we want to do is add this character, which is like t of length of s, and we want to delete this character from the front, which is t of 0. Then we will get the next string. And at all times, r-- what's the point of this r? You can say r, open paren, close paren-- this will give you a hash value of the current string. So this is basically h of x for some hash function h, some reasonable hash function. If we could do this, and we could do each of these operations in constant time, then we can do string matching. Let me tell you how. This is called the Karp-Rabin string matching algorithm. And if it's not clear exactly what's allowed here, you'll see it as we use it.

First thing I'd like to do is compute the hash function of s. I only need to do that once, so I'll do it. In this data structure, the only thing you're allowed to do is add characters. Initially you have an empty string. And so for each character in s I'll just append it, and now rs gives me a hash value of s. OK? Now, I'd like to get started and compute the hash function of the first length-of-s characters of t. So this would be t up to length of s. And I'm going to call this thing rt-- that's my rolling hash for t-- and append those characters. So now rs is a rolling hash of s; rt is a rolling hash of the first length-of-s characters in t. So I should check whether they're equal. If they're not, shift over by one: add one character at the end, delete a character from the beginning. I'm going to have to do this many times. So I guess technically, I need to check whether these are equal first.
821 00:45:52,760 --> 00:46:02,570 In fact, the proper thing would be that you pay length of s plus length of t, 822 00:46:02,570 --> 00:46:07,370 and then you also pay-- for each match that you want to report, 823 00:46:07,370 --> 00:46:08,730 you pay length of s. 824 00:46:10,947 --> 00:46:13,030 I'm not sure whether you can get rid of that term. 825 00:46:13,030 --> 00:46:15,196 But in particular, if you just care about one match, 826 00:46:15,196 --> 00:46:16,080 this is linear time. 827 00:46:19,167 --> 00:46:20,250 It's pretty cool. 828 00:46:23,670 --> 00:46:25,280 There's one remaining question, which 829 00:46:25,280 --> 00:46:28,400 is how do you build this data structure? 830 00:46:28,400 --> 00:46:30,700 Is the algorithm clear though? 831 00:46:30,700 --> 00:46:32,270 I mean, I wrote it out in gory detail 832 00:46:32,270 --> 00:46:33,895 so you can really see what's happening, 833 00:46:33,895 --> 00:46:36,070 also because you need to do it in your problem set, 834 00:46:36,070 --> 00:46:39,660 so I give you as much code to work from as possible. 835 00:46:39,660 --> 00:46:40,800 Question? 836 00:46:40,800 --> 00:46:42,300 AUDIENCE: What is rs? 837 00:46:42,300 --> 00:46:49,100 PROFESSOR: rs is going to represent a hash value of s. 838 00:46:49,100 --> 00:46:50,770 You could just say h of s. 839 00:46:50,770 --> 00:46:54,090 But what I like to show is that all you need 840 00:46:54,090 --> 00:46:55,644 are these operations. 841 00:46:55,644 --> 00:46:57,060 And so given a data structure that 842 00:46:57,060 --> 00:47:01,620 will compute a hash function, given the append operation, 843 00:47:01,620 --> 00:47:06,330 what I did up here was just append every letter of s 844 00:47:06,330 --> 00:47:08,970 into this thing, and then rs open paren, 845 00:47:08,970 --> 00:47:11,673 close paren gives me the hash function of s. 846 00:47:11,673 --> 00:47:13,714 AUDIENCE: You said you can do r.append over here, 847 00:47:13,714 --> 00:47:15,690 but then you said rs-- 848 00:47:15,690 --> 00:47:16,580 PROFESSOR: Yeah. 849 00:47:16,580 --> 00:47:18,140 So there are two rolling hashes. 850 00:47:18,140 --> 00:47:22,500 One's called rs and one's called rt. 851 00:47:22,500 --> 00:47:26,040 This was an ADT, and I didn't say it at the beginning-- in line 852 00:47:26,040 --> 00:47:28,640 one I should say rs equals a new rolling hash, rt equals 853 00:47:28,640 --> 00:47:29,890 a new rolling hash. 854 00:47:29,890 --> 00:47:32,574 Sorry, I should bind my variables. 855 00:47:32,574 --> 00:47:33,990 So I'm using two of them because I 856 00:47:33,990 --> 00:47:36,700 want to compare their values, like this. 857 00:47:39,390 --> 00:47:42,090 Other questions? 858 00:47:42,090 --> 00:47:43,450 It's actually a pretty big idea. 859 00:47:43,450 --> 00:47:48,570 This is an algorithm from the late 1980s, so it's fairly recent. 860 00:47:51,769 --> 00:47:54,290 And it's one of the first examples 861 00:47:54,290 --> 00:47:58,120 of really using randomization in a super cool way, other 862 00:47:58,120 --> 00:48:00,070 than just hashing as a data structure. 863 00:48:04,240 --> 00:48:04,790 All right. 864 00:48:04,790 --> 00:48:08,400 So the remaining thing to do is figure out 865 00:48:08,400 --> 00:48:09,932 how to build this ADT. 866 00:48:09,932 --> 00:48:11,640 What's the data structure that implements 867 00:48:11,640 --> 00:48:15,015 this, spending constant time for each of these operations?
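To make the answer to that question concrete, here is a short hypothetical usage snippet, assuming the NaiveRollingHash sketch from earlier: two independent rolling hashes are bound up front, and the matching loop just compares their current values.

rs = NaiveRollingHash()       # one rolling hash for the pattern s
rt = NaiveRollingHash()       # a second, independent one for the sliding window of t
for c in "abc":
    rs.append(c)
for c in "xab":
    rt.append(c)
print(rs() == rt())           # False: the two objects currently hold different strings
rt.skip("x")
rt.append("c")                # rt's string is now "abc", the same as rs's
print(rs() == rt())           # True here; in general, equal hashes only suggest a match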
868 00:48:24,370 --> 00:48:25,820 Now, to tell you the truth, doing 869 00:48:25,820 --> 00:48:28,750 it depends on which hashing method you use, which hash 870 00:48:28,750 --> 00:48:30,916 function you want to use. 871 00:48:30,916 --> 00:48:32,540 I just erased the multiplication method 872 00:48:32,540 --> 00:48:34,748 because it's a pain to use the multiplication method. 873 00:48:40,360 --> 00:48:42,900 Though I'll bet you could use it, actually. 874 00:48:42,900 --> 00:48:45,251 That's an exercise for you to think about. 875 00:48:45,251 --> 00:48:46,750 I'm going to use the division method 876 00:48:46,750 --> 00:48:48,660 because it's the simplest hash function. 877 00:48:48,660 --> 00:48:50,790 And it turns out, in this setting it does work. 878 00:48:50,790 --> 00:48:53,760 We're not going to prove that this is true. 879 00:48:53,760 --> 00:48:56,860 This is going to be true in expectation. 880 00:48:56,860 --> 00:48:57,635 Expected time. 881 00:49:02,110 --> 00:49:06,270 But Karp and Rabin proved that this running time 882 00:49:06,270 --> 00:49:09,040 holds, even if you just use a simple hash 883 00:49:09,040 --> 00:49:11,570 function, the division method, where 884 00:49:11,570 --> 00:49:13,755 m is chosen to be a random prime. 885 00:49:18,800 --> 00:49:22,170 Let's say about as big as-- let's say at least as 886 00:49:22,170 --> 00:49:26,050 big as the length of s. 887 00:49:26,050 --> 00:49:28,340 The bigger you make it, the higher probability this 888 00:49:28,340 --> 00:49:29,640 is going to be true. 889 00:49:29,640 --> 00:49:34,484 But length of s will give you this on average. 890 00:49:34,484 --> 00:49:36,400 So we're not going to talk about in this class 891 00:49:36,400 --> 00:49:39,620 how to find a random prime, but the algorithm 892 00:49:39,620 --> 00:49:42,650 is choose a random number of about the right size 893 00:49:42,650 --> 00:49:44,030 and check whether it's prime. 894 00:49:44,030 --> 00:49:46,030 If it's not, do it again. 895 00:49:46,030 --> 00:49:50,100 And by the prime number theorem, after about log n trials 896 00:49:50,100 --> 00:49:51,779 you will find a prime. 897 00:49:51,779 --> 00:49:53,320 And we're not going to talk about how 898 00:49:53,320 --> 00:49:57,870 to check whether a number's prime, but it can be done. 899 00:49:57,870 --> 00:49:58,480 All right. 900 00:49:58,480 --> 00:50:02,220 So we're basically done. 901 00:50:02,220 --> 00:50:10,630 The point is to look at-- if you look at an append operation 902 00:50:10,630 --> 00:50:15,380 and you think about how this hash function changes 903 00:50:15,380 --> 00:50:17,350 when you add a single character. 904 00:50:17,350 --> 00:50:20,470 Oh, I should tell you. 905 00:50:20,470 --> 00:50:25,950 We're going to treat the string x as a multi-digit number. 906 00:50:29,650 --> 00:50:31,220 This is the sort of prehash function. 907 00:50:36,480 --> 00:50:39,365 And the base is the size of your alphabet. 908 00:50:42,750 --> 00:50:45,920 So if you're using ASCII, it's 256. 909 00:50:45,920 --> 00:50:48,860 If you're using Unicode, it might be larger. 910 00:50:48,860 --> 00:50:52,750 But whatever the size of your characters in your string, 911 00:50:52,750 --> 00:50:56,950 then when I add a character, this is like taking my number, 912 00:50:56,950 --> 00:51:00,660 shifting it over by one, and then adding a new value. 913 00:51:00,660 --> 00:51:02,390 So how do I shift over by one? 914 00:51:02,390 --> 00:51:04,630 I multiply by a.
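The prime-finding procedure just described is not covered in this class, but a sketch of it might look like the following. The Miller-Rabin test used here is a standard probabilistic primality check; the function names are illustrative, not from the lecture.

import random

def is_probably_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:            # write n - 1 = d * 2**r with d odd
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False         # a witnesses that n is composite
    return True

def random_prime(lower):
    """Pick random candidates of about the right size until one tests prime.
    By the prime number theorem this takes O(log lower) tries on average."""
    while True:
        m = random.randrange(lower, 2 * lower)
        if is_probably_prime(m):
            return m

# e.g. m = random_prime(len(s)), a prime at least as big as the pattern length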
915 00:51:04,630 --> 00:51:10,330 So if I have some value, some current hash value u, 916 00:51:10,330 --> 00:51:13,440 it changes to u times a-- or sorry, 917 00:51:13,440 --> 00:51:17,890 this is the number represented by the string. 918 00:51:17,890 --> 00:51:20,460 I multiply by a and then I add on the character. 919 00:51:20,460 --> 00:51:23,620 Or, in Python you'd write ord of the character. 920 00:51:23,620 --> 00:51:27,860 That's the number associated with that character. 921 00:51:27,860 --> 00:51:29,160 That gives me the new string. 922 00:51:29,160 --> 00:51:29,770 Very easy. 923 00:51:29,770 --> 00:51:33,720 If what I want to do is skip, it's slightly more annoying. 924 00:51:33,720 --> 00:51:37,290 But skip means just annihilate this value. 925 00:51:37,290 --> 00:51:45,310 And so it's like u goes to u minus the character times a 926 00:51:45,310 --> 00:51:48,980 to the power size of u minus 1. 927 00:51:48,980 --> 00:51:52,080 I have to shift this character over to that position 928 00:51:52,080 --> 00:51:53,830 and then annihilate it with a minus sign. 929 00:51:53,830 --> 00:51:56,340 You could also do XOR. 930 00:51:56,340 --> 00:51:58,820 And when I do this, I just think about how 931 00:51:58,820 --> 00:52:00,250 the hash function is changing. 932 00:52:00,250 --> 00:52:02,540 Everything is just modulo m. 933 00:52:02,540 --> 00:52:05,370 So if I have some hash value here, r, 934 00:52:05,370 --> 00:52:10,000 I take r times a plus ord of c and I just 935 00:52:10,000 --> 00:52:13,210 do that computation modulo m, and I'll 936 00:52:13,210 --> 00:52:15,140 get the new hash value. 937 00:52:15,140 --> 00:52:18,630 Do the same thing down here, I'll get the new hash value. 938 00:52:18,630 --> 00:52:22,730 So what r stores is the current hash value. 939 00:52:22,730 --> 00:52:27,810 And it stores a to the power length of u or length 940 00:52:27,810 --> 00:52:30,200 of x, whatever you want to call it. 941 00:52:30,200 --> 00:52:33,606 I guess that would be a little better. 942 00:52:33,606 --> 00:52:35,480 And then it can do these in a constant number 943 00:52:35,480 --> 00:52:36,280 of operations. 944 00:52:36,280 --> 00:52:37,955 Just compute everything modulo m, 945 00:52:37,955 --> 00:52:40,124 one multiplication, one addition. 946 00:52:40,124 --> 00:52:41,790 You can do append and skip, and then you 947 00:52:41,790 --> 00:52:43,560 have the hash value instantly. 948 00:52:43,560 --> 00:52:44,820 It's just stored. 949 00:52:44,820 --> 00:52:47,330 And then you can make all this work.
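Concretely, those two update rules give a constant-time rolling hash under the division method. This is only a sketch, not the problem-set solution: the default modulus is a fixed prime for simplicity, whereas the analysis above wants a random prime (for example, one produced by the random_prime sketch earlier), and the modular inverse via pow(a, -1, m) assumes Python 3.8 or later.

class RollingHash:
    """Division-method rolling hash: the string x is treated as a number
    in base a, and we store that number modulo m.  append and skip each
    take a constant number of arithmetic operations."""

    def __init__(self, a=256, m=1000000007):
        self.a = a                       # base = alphabet size (256 for ASCII bytes)
        self.m = m                       # modulus; Karp-Rabin wants a random prime here
        self.inv_a = pow(a, -1, m)       # a^{-1} mod m; fine since m is prime and a < m
        self.r = 0                       # current hash value: (number for x) mod m
        self.magic = 1                   # a**len(x) mod m, maintained as x grows/shrinks

    def __call__(self):
        return self.r                    # the hash value is just stored, O(1)

    def append(self, c):
        # shift the number left one digit (multiply by a) and add the new character
        self.r = (self.r * self.a + ord(c)) % self.m
        self.magic = (self.magic * self.a) % self.m

    def skip(self, c):
        # annihilate the leading digit: subtract ord(c) * a**(len(x) - 1)
        self.magic = (self.magic * self.inv_a) % self.m   # now a**(len(x) - 1)
        self.r = (self.r - ord(c) * self.magic) % self.m

With this class plugged into the karp_rabin sketch above, append and skip are each a couple of multiplications and additions modulo m, so the whole matching loop runs in the expected linear time claimed. Tracking a**(len(x) - 1) directly would avoid the modular inverse, at the cost of slightly messier bookkeeping for the empty string.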