2
00:00:07,000 --> 00:00:11,000
Today we're going to not talk
about sorting.

3
00:00:11,000 --> 00:00:14,000
This is an exciting new
development.

4
00:00:14,000 --> 00:00:18,000
We're going to talk about
another problem,

5
00:00:18,000 --> 00:00:23,000
a related problem,
but a different problem.

6
00:00:35,000 --> 00:00:38,000
We're going to talk about
another problem that we would

7
00:00:38,000 --> 00:00:41,000
like to solve in linear time.
Last class we talked about we

8
00:00:41,000 --> 00:00:44,000
could do sorting in linear time.
To do that we needed some

9
00:00:44,000 --> 00:00:47,000
additional assumptions.
Today we're going to look at a

10
00:00:47,000 --> 00:00:51,000
problem that really only needs
linear time, even though at

11
00:00:51,000 --> 00:00:54,000
first glance it might look like
it requires sorting.

12
00:00:54,000 --> 00:00:56,000
So this is going to be an
easier problem.

13
00:00:56,000 --> 00:01:00,000
The problem is I give you a
bunch of numbers.

14
00:01:00,000 --> 00:01:06,000
Let's call them elements.
And they are in some array,

15
00:01:06,000 --> 00:01:11,000
let's say.
And they're in no particular

16
00:01:11,000 --> 00:01:18,000
order, so unsorted.
I want to find the kth smallest

17
00:01:18,000 --> 00:01:20,000
element.

18
00:01:26,000 --> 00:01:30,000
This is called the element of
rank k.

19
00:01:37,000 --> 00:01:39,000
In other words,
I have this list of numbers

20
00:01:39,000 --> 00:01:43,000
which is unsorted.
And, if I were to sort it,

21
00:01:43,000 --> 00:01:46,000
I would like to know what the
kth element is.

22
00:01:46,000 --> 00:01:50,000
But I'm not allowed to sort it.
One solution to this problem,

23
00:01:50,000 --> 00:01:54,000
this is the naive algorithm,
is you just sort and then

24
00:01:54,000 --> 00:01:57,000
return the kth element.
This is another possible

25
00:01:57,000 --> 00:02:03,000
definition of the problem.
And we would like to do better

26
00:02:03,000 --> 00:02:05,000
than that.
So you could sort,

27
00:02:05,000 --> 00:02:10,000
what's called the array A,
and then return A[k].

28
00:02:10,000 --> 00:02:16,000
That is one thing we could do.
And if we use heap sort or

29
00:02:16,000 --> 00:02:20,000
mergesort, this will take n lg n
time.
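
The naive approach just described can be sketched in a few lines. This is only an illustration: the function name `select_by_sorting` is made up here, and Python's built-in `sorted` stands in for heap sort or mergesort (all are O(n lg n)).

```python
def select_by_sorting(a, k):
    """Return the element of rank k (1-indexed) by sorting a copy of a."""
    if not 1 <= k <= len(a):
        raise ValueError("k must be between 1 and len(a)")
    # The lecture's A[k] is 1-indexed; Python lists are 0-indexed.
    return sorted(a)[k - 1]
```

The cost is dominated by the sort, which is exactly the n lg n bound the rest of the lecture tries to beat.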

30
00:02:20,000 --> 00:02:23,000
We would like to do better than
n lg n.

31
00:02:23,000 --> 00:02:29,000
Ideally linear time.
The problem is pretty natural,

32
00:02:29,000 --> 00:02:34,000
straightforward.
It has various applications.

33
00:02:34,000 --> 00:02:39,000
Depending on how you choose k,
k could be any number between 1

34
00:02:39,000 --> 00:02:41,000
and n.
For example,

35
00:02:41,000 --> 00:02:44,000
if we choose k=1 that element
has a name.

36
00:02:44,000 --> 00:02:47,000
Any suggestions of what the
name is?

37
00:02:47,000 --> 00:02:50,000
The minimum.
That's easy.

38
00:02:50,000 --> 00:02:55,000
Any suggestions on how we could
find the minimum element in an

39
00:02:55,000 --> 00:02:59,000
array in linear time?
Right.

40
00:02:59,000 --> 00:03:04,000
Just scan through the array.
Keep track of what the smallest

41
00:03:04,000 --> 00:03:08,000
number is that you've seen.
The same thing with the

42
00:03:08,000 --> 00:03:12,000
maximum, k=n.
These are rather trivial.
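
The linear scan for k=1 might look like the sketch below. The name `minimum` is just a label for this example; the maximum is the same loop with the comparison flipped.

```python
def minimum(a):
    """Return the smallest element of a non-empty list in one pass."""
    if not a:
        raise ValueError("a must be non-empty")
    smallest = a[0]
    for x in a[1:]:
        # Keep track of the smallest number seen so far.
        if x < smallest:
            smallest = x
    return smallest
```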

43
00:03:12,000 --> 00:03:17,000
But a more interesting version
of the order statistic problem

44
00:03:17,000 --> 00:03:21,000
is to find the median.
This is either k = floor((n+1)/2)

45
00:03:21,000 --> 00:03:26,000
or k = ceiling((n+1)/2).
I will call both of those

46
00:03:26,000 --> 00:03:29,000
elements medians.
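
As a quick check of the two median ranks, here is a small sketch. The helper name `median_ranks` is hypothetical, and it uses integer arithmetic in place of floor and ceiling.

```python
def median_ranks(n):
    """Return (floor((n+1)/2), ceiling((n+1)/2)) for a list of n elements."""
    lower = (n + 1) // 2   # floor((n+1)/2)
    upper = (n + 2) // 2   # equals ceiling((n+1)/2) for integer n
    return lower, upper
```

For odd n the two ranks coincide; for even n they are the two middle positions.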

47
00:03:34,000 --> 00:03:37,000
Finding the median of an
unsorted array in linear time is

48
00:03:37,000 --> 00:03:39,000
quite tricky.
And that sort of is the main

49
00:03:39,000 --> 00:03:41,000
goal of this lecture,
is to be able to find the

50
00:03:41,000 --> 00:03:43,000
medians.
For free we're going to be able

51
00:03:43,000 --> 00:03:46,000
to find the arbitrary kth
smallest element,

52
00:03:46,000 --> 00:03:48,000
but typically we're most
interested in finding the

53
00:03:48,000 --> 00:03:50,000
median.
And on Friday in recitation

54
00:03:50,000 --> 00:03:52,000
you'll see why that is so
useful.

55
00:03:52,000 --> 00:03:55,000
There are all sorts of
situations where you can use

56
00:03:55,000 --> 00:03:58,000
median for really effective
divide-and-conquer without

57
00:03:58,000 --> 00:04:02,000
having to sort.
You can solve a lot of problems

58
00:04:02,000 --> 00:04:07,000
in linear time as a result.
And we're going to cover today

59
00:04:07,000 --> 00:04:10,000
two algorithms for finding order
statistics.

60
00:04:10,000 --> 00:04:15,000
Both of them are linear time.
The first one is randomized,

61
00:04:15,000 --> 00:04:18,000
so it's only linear expected
time.

62
00:04:18,000 --> 00:04:21,000
And the second one is
worst-case linear time,

63
00:04:21,000 --> 00:04:25,000
and it will build on the
randomized version.

64
00:04:25,000 --> 00:04:31,000
Let's start with a randomized
divide-and-conquer algorithm.

65
00:04:46,000 --> 00:04:49,000
This algorithm is called
rand-select.

66
00:05:02,000 --> 00:05:06,000
And the parameters are a little
bit more than what we're used

67
00:05:06,000 --> 00:05:08,000
to.
In the order statistics problem

68
00:05:08,000 --> 00:05:12,000
you're given an array A.
And here I've changed notation

69
00:05:12,000 --> 00:05:15,000
and I'm looking for the ith
smallest element,

70
00:05:15,000 --> 00:05:18,000
so i is the index I'm looking
for.

71
00:05:18,000 --> 00:05:21,000
And I'm also going to change
the problem a little bit.

72
00:05:21,000 --> 00:05:25,000
And instead of trying to find
it in the whole array,

73
00:05:25,000 --> 00:05:29,000
I'm going to look in a
particular interval of the

74
00:05:29,000 --> 00:05:33,000
array, A from p up to q.
We're going to need that for a

75
00:05:33,000 --> 00:05:36,000
recursion.
This better be a recursive

76
00:05:36,000 --> 00:05:39,000
algorithm because we're using
divide-and-conquer.

77
00:05:39,000 --> 00:05:41,000
Here is the algorithm.

78
00:05:51,000 --> 00:05:54,000
With a base case.
It's pretty simple.

79
00:05:54,000 --> 00:06:00,000
Then we're going to use part of
the quicksort algorithm,

80
00:06:00,000 --> 00:06:03,000
randomized quicksort.

81
00:06:09,000 --> 00:06:13,000
We didn't actually define this
subroutine two lectures ago,

82
00:06:13,000 --> 00:06:17,000
but you should know what it
does, especially if you've read

83
00:06:17,000 --> 00:06:20,000
the textbook.
This says in the array A[p...q]

84
00:06:20,000 --> 00:06:24,000
pick a random element,
so pick a random index between

85
00:06:24,000 --> 00:06:30,000
p and q, swap it with the first
element, then call partition.

86
00:06:30,000 --> 00:06:34,000
And partition uses that first
element to split the rest of the

87
00:06:34,000 --> 00:06:39,000
array into less than or equal to
that random partition element and

88
00:06:39,000 --> 00:06:42,000
greater than or equal to that
partition element.

89
00:06:42,000 --> 00:06:47,000
This is just picking a random
partition element between p and

90
00:06:47,000 --> 00:06:52,000
q, cutting the array in half,
although the two sizes may not

91
00:06:52,000 --> 00:06:54,000
be equal.
And it returns the index of

92
00:06:54,000 --> 00:07:00,000
that partition element,
some number between p and q.
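
The subroutine described above could be sketched as follows. This is an assumption-laden illustration, not the textbook's exact code: indices are 0-based and inclusive, the names `partition` and `rand_partition` are chosen for this example, and the optional `rng` parameter is added only so the random choice can be seeded.

```python
import random


def partition(a, p, q):
    """Partition a[p..q] (inclusive) around the pivot a[p].

    Afterwards everything left of the returned index is <= the pivot
    and everything right of it is > the pivot.
    """
    x = a[p]
    i = p
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            i += 1
            a[i], a[j] = a[j], a[i]
    # Put the pivot between the two parts.
    a[p], a[i] = a[i], a[p]
    return i


def rand_partition(a, p, q, rng=random):
    """Swap a random element of a[p..q] into position p, then partition."""
    s = rng.randint(p, q)  # randint includes both endpoints
    a[p], a[s] = a[s], a[p]
    return partition(a, p, q)
```

The whole thing is a single pass over a[p..q], which is why it costs linear time.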

93
00:07:00,000 --> 00:07:08,000
And we're going to define k to
be this particular value,

94
00:07:08,000 --> 00:07:15,000
r minus p plus 1.
And the reason for that is that

95
00:07:15,000 --> 00:07:21,000
k is then the rank of the
partition element.

96
00:07:21,000 --> 00:07:30,000
This is in A[p...q].
Let me draw a picture here.

97
00:07:30,000 --> 00:07:34,000
We have our array A.
It starts at p and ends at q.

98
00:07:34,000 --> 00:07:38,000
There is other stuff,
but for this recursive call all we

99
00:07:38,000 --> 00:07:42,000
care about is p up to q.
We pick a random partition

100
00:07:42,000 --> 00:07:47,000
element, say this one,
and we partition things so that

101
00:07:47,000 --> 00:07:50,000
everything in here,
let's call this r,

102
00:07:50,000 --> 00:07:55,000
is less than or equal to A[r]
and everything up here is

103
00:07:55,000 --> 00:08:00,000
greater than or equal to A[r].
And A[r] is our partition

104
00:08:00,000 --> 00:08:03,000
element.
After this call,

105
00:08:03,000 --> 00:08:06,000
that's what the array looks
like.

106
00:08:06,000 --> 00:08:09,000
And we get r.
We get the index of where the

107
00:08:09,000 --> 00:08:14,000
partition element is stored.
The number of elements that are

108
00:08:14,000 --> 00:08:20,000
less than or equal to A[r] and
including r is r minus p plus 1.

109
00:08:20,000 --> 00:08:23,000
There will be r minus p
elements here,

110
00:08:23,000 --> 00:08:28,000
and we're adding 1 to get this
element.

111
00:08:28,000 --> 00:08:32,000
And, if you start counting at
1, if this is rank 1,

112
00:08:32,000 --> 00:08:35,000
rank 2, this element will have
rank k.

113
00:08:35,000 --> 00:08:40,000
That's just from the
construction in the partition.

114
00:08:40,000 --> 00:08:46,000
And now we get to recurse.
And there are three cases --

115
00:08:53,000 --> 00:08:55,000
-- depending on how i relates
to k.

116
00:08:55,000 --> 00:08:57,000
Remember i is the rank that
we're looking for,

117
00:08:57,000 --> 00:09:01,000
k is the rank that we happen to
get out of this random

118
00:09:01,000 --> 00:09:03,000
partition.
We don't have much control over

119
00:09:03,000 --> 00:09:07,000
k, but if we're lucky i=k.
That's the element we want.

120
00:09:13,000 --> 00:09:15,000
Then we just return the
partition element.

121
00:09:15,000 --> 00:09:18,000
More likely is that the element
we're looking for is either to

122
00:09:18,000 --> 00:09:20,000
the left or to the right.
And if it's to the left we're

123
00:09:20,000 --> 00:09:23,000
going to recurse in the
left-hand portion of the array.

124
00:09:23,000 --> 00:09:26,000
And if it's to the right we're
going to recurse in the

125
00:09:26,000 --> 00:09:28,000
right-hand portion.
So, pretty straightforward at

126
00:09:28,000 --> 00:09:30,000
this point.

127
00:09:45,000 --> 00:09:48,000
I just have to get all the
indices right.

128
00:10:08,000 --> 00:10:11,000
Either we're going to recurse
on the part between p and r

129
00:10:11,000 --> 00:10:14,000
minus 1, that's this case.
The rank we're looking for is

130
00:10:14,000 --> 00:10:17,000
to the left of the rank of
element A[r].

131
00:10:17,000 --> 00:10:20,000
Or, we're going to recurse on
the right part between r plus 1

132
00:10:20,000 --> 00:10:22,000
and q.
Where we recurse on the left

133
00:10:22,000 --> 00:10:25,000
part the rank we're looking for
remains the same,

134
00:10:25,000 --> 00:10:28,000
but when we recurse on the
right part the rank we're

135
00:10:28,000 --> 00:10:33,000
looking for gets offset.
Because we sort of got rid of

136
00:10:33,000 --> 00:10:38,000
the k elements over here.
I should have written this

137
00:10:38,000 --> 00:10:42,000
length is k.
We've sort of swept away k

138
00:10:42,000 --> 00:10:46,000
ranks of elements.
And now within this array we're

139
00:10:46,000 --> 00:10:51,000
looking for the i minus kth
smallest element.

140
00:10:51,000 --> 00:10:55,000
That's the recursion.
We only recurse once.
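
Putting the pieces together, the recursion might be sketched like this. Several things here are assumptions: 0-based inclusive indices p and q, a 1-based rank i, and a base case of `p == q` (the transcript only says there is a simple base case). The partition step is inlined so the block stands on its own, and `rng` is a hypothetical parameter for seeding.

```python
import random


def rand_select(a, p, q, i, rng=random):
    """Return the i-th smallest element (rank i, 1-indexed) of a[p..q].

    Reorders a in place. Requires 1 <= i <= q - p + 1.
    """
    if p == q:
        return a[p]

    # Random partition: move a random pivot to a[p], then split a[p..q].
    s = rng.randint(p, q)
    a[p], a[s] = a[s], a[p]
    x, r = a[p], p
    for j in range(p + 1, q + 1):
        if a[j] <= x:
            r += 1
            a[r], a[j] = a[j], a[r]
    a[p], a[r] = a[r], a[p]

    k = r - p + 1  # rank of the pivot within a[p..q]
    if i == k:
        return a[r]
    if i < k:
        # Same rank, left part.
        return rand_select(a, p, r - 1, i, rng)
    # Right part: k ranks have been swept away.
    return rand_select(a, r + 1, q, i - k, rng)
```

Note that only one of the two recursive calls is ever made, unlike quicksort, which recurses on both sides.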

141
00:10:55,000 --> 00:11:00,000
And random partition is not a
recursion.

142
00:11:00,000 --> 00:11:04,000
That just takes linear time.
And the total amount of work

143
00:11:04,000 --> 00:11:09,000
we're doing here should be
linear time plus one recursion.

144
00:11:09,000 --> 00:11:14,000
And we'd next like to see what
the total running time is in

145
00:11:14,000 --> 00:11:19,000
expectation, but let's first do
a little example --

146
00:11:26,000 --> 00:11:29,000
-- to make this algorithm
perfectly clear.

147
00:11:29,000 --> 00:11:33,000
Let's suppose we're looking for
the seventh smallest element in

148
00:11:33,000 --> 00:11:35,000
this array.

149
00:11:50,000 --> 00:11:53,000
And let's suppose,
just for example,

150
00:11:53,000 --> 00:11:57,000
that the pivot we're using is
just the first element.

151
00:11:57,000 --> 00:12:02,000
So, nothing fancy.
I would have to flip a few

152
00:12:02,000 --> 00:12:06,000
coins in order to generate a
random one, so let's just pick

153
00:12:06,000 --> 00:12:09,000
this one.
If I partition at the element

154
00:12:09,000 --> 00:12:13,000
6, this is actually an example
we did two weeks ago,

155
00:12:13,000 --> 00:12:17,000
and I won't go through it
again, but we get the same

156
00:12:17,000 --> 00:12:21,000
array, as we did two weeks ago,
namely 2, 5,

157
00:12:21,000 --> 00:12:23,000
3, 6, 8, 13,
10 and 11.

158
00:12:23,000 --> 00:12:26,000
If you run through the
partitioning algorithm,

159
00:12:26,000 --> 00:12:31,000
that happens to be the order
that it throws the elements

160
00:12:31,000 --> 00:12:35,000
into.
And this is our position r.

161
00:12:35,000 --> 00:12:37,000
This is p here.
It's just 1.

162
00:12:37,000 --> 00:12:40,000
And q is just the end.
And I am looking for the

163
00:12:40,000 --> 00:12:44,000
seventh smallest element.
And it happens when I run this

164
00:12:44,000 --> 00:12:48,000
partition that 6 falls into the
fourth place.

165
00:12:48,000 --> 00:12:52,000
And we know that means,
because all the elements here

166
00:12:52,000 --> 00:12:56,000
are less than 6 and all the
elements here are greater than

167
00:12:56,000 --> 00:13:00,000
6, if this array were sorted,
6 would be right here in

168
00:13:00,000 --> 00:13:05,000
position four.
So, r here is 4.

169
00:13:05,000 --> 00:13:09,000
Yeah?
The 12 turned into an 11?

170
00:13:09,000 --> 00:13:13,000
This was an 11,
believe it or not.

171
00:13:13,000 --> 00:13:16,000
Let me be simple.
Sorry.

172
00:13:16,000 --> 00:13:20,000
Sometimes my ones look like
twos.

173
00:13:20,000 --> 00:13:27,000
Not a good feature.
That's an easy way to cover.

174
00:13:27,000 --> 00:13:31,000
[LAUGHTER]
Don't try that on exams.

175
00:13:31,000 --> 00:13:33,000
Oh, that one was just a two.
No.

176
00:13:33,000 --> 00:13:37,000
Even though we're not sorting
the array, we're only spending

177
00:13:37,000 --> 00:13:39,000
linear work here to partition by 6.

178
00:13:39,000 --> 00:13:43,000
We know that if we had sorted
the array 6 would fall here.

179
00:13:43,000 --> 00:13:46,000
We don't know about these other
elements.

180
00:13:46,000 --> 00:13:49,000
They're not in sorted order,
but from the properties of

181
00:13:49,000 --> 00:13:52,000
partition we know 6 went the
right spot.

182
00:13:52,000 --> 00:13:56,000
We now know rank of 6 is 4.
We happened to be looking for 7

183
00:13:56,000 --> 00:14:00,000
and we happened to get this
number 4.

184
00:14:00,000 --> 00:14:03,000
We want something over here.
It turns out we're looking for

185
00:14:03,000 --> 00:14:05,000
10, I guess.
No, 11.

186
00:14:05,000 --> 00:14:08,000
There should be eight elements
in this array,

187
00:14:08,000 --> 00:14:10,000
so it's the next to max.
Max here is 13,

188
00:14:10,000 --> 00:14:14,000
I'm cheating here.
The answer we're looking for is 11.

189
00:14:16,000 --> 00:14:20,000
We know that what we're looking
for is in the right-hand part
because the rank we're looking

190
00:14:20,000 --> 00:14:22,000
for is 7, which is bigger than 4.

191
00:14:22,000 --> 00:14:25,000
Now, what rank are we looking
for in here?

192
00:14:25,000 --> 00:14:30,000
Well, we've gotten rid of four
elements over here.

193
00:14:30,000 --> 00:14:35,000
It happened here that k is also
4 because p is 1 in this

194
00:14:35,000 --> 00:14:38,000
example.
The rank of 6 was 4.

195
00:14:38,000 --> 00:14:41,000
We throw away those four
elements.

196
00:14:41,000 --> 00:14:46,000
Now we're looking for rank 7
minus 4 which is 3.

197
00:14:46,000 --> 00:14:49,000
And, indeed,
the rank 3 element here is

198
00:14:49,000 --> 00:14:53,000
still 11.
So, you recursively find that.
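
The arithmetic of this example can be checked with a short script. It starts from the already-partitioned array quoted above and uses the lecture's 1-based positions; the variable names are just labels for this sketch, and `sorted` is used only to verify the answer.

```python
partitioned = [2, 5, 3, 6, 8, 13, 10, 11]

p = 1                           # 1-based start of the interval
r = partitioned.index(6) + 1    # 1-based position of the pivot: 4
k = r - p + 1                   # rank of the pivot: 4
i = 7                           # rank we are looking for

right = partitioned[r:]         # elements after the pivot: [8, 13, 10, 11]
new_rank = i - k                # 7 - 4 = 3

# Rank 3 in the right part and rank 7 in the whole array agree.
answer_in_right = sorted(right)[new_rank - 1]   # 11
answer_overall = sorted(partitioned)[i - 1]     # 11
```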

199
00:14:53,000 --> 00:14:58,000
That's your answer.
Now that algorithm should be

200
00:14:58,000 --> 00:15:03,000
pretty clear.
The tricky part is to analyze

201
00:15:03,000 --> 00:15:05,000
it.
And the analysis here is quite

202
00:15:05,000 --> 00:15:10,000
a bit like randomized quicksort,
although not quite as hairy,

203
00:15:10,000 --> 00:15:13,000
so it will go faster.
But it will be also sort of a

204
00:15:13,000 --> 00:15:18,000
nice review of the randomized
quicksort analysis which was a

205
00:15:18,000 --> 00:15:21,000
bit tricky and always good to
see a couple of times.

206
00:15:21,000 --> 00:15:26,000
We're going to follow the same
kind of outline as before to

207
00:15:26,000 --> 00:15:31,000
look at the expected running
time of this algorithm.

208
00:15:31,000 --> 00:15:34,000
And to start out we're going
to, as before,

209
00:15:34,000 --> 00:15:39,000
look at some intuition just to
feel good about ourselves.

210
00:15:39,000 --> 00:15:44,000
Also feel bad as you'll see.
Let's think about two sort of

211
00:15:44,000 --> 00:15:49,000
extreme cases,
a good case and the worst case.

212
00:15:49,000 --> 00:15:54,000
And I should mention that in
all of the analyses today we

213
00:15:54,000 --> 00:15:58,000
assume the elements are
distinct.

214
00:16:04,000 --> 00:16:08,000
It gets really messy if the
elements are not distinct.

215
00:16:08,000 --> 00:16:12,000
And you may even have to change
the algorithms a little bit

216
00:16:12,000 --> 00:16:16,000
because if all the elements are
equal, if you pick a random

217
00:16:16,000 --> 00:16:19,000
element, the partition does not
do so well.

218
00:16:19,000 --> 00:16:24,000
But let's assume they're all
distinct, which is the really

219
00:16:24,000 --> 00:16:28,000
interesting case.
A pretty lucky case --

220
00:16:28,000 --> 00:16:32,000
I mean the best case is we
partition right in the middle.

221
00:16:32,000 --> 00:16:37,000
The number of elements to the
left of our partition is equal

222
00:16:37,000 --> 00:16:42,000
to the number of elements to the
right of our partition.

223
00:16:42,000 --> 00:16:47,000
But almost as good would be
some kind of 1/10 to 9/10 split.

224
00:16:47,000 --> 00:16:50,000
Any constant fraction,
we should feel that.

225
00:16:50,000 --> 00:16:54,000
Any constant fraction is as
good as 1/2.

226
00:16:54,000 --> 00:16:58,000
Then the recurrence we get is,
let's say at most,

227
00:16:58,000 --> 00:17:01,000
this bad.
So, it depends.

228
00:17:01,000 --> 00:17:04,000
If we have let's say 1/10 on
the left and 9/10 on the right

229
00:17:04,000 --> 00:17:08,000
every time we do a partition.
It depends where our answer is.

230
00:17:08,000 --> 00:17:12,000
It could be if i is really
small it's in the 1/10 part.

231
00:17:12,000 --> 00:17:16,000
If i is really big it's going
to be in the 9/10 part,

232
00:17:16,000 --> 00:17:19,000
or most of the time it's going
to be in the 9/10 part.

233
00:17:19,000 --> 00:17:23,000
We're doing worst-case analysis
within the lucky case,

234
00:17:23,000 --> 00:17:25,000
so we're happy to have upper
bounds.

235
00:17:25,000 --> 00:17:30,000
I will say T(n) is at most
T(9n/10) + Theta(n).

236
00:17:30,000 --> 00:17:34,000
Clearly it's worse if we're in
the bigger part.

237
00:17:34,000 --> 00:17:38,000
What is the solution to this
recurrence?

238
00:17:38,000 --> 00:17:42,000
Oh, solving recurrence was so
long ago.

239
00:17:42,000 --> 00:17:47,000
What method should we use for
solving this recurrence?

240
00:17:47,000 --> 00:17:51,000
The master method.
What case are we in?

241
00:17:51,000 --> 00:17:52,000
Three.
Good.

242
00:17:52,000 --> 00:17:55,000
You still remember.
This is Case 3.

243
00:17:55,000 --> 00:18:01,000
We're looking at n^(log_b a).
b here is 10/9,

244
00:18:01,000 --> 00:18:06,000
although it doesn't really
matter because a is 1.

245
00:18:06,000 --> 00:18:11,000
log base anything of 1 is 0.
So, this is n^0 which is 1.

246
00:18:11,000 --> 00:18:14,000
And n is polynomially larger
than 1.

247
00:18:14,000 --> 00:18:18,000
This is going to be O(n),
which is good.

248
00:18:18,000 --> 00:18:21,000
That is what we want,
linear time.
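
The same bound can also be seen without the master method by unrolling the recurrence. The sketch below assumes the Theta(n) term is at most cn for some constant c and ignores floors and the constant base case:

```latex
\begin{align*}
T(n) &\le T\!\left(\tfrac{9}{10}n\right) + cn \\
     &\le cn + c\left(\tfrac{9}{10}\right)n + c\left(\tfrac{9}{10}\right)^{2}n + \cdots \\
     &\le cn \sum_{j=0}^{\infty} \left(\tfrac{9}{10}\right)^{j} = 10\,cn = O(n).
\end{align*}
```

The geometric series is why any constant-fraction split is as good as 1/2: only the constant in front changes.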

249
00:18:21,000 --> 00:18:25,000
If we're in the lucky case,
great.

250
00:18:25,000 --> 00:18:30,000
Unfortunately this is only
intuition.

251
00:18:30,000 --> 00:18:32,000
And we're not always going to
get the lucky case.

252
00:18:32,000 --> 00:18:35,000
We could do the same kind of
analysis as we did with

253
00:18:35,000 --> 00:18:38,000
randomized quicksort.
If you alternate between lucky

254
00:18:38,000 --> 00:18:41,000
and unlucky, things will still
be good, but let's just talk

255
00:18:41,000 --> 00:18:44,000
about the unlucky case to show
how bad things can get.

256
00:18:44,000 --> 00:18:48,000
And this really would be a
worst-case analysis.

257
00:18:53,000 --> 00:19:00,000
The unlucky case we get a split
of 0:n-1.

258
00:19:00,000 --> 00:19:04,000
Because we're removing the
partition element either way.

259
00:19:04,000 --> 00:19:09,000
And there could be nothing less
than the partition element.

260
00:19:09,000 --> 00:19:14,000
We have 0 on the left-hand side
and we have n-1 on the

261
00:19:14,000 --> 00:19:18,000
right-hand side.
Now we get a recurrence like

262
00:19:18,000 --> 00:19:23,000
T(n)=T(n-1) plus linear cost.
And what's the solution to that

263
00:19:23,000 --> 00:19:25,000
recurrence?
n^2.

264
00:19:25,000 --> 00:19:27,000
Yes.
This one you should just know.

265
00:19:27,000 --> 00:19:33,000
It's n^2 because it's an
arithmetic series.
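
Written out, with the linear cost taken to be cn for some constant c (an assumption made for this sketch), the arithmetic series is:

```latex
\begin{align*}
T(n) &= T(n-1) + cn \\
     &= cn + c(n-1) + c(n-2) + \cdots + c \cdot 1 + T(0) \\
     &= c\,\frac{n(n+1)}{2} + T(0) = \Theta(n^{2}).
\end{align*}
```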

266
00:19:38,000 --> 00:19:40,000
And that's pretty bad.
This is much,

267
00:19:40,000 --> 00:19:43,000
much worse than sorting and
then picking the ith element.

268
00:19:43,000 --> 00:19:46,000
In the worst-case this
algorithm really sucks,

269
00:19:46,000 --> 00:19:49,000
but most of the time it's going
to do really well.

270
00:19:49,000 --> 00:19:52,000
And, unless you're really,
really unlucky and every coin

271
00:19:52,000 --> 00:19:56,000
you flip gives the wrong answer,
you won't get this case and you

272
00:19:56,000 --> 00:19:58,000
will get something more like the
lucky case.

273
00:19:58,000 --> 00:20:02,000
At least that's what we'd like
to prove.

274
00:20:02,000 --> 00:20:05,000
And we will prove that the
expected running time here is

275
00:20:05,000 --> 00:20:07,000
linear.
So, it's very rare to get

276
00:20:07,000 --> 00:20:09,000
anything quadratic.
But later on we will see how to

277
00:20:09,000 --> 00:20:11,000
make the worst-case linear as
well.

278
00:20:11,000 --> 00:20:15,000
This would really,
really solve the problem.

279
00:20:30,000 --> 00:20:34,000
Let's get into the analysis.

280
00:20:43,000 --> 00:20:47,000
Now, you've seen an analysis
much like this before.

281
00:20:47,000 --> 00:20:51,000
What do you suggest we do in
order to analyze this expected

282
00:20:51,000 --> 00:20:54,000
time?
It's a divide-and-conquer

283
00:20:54,000 --> 00:20:59,000
algorithm, so we kind of like to
write down the recurrence on

284
00:20:59,000 --> 00:21:03,000
something resembling the running
time.

285
00:21:09,000 --> 00:21:12,000
I don't need the answer,
but what's the first step that

286
00:21:12,000 --> 00:21:16,000
we might do to analyze the
expected running time of this

287
00:21:16,000 --> 00:21:18,000
algorithm?
Sorry?

288
00:21:18,000 --> 00:21:20,000
Look at different cases,
yeah.

289
00:21:20,000 --> 00:21:22,000
Exactly.
We have all these possible ways

290
00:21:22,000 --> 00:21:25,000
that random partition could
split.

291
00:21:25,000 --> 00:21:30,000
It could split 0 to the n-1.
It could split in half.

292
00:21:30,000 --> 00:21:33,000
There are n choices where it
could split.

293
00:21:33,000 --> 00:21:35,000
How can we break into those
cases?

294
00:21:35,000 --> 00:21:38,000
Indicator random variables.
Cool.

295
00:21:38,000 --> 00:21:41,000
Exactly.
That's what we want to do.

296
00:21:41,000 --> 00:21:46,000
Indicator random variable
suggests that what we're dealing

297
00:21:46,000 --> 00:21:50,000
with is not exactly just a
function T(n) but it's a random

298
00:21:50,000 --> 00:21:53,000
variable.
This is one subtlety.

299
00:21:53,000 --> 00:21:57,000
T(n) depends on the random
choices, so it's really a random

300
00:21:57,000 --> 00:22:00,000
variable.

301
00:22:05,000 --> 00:22:08,000
And then we're going to use
indicator random variables to

302
00:22:08,000 --> 00:22:10,000
get a recurrence on T(n).

303
00:22:25,000 --> 00:22:32,000
So, T(n) is the running time of
rand-select on an input of size

304
00:22:32,000 --> 00:22:33,000
n.

305
00:22:40,000 --> 00:22:46,000
And I am also going to write
down explicitly an assumption

306
00:22:46,000 --> 00:22:49,000
about the random numbers.

307
00:22:55,000 --> 00:23:00,000
That they should be chosen
independently from each other.

308
00:23:00,000 --> 00:23:03,000
Every time I call random
partition, it's generating a

309
00:23:03,000 --> 00:23:07,000
completely independent random
number from all the other times

310
00:23:07,000 --> 00:23:10,000
I call random partition.
That is important,

311
00:23:10,000 --> 00:23:12,000
of course, for this analysis to
work.

312
00:23:12,000 --> 00:23:15,000
We will see why some point down
the line.

313
00:23:15,000 --> 00:23:19,000
And now, to sort of write down
an equation for T(n) we're going

314
00:23:19,000 --> 00:23:24,000
to define indicator random
variables, as you suggested.

315
00:23:36,000 --> 00:23:44,000
And we will call it X_k.
And this is for all k=0...n-1.

316
00:23:50,000 --> 00:23:54,000
An indicator random variable is
either 1 or 0.

317
00:23:54,000 --> 00:24:00,000
And it's going to be 1 if the
partition comes out k on the

318
00:24:00,000 --> 00:24:06,000
left-hand side.
So say the partition generates

319
00:24:06,000 --> 00:24:11,000
a k:n-k-1 split and it is 0
otherwise.

320
00:24:11,000 --> 00:24:17,000
We have n of these indicator
random variables between

321
00:24:17,000 --> 00:24:20,000
0...n-1.
And in each case,

322
00:24:20,000 --> 00:24:27,000
no matter how the random choice
comes out, exactly one of them

323
00:24:27,000 --> 00:24:32,000
will be 1.
All the others will be 0.
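
In symbols, the definition just given reads as follows (a restatement of the spoken definition, nothing more):

```latex
\[
X_k =
\begin{cases}
1 & \text{if the partition generates a } k : n-k-1 \text{ split},\\
0 & \text{otherwise},
\end{cases}
\qquad k = 0, 1, \ldots, n-1,
\qquad \sum_{k=0}^{n-1} X_k = 1 .
\]
```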

324
00:24:32,000 --> 00:24:37,000
Now we can divide out the
running time of this algorithm

325
00:24:37,000 --> 00:24:40,000
based on which case we're in.

326
00:24:49,000 --> 00:24:57,000
That will sort of unify this
intuition that we did and get

327
00:24:57,000 --> 00:25:02,000
all the cases.
And then we can look at the

328
00:25:02,000 --> 00:25:08,000
expectation.
T(n), if we just split out by

329
00:25:08,000 --> 00:25:15,000
cases, we have an upper bound
like this.

330
00:25:28,000 --> 00:25:33,000
If we have 0 to n-1 split,
the worst is we have n-1.

331
00:25:33,000 --> 00:25:38,000
Then we have to recurse in a
problem of size n-1.

332
00:25:38,000 --> 00:25:43,000
In fact, it would be pretty
hard to recurse in a problem of

333
00:25:43,000 --> 00:25:47,000
size 0.
If we have a 1 to n-2 split

334
00:25:47,000 --> 00:25:51,000
then we take the max of the two
sides.

335
00:25:51,000 --> 00:25:58,000
That's certainly going to give
us an upper bound and so on.

336
00:26:03,000 --> 00:26:06,000
And at the bottom you get an
n-1 to 0 split.

337
00:26:14,000 --> 00:26:16,000
This is now sort of
conditioning on various events,

338
00:26:16,000 --> 00:26:19,000
but we have indicator random
variables to tell us when these

339
00:26:19,000 --> 00:26:21,000
events happen.
We can just multiply each of

340
00:26:21,000 --> 00:26:25,000
these values by the indicator
random variable and it will come

341
00:26:25,000 --> 00:26:28,000
out 0 if that's not the case and
will come out 1 and give us this

342
00:26:28,000 --> 00:26:31,000
value if that happens to be the
split.

343
00:26:31,000 --> 00:26:37,000
So, if we add up all of those
we'll get the same thing.

344
00:26:37,000 --> 00:26:45,000
This is equal to the sum over
all k of the indicator random

345
00:26:45,000 --> 00:26:52,000
variable times the cost in that
case, which is T of the max of k

346
00:26:52,000 --> 00:26:57,000
and the other side,
which is n-k-1,

347
00:26:57,000 --> 00:27:01,000
plus Theta(n).
This is our recurrence,

348
00:27:01,000 --> 00:27:04,000
in some sense,
for the random variable

349
00:27:04,000 --> 00:27:09,000
representing running time.
Now, the value will depend on

350
00:27:09,000 --> 00:27:13,000
which case we come into.
We know the probability of each

351
00:27:13,000 --> 00:27:19,000
of these events happening is the
same because we're choosing the

352
00:27:19,000 --> 00:27:23,000
partition element uniformly at
random, but we cannot really

353
00:27:23,000 --> 00:27:29,000
simplify much beyond this until
we take expectations.

354
00:27:29,000 --> 00:27:32,000
We know this random variable
could be as big as n^2.

355
00:27:32,000 --> 00:27:37,000
Hopefully it's usually linear.
We will take expectations of

356
00:27:37,000 --> 00:27:40,000
both sides and get what we want.
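
For reference, the case split and the summation described above can be written out like this. It is a transcription of the spoken recurrence into notation, assuming distinct elements as stated earlier:

```latex
\begin{align*}
T(n) &\le
\begin{cases}
T(\max\{0,\, n-1\}) + \Theta(n) & \text{if } 0 : n-1 \text{ split},\\
T(\max\{1,\, n-2\}) + \Theta(n) & \text{if } 1 : n-2 \text{ split},\\
\qquad\vdots \\
T(\max\{n-1,\, 0\}) + \Theta(n) & \text{if } n-1 : 0 \text{ split}
\end{cases} \\[4pt]
&= \sum_{k=0}^{n-1} X_k \bigl( T(\max\{k,\, n-k-1\}) + \Theta(n) \bigr).
\end{align*}
```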

357
00:27:54,000 --> 00:27:58,000
Let's look at the expectation
of this random variable,

358
00:27:58,000 --> 00:28:02,000
which is just the expectation,
I will copy over,

359
00:28:02,000 --> 00:28:07,000
summation we have here so I can
work on this board.

360
00:28:30,000 --> 00:28:33,000
I want to compute the
expectation of this summation.

361
00:28:33,000 --> 00:28:36,000
What property of expectation
should I use?

362
00:28:36,000 --> 00:28:39,000
Linearity, good.
We can bring the summation

363
00:28:39,000 --> 00:28:41,000
outside.

364
00:29:08,000 --> 00:29:09,000
Now I have a sum of
expectations.

365
00:29:09,000 --> 00:29:12,000
Let's look at each expectation
individually.

366
00:29:12,000 --> 00:29:15,000
It's a product of two random
variables, if you will.

367
00:29:15,000 --> 00:29:19,000
This is an indicator random
variable and this is some more

368
00:29:19,000 --> 00:29:22,000
complicated function,
some more complicated random

369
00:29:22,000 --> 00:29:24,000
variable representing some
running time,

370
00:29:24,000 --> 00:29:28,000
which depends on what random
choices are made in that

371
00:29:28,000 --> 00:29:31,000
recursive call.
Now what should I do?

372
00:29:31,000 --> 00:29:37,000
I have the expectation of the
product of two random variables.

373
00:29:37,000 --> 00:29:39,000
Independence,
exactly.

374
00:29:39,000 --> 00:29:45,000
If I know that these two random
variables are independent then I

375
00:29:45,000 --> 00:29:51,000
know that the expectation of the
product is the product of the

376
00:29:51,000 --> 00:29:55,000
expectations.
Now we have to check are they

377
00:29:55,000 --> 00:29:58,000
independent?
I hope so because otherwise

378
00:29:58,000 --> 00:30:04,000
there isn't much else I can do.
Why are they independent?

379
00:30:04,000 --> 00:30:07,000
Sorry?
Because we stated that they

380
00:30:07,000 --> 00:30:10,000
are, right.
Because of this assumption.

381
00:30:10,000 --> 00:30:14,000
We assume that all the random
numbers are chosen

382
00:30:14,000 --> 00:30:17,000
independently.
We need to sort of interpolate

383
00:30:17,000 --> 00:30:19,000
that here.
These X_k's,

384
00:30:19,000 --> 00:30:21,000
all the X_k's,
X_0 up to X_n-1,

385
00:30:21,000 --> 00:30:26,000
so all the ones appearing in
this summation are dependent

386
00:30:26,000 --> 00:30:30,000
upon a single random choice of
this particular call to random

387
00:30:30,000 --> 00:30:36,000
partition.
All of these are correlated,

388
00:30:36,000 --> 00:30:44,000
because if one of them is 1,
all the others are forced to be 0.

389
00:30:47,000 --> 00:30:54,000
So, there is a lot of
correlation among the X_k's.
But with respect to everything

390
00:30:54,000 --> 00:31:00,000
that is in here,
and the only random part is

391
00:31:00,000 --> 00:31:07,000
this T(max(k, n-k-1)).
That is the reason that this

392
00:31:07,000 --> 00:31:12,000
random variable is independent
from these.

393
00:31:12,000 --> 00:31:19,000
The same thing as quicksort,
but I know some people got

394
00:31:19,000 --> 00:31:24,000
confused about it a couple
lectures ago so I am

395
00:31:24,000 --> 00:31:29,000
reiterating.
We get the product of

396
00:31:29,000 --> 00:31:35,000
expectations,
E[X_k] E[T(max(k, n-k-1))].

397
00:31:35,000 --> 00:31:40,000
I mean the order n comes
outside, but let's leave it

398
00:31:40,000 --> 00:31:44,000
inside for now.
There is no expectation to

399
00:31:44,000 --> 00:31:49,000
compute there for order n.
Order n is order n.

400
00:31:49,000 --> 00:31:55,000
What is the expectation of X_k?
1/n, because they're all chosen

401
00:31:55,000 --> 00:32:00,000
with equal probability.
There is n of them,

402
00:32:00,000 --> 00:32:04,000
so the expectation is 1/n.
The value is either 1 or 0.

403
00:32:04,000 --> 00:32:07,000
We start to be able to split
this up.

404
00:32:07,000 --> 00:32:12,000
We have 1/n times this expected
value of some recursive T call,

405
00:32:12,000 --> 00:32:15,000
and then we have plus 1 over n
times order n,

406
00:32:15,000 --> 00:32:20,000
also known as a constant,
but everything is summed up n

407
00:32:20,000 --> 00:32:23,000
times so let's expand this.

408
00:32:35,000 --> 00:32:42,000
I have the sum k=0 to n-1.
I guess the 1/n can come

409
00:32:42,000 --> 00:32:47,000
outside.
And we have expectation of

410
00:32:47,000 --> 00:32:54,000
[T(max(k, n-k-1))].
Lots of nifty braces there.

411
00:32:54,000 --> 00:32:59,000
And then plus we have,
on the other hand,

412
00:32:59,000 --> 00:33:06,000
the sum k=0 to n-1.
Let me just write that out

413
00:33:06,000 --> 00:33:08,000
again.
We have a 1/n in front and we

414
00:33:08,000 --> 00:33:12,000
have a Theta(n) inside.
This summation is n^2.

415
00:33:12,000 --> 00:33:16,000
And then we're dividing by n,
so this whole thing is,

416
00:33:16,000 --> 00:33:20,000
again, order n.
Nothing fancy happened there.

417
00:33:20,000 --> 00:33:25,000
This is really just saying the
expectation of order n is order

418
00:33:25,000 --> 00:33:27,000
n.
Average value of order n is

419
00:33:27,000 --> 00:33:31,000
order n.
What is interesting is this

420
00:33:31,000 --> 00:33:35,000
part.
Now, what could we do with this

421
00:33:35,000 --> 00:33:38,000
summation?
Here we start to differ from

422
00:33:38,000 --> 00:33:43,000
randomized quicksort because we
have this max.

423
00:33:43,000 --> 00:33:48,000
Randomized quicksort we had the
sum of T(k) plus T(n-k-1)

424
00:33:48,000 --> 00:33:52,000
because we were making both
recursive calls.

425
00:33:52,000 --> 00:33:56,000
Here we're only making the
biggest one.

426
00:33:56,000 --> 00:34:03,000
That max is really a pain for
evaluating this recurrence.

427
00:34:03,000 --> 00:34:11,000
How could I get rid of the max?
That's one way to think of it.

428
00:34:11,000 --> 00:34:13,000
Yeah?

429
00:34:18,000 --> 00:34:20,000
Exactly.
I could only sum up to halfway

430
00:34:20,000 --> 00:34:23,000
and then double.
In other words,

431
00:34:23,000 --> 00:34:26,000
terms are getting repeated
twice here.

432
00:34:26,000 --> 00:34:30,000
When k=0 or when k=n-1,
I get the same T(n-1).

433
00:34:30,000 --> 00:34:33,000
When k=1 or n-2,
I get the same thing,

434
00:34:33,000 --> 00:34:37,000
2 and n-3.
What I will actually do is sum

435
00:34:37,000 --> 00:34:42,000
from halfway up.
That's a little bit cleaner.

436
00:34:42,000 --> 00:34:45,000
And let me get the indices
right.

437
00:34:45,000 --> 00:34:49,000
Floor of n/2 up to n-1 will be
safe.

438
00:34:49,000 --> 00:34:55,000
And then I just have E[T(k)],
except I forgot to multiply by

439
00:34:55,000 --> 00:35:01,000
2, so I'm going to change this 1
to a 2.

440
00:35:01,000 --> 00:35:04,000
And order n is preserved.
This is just because each term

441
00:35:04,000 --> 00:35:07,000
is appearing twice.
I can factor it out.

442
00:35:07,000 --> 00:35:10,000
And if n is odd,
I'm actually double-counting

443
00:35:10,000 --> 00:35:13,000
somewhat, but it's certainly at
most that.
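To see the doubling step concretely, here is a small Python sketch. The helper names max_terms and doubled_upper_half are made up for illustration. It lists the subproblem sizes max(k, n-k-1) for k from 0 to n-1 and compares them with the sizes floor(n/2) up to n-1, each counted twice.

```python
from collections import Counter

def max_terms(n):
    """Subproblem sizes max(k, n-k-1) for k = 0..n-1."""
    return [max(k, n - k - 1) for k in range(n)]

def doubled_upper_half(n):
    """Every size from floor(n/2) to n-1, listed twice."""
    return [k for k in range(n // 2, n) for _ in range(2)]

# Compare the two multisets for one even and one odd n.
for n in (6, 7):
    print(n, sorted(max_terms(n)), sorted(doubled_upper_half(n)))
```

For even n the two lists agree; for odd n the doubled list has one extra copy of the middle size, which is the double-counting mentioned above and only makes the upper bound larger.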

444
00:35:13,000 --> 00:35:17,000
So, that's a safe upper bound.
And upper bounds are all we

445
00:35:17,000 --> 00:35:20,000
care about because we're hoping
to get linear.

446
00:35:20,000 --> 00:35:24,000
And the running time of this
algorithm is definitely at least

447
00:35:24,000 --> 00:35:29,000
linear, so we just need an upper
bound of linear.

448
00:35:29,000 --> 00:35:32,000
So, this is a recurrence.
E[T(n)] is at most 2/n times

449
00:35:32,000 --> 00:35:36,000
the sum of half the numbers
between 0 and n of

450
00:35:36,000 --> 00:35:39,000
E[T(k)]+Theta(n).
It's a bit of a hairy recurrence.

451
00:35:39,000 --> 00:35:41,000
We want to solve it,
though.
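Written out, the chain of steps so far looks roughly like this. It is a reconstruction from the spoken derivation, not a copy of the board, and it assumes the starting expression is the sum over k of X_k times the cost of the recursive call plus the linear partitioning work.

```latex
\begin{align*}
E[T(n)]
  &= \sum_{k=0}^{n-1} E\big[ X_k \,\big( T(\max(k,\, n-k-1)) + \Theta(n) \big) \big] \\
  &= \sum_{k=0}^{n-1} E[X_k]\; E\big[ T(\max(k,\, n-k-1)) + \Theta(n) \big]
     && \text{(independence)} \\
  &= \frac{1}{n} \sum_{k=0}^{n-1} E\big[ T(\max(k,\, n-k-1)) \big]
     + \frac{1}{n} \sum_{k=0}^{n-1} \Theta(n)
     && (E[X_k] = 1/n) \\
  &\le \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} E[T(k)] + \Theta(n)
     && \text{(each term appears at most twice)}
\end{align*}
```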

452
00:35:41,000 --> 00:35:45,000
And it's actually a little bit
easier than the randomized

453
00:35:45,000 --> 00:35:48,000
quicksort recurrence.
We're going to solve it.

454
00:35:48,000 --> 00:35:51,000
What method should we use?
Sorry?

455
00:35:51,000 --> 00:35:53,000
Master method?
Master would be nice,

456
00:35:53,000 --> 00:35:57,000
except that each of the
recursive calls is with a

457
00:35:57,000 --> 00:36:01,000
different value of k.
The master method only works

458
00:36:01,000 --> 00:36:05,000
when all the calls are with the
same value, same size.

459
00:36:05,000 --> 00:36:09,000
Alas, it would be nice if we
could use the master method.

460
00:36:09,000 --> 00:36:11,000
What else do we have?
Substitution.

461
00:36:11,000 --> 00:36:13,000
When it's hard,
when in doubt,

462
00:36:13,000 --> 00:36:16,000
use substitution.
I mean the good thing here is

463
00:36:16,000 --> 00:36:20,000
we know what we want.
From the intuition at least,

464
00:36:20,000 --> 00:36:23,000
which is now erased,
we really feel that this should

465
00:36:23,000 --> 00:36:26,000
be linear time.
So, we know what we want to

466
00:36:26,000 --> 00:36:31,000
prove.
And indeed we can prove it just

467
00:36:31,000 --> 00:36:35,000
directly with substitution.

468
00:36:42,000 --> 00:36:46,000
I want to claim there is some
constant c greater than zero

469
00:36:46,000 --> 00:36:49,000
such that E[T(n)],
according to this recurrence,

470
00:36:49,000 --> 00:36:54,000
is at most c times n.
Let's prove that over here.

471
00:37:00,000 --> 00:37:04,000
As we guessed,
the proof is by substitution.

472
00:37:13,000 --> 00:37:18,000
What that means is we're going
to assume, by induction,

473
00:37:18,000 --> 00:37:22,000
that this inequality is true
for all smaller m.

474
00:37:22,000 --> 00:37:28,000
I will just say for m less than n.
And we need to prove it for n.

475
00:37:28,000 --> 00:37:33,000
We get E[T(n)].
Now we are just going to expand

476
00:37:33,000 --> 00:37:36,000
using the recurrence that we
have.

477
00:37:36,000 --> 00:37:40,000
It's at most this.
I will copy that over.

478
00:37:54,000 --> 00:37:57,000
And then each of these
recursive calls is with some

479
00:37:57,000 --> 00:38:00,000
value k that is strictly smaller
than n.

480
00:38:00,000 --> 00:38:03,000
Sorry, I copied it wrong,
floor of n over 2,

481
00:38:03,000 --> 00:38:07,000
not zero.
And so I can apply the

482
00:38:07,000 --> 00:38:11,000
induction hypothesis to each of
these.

483
00:38:11,000 --> 00:38:16,000
This is at most c times k by
the induction hypothesis.

484
00:38:16,000 --> 00:38:20,000
And so I get this inequality.

485
00:38:37,000 --> 00:38:40,000
This c can come outside the
summation because it's just a

486
00:38:40,000 --> 00:38:43,000
constant.
And I will be slightly tedious

487
00:38:43,000 --> 00:38:47,000
in writing this down again,
because what I care about is

488
00:38:47,000 --> 00:38:50,000
the summation here that is left
over.

489
00:38:56,000 --> 00:39:01,000
This is a good old-fashioned
summation.

490
00:39:01,000 --> 00:39:04,000
And if you remember back to
your summation tricks or

491
00:39:04,000 --> 00:39:07,000
whatever, you should be able to
evaluate this.

492
00:39:07,000 --> 00:39:11,000
If we started at zero and went
up to n minus 1,

493
00:39:11,000 --> 00:39:14,000
that's just an arithmetic
series, but here we have the

494
00:39:14,000 --> 00:39:16,000
tail end of an arithmetic
series.

495
00:39:16,000 --> 00:39:19,000
And you should know,
at least up to theta,

496
00:39:19,000 --> 00:39:21,000
what this is,
right?

497
00:39:21,000 --> 00:39:23,000
n^2, yeah.
It's definitely Theta(n^2).

498
00:39:23,000 --> 00:39:26,000
But we need here a slightly
better upper bound,

499
00:39:26,000 --> 00:39:31,000
as we will see the constants
really matter.

500
00:39:31,000 --> 00:39:35,000
What we're going to use is that
this summation is at most 3/8

501
00:39:35,000 --> 00:39:38,000
times n^2.
And that will be critical,

502
00:39:38,000 --> 00:39:41,000
the fact that 3/8 is smaller
than 1/2, I believe.

503
00:39:41,000 --> 00:39:44,000
So it's going to get rid of
this 2.

504
00:39:44,000 --> 00:39:47,000
I am not going to prove this.
This is an exercise.

505
00:39:47,000 --> 00:39:52,000
When you know that it is true,
it's easy because you can just

506
00:39:52,000 --> 00:39:55,000
prove it by induction.
Figuring out that number is a

507
00:39:55,000 --> 00:40:00,000
little bit more work,
but not too much more.

508
00:40:00,000 --> 00:40:04,000
So you should prove that by
induction.
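Before doing the induction, it can help to check the 3/8 bound numerically. This is only a sanity check under my own naming (tail_sum and bound_holds are illustrative helpers), not a proof. It uses integer arithmetic, comparing 8 times the sum against 3n^2, so there is no rounding involved.

```python
def tail_sum(n):
    """Sum of k for k = floor(n/2), ..., n-1."""
    return sum(range(n // 2, n))

def bound_holds(n):
    """Check tail_sum(n) <= (3/8) * n^2 without floating point."""
    return 8 * tail_sum(n) <= 3 * n * n

# Try every n in a modest range; the range limit is arbitrary.
print(all(bound_holds(n) for n in range(1, 10001)))
```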

509
00:40:04,000 --> 00:40:09,000
Now let me simplify.
This is a bit messy,

510
00:40:09,000 --> 00:40:15,000
but what I want is c times n.
Let's write it as our desired

511
00:40:15,000 --> 00:40:22,000
value minus the residual.
And here we have some crazy

512
00:40:22,000 --> 00:40:26,000
fractions.
This is 2 times 3 which is 6

513
00:40:26,000 --> 00:40:31,000
over 8 which is 3/4,
right?

514
00:40:31,000 --> 00:40:34,000
Here we have 1,
so we have to subtract off 1/4

515
00:40:34,000 --> 00:40:37,000
to get 3/4.
And this should be,

516
00:40:37,000 --> 00:40:42,000
I guess, 1/4 times c times n.
And then we have this theta n

517
00:40:42,000 --> 00:40:45,000
with double negation becomes a
plus theta n.

518
00:40:45,000 --> 00:40:49,000
That should be clear.
I am just rewriting that.

519
00:40:49,000 --> 00:40:52,000
So we have what we want over
here.

520
00:40:52,000 --> 00:40:57,000
And then we hope that this is
nonnegative because what we want

521
00:40:57,000 --> 00:41:03,000
is that this is less than or equal
to c times n.

522
00:41:03,000 --> 00:41:06,000
That will be true,
provided this thing is

523
00:41:06,000 --> 00:41:09,000
nonnegative.
And it looks pretty good

524
00:41:09,000 --> 00:41:13,000
because we're free to choose c
however large we want.

525
00:41:13,000 --> 00:41:17,000
Whatever constant is embedded
in this Theta notation is one

526
00:41:17,000 --> 00:41:21,000
fixed constant,
whatever makes this recurrence

527
00:41:21,000 --> 00:41:24,000
true.
We just set c to be bigger than

528
00:41:24,000 --> 00:41:28,000
4 times that constant and then
this will be nonnegative.
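The whole substitution step fits in one chain. Here a is my own name for the constant hidden in the Theta(n) term; it does not appear in the lecture.

```latex
\begin{align*}
E[T(n)]
  &\le \frac{2}{n} \sum_{k=\lfloor n/2 \rfloor}^{n-1} c\,k + a\,n
     && \text{(induction hypothesis)} \\
  &\le \frac{2c}{n} \cdot \frac{3}{8}\, n^2 + a\,n
     && \text{(summation bound)} \\
  &= \frac{3}{4}\, c\,n + a\,n \\
  &= c\,n - \Big( \frac{1}{4}\, c\,n - a\,n \Big) \\
  &\le c\,n
     && \text{whenever } c \ge 4a .
\end{align*}
```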

529
00:41:28,000 --> 00:41:32,000
So this is true for c
sufficiently large to dwarf that

530
00:41:32,000 --> 00:41:36,000
theta constant.
There's also the base case.

531
00:41:36,000 --> 00:41:41,000
I just have to make the cursory
mention that we choose c large

532
00:41:41,000 --> 00:41:45,000
enough so that this claim is
true, even in the base case

533
00:41:45,000 --> 00:41:48,000
where n is at most some
constant.

534
00:41:48,000 --> 00:41:52,000
Here it's like 1 or so because
then we're not making a

535
00:41:52,000 --> 00:41:55,000
recursive call.
What we get --

536
00:41:55,000 --> 00:41:59,000
This algorithm,
randomized select,

537
00:41:59,000 --> 00:42:05,000
has expected running time order
n, Theta(n).
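One way to get a feel for the constant is to evaluate the recurrence directly. The sketch below is an assumption-laden model, not the algorithm itself: it takes the Theta(n) term to be exactly a*n, sets T(0) = 0, and always charges the larger side, T(max(k, n-k-1)), as in the analysis. With a = 1 the proof above promises T(n) at most 4n.

```python
def expected_time_bound(n_max, a=1.0):
    """Tabulate T(n) = a*n + (1/n) * sum_{k=0}^{n-1} T(max(k, n-k-1)), T(0) = 0."""
    t = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        total = sum(t[max(k, n - k - 1)] for k in range(n))
        t[n] = a * n + total / n
    return t

t = expected_time_bound(400)
# Largest observed ratio T(n)/n over the table; the proof says it stays below 4.
print(max(t[n] / n for n in range(1, 401)))
```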

538
00:42:12,000 --> 00:42:15,000
The annoying thing is that in
the worst-case,

539
00:42:15,000 --> 00:42:19,000
if you're really,
really unlucky it's n^2.

540
00:42:19,000 --> 00:42:23,000
Any questions before we move on
from this point?

541
00:42:23,000 --> 00:42:29,000
This finishes off the proof of
this fact that we have Theta(n)

542
00:42:29,000 --> 00:42:32,000
expected time.
We already saw the n^2

543
00:42:32,000 --> 00:42:34,000
worst-case.
All perfectly clear?

544
00:42:34,000 --> 00:42:37,000
Good.
You should go over these

545
00:42:37,000 --> 00:42:39,000
proofs.
They're intrinsically related

546
00:42:39,000 --> 00:42:43,000
between randomized quicksort and
randomized select.

547
00:42:43,000 --> 00:42:47,000
Know them in your heart.
This is a great algorithm that

548
00:42:47,000 --> 00:42:52,000
works really well in practice
because most of the time you're

549
00:42:52,000 --> 00:42:54,000
going to split,
say, in the middle,

550
00:42:54,000 --> 00:43:00,000
somewhere between a 1/4 and 3/4
and everything is good.

551
00:43:00,000 --> 00:43:03,000
It's extremely unlikely that
you get the n^2 worst-case.

552
00:43:03,000 --> 00:43:06,000
It would have to happen with
like 1 over n^n probability or

553
00:43:06,000 --> 00:43:08,000
something really,
really small.

554
00:43:08,000 --> 00:43:10,000
But I am a theoretician at
least.

555
00:43:10,000 --> 00:43:14,000
And it would be really nice if
you could get Theta(n) in the

556
00:43:14,000 --> 00:43:16,000
worst-case.
That would be the cleanest

557
00:43:16,000 --> 00:43:19,000
result that you could hope for
because that's optimal.

558
00:43:19,000 --> 00:43:21,000
You cannot do better than
Theta(n).

559
00:43:21,000 --> 00:43:23,000
You've got to look at the
elements.

560
00:43:23,000 --> 00:43:25,000
So, you might ask,
can we get rid of this

561
00:43:25,000 --> 00:43:29,000
worst-case behavior and somehow
avoid randomization and

562
00:43:29,000 --> 00:43:33,000
guarantee Theta(n) worst-case
running time?

563
00:43:33,000 --> 00:43:39,000
And you can but it's a rather
nontrivial algorithm.

564
00:43:39,000 --> 00:43:45,000
And this is going to be one of
the most sophisticated that

565
00:43:45,000 --> 00:43:51,000
we've seen so far.
It won't continue to be the

566
00:43:51,000 --> 00:43:58,000
most sophisticated algorithm we
will see, but here it is.

567
00:43:58,000 --> 00:44:04,000
Worst-case linear time order
statistics.

568
00:44:09,000 --> 00:44:22,000
And this is an algorithm by
several, all very famous people,

569
00:44:22,000 --> 00:44:32,000
Blum, Floyd,
Pratt, Rivest and Tarjan.

570
00:44:32,000 --> 00:44:35,000
I think I've only met the B and
the R and the T.

571
00:44:35,000 --> 00:44:39,000
Oh, no, I've met Pratt as well.
I'm getting close to all the

572
00:44:39,000 --> 00:44:42,000
authors.
This is a somewhat old result,

573
00:44:42,000 --> 00:44:46,000
but at the time it was a major
breakthrough and still is an

574
00:44:46,000 --> 00:44:50,000
amazing algorithm.
Ron Rivest is a professor here.

575
00:44:50,000 --> 00:44:52,000
You should know him from the R
in RSA.

576
00:44:52,000 --> 00:44:56,000
When I took my PhD
comprehensives some time ago,

577
00:44:56,000 --> 00:45:00,000
on the cover sheet was a joke
question.

578
00:45:00,000 --> 00:45:04,000
It asked of the authors of the
worst-case linear time order

579
00:45:04,000 --> 00:45:08,000
statistics algorithm,
which of them is the most rich?

580
00:45:08,000 --> 00:45:13,000
Sadly it was not a graded part
of the comprehensive exam,

581
00:45:13,000 --> 00:45:18,000
but it was an amusing question.
I won't answer it here because

582
00:45:18,000 --> 00:45:21,000
we're on tape,
[LAUGHTER] but think about it.

583
00:45:21,000 --> 00:45:25,000
It may not be obvious.
Several of them are rich.

584
00:45:25,000 --> 00:45:30,000
It's just the question of who
is the most rich.

585
00:45:30,000 --> 00:45:33,000
Anyway, before they were rich
they came up with this

586
00:45:33,000 --> 00:45:35,000
algorithm.
They've come up with many

587
00:45:35,000 --> 00:45:38,000
algorithms since,
even after getting rich,

588
00:45:38,000 --> 00:45:42,000
believe it or not.
What we want is a good pivot,

589
00:45:42,000 --> 00:45:45,000
guaranteed good pivot.
Random pivot is going to be

590
00:45:45,000 --> 00:45:48,000
really good.
And so the simplest algorithm

591
00:45:48,000 --> 00:45:52,000
is just pick a random pivot.
It's going to be good with high

592
00:45:52,000 --> 00:45:55,000
probability.
We want to force a good pivot

593
00:45:55,000 --> 00:45:58,000
deterministically.
And the new idea here is we're

594
00:45:58,000 --> 00:46:02,000
going to generate it
recursively.

595
00:46:02,000 --> 00:46:04,000
What else could we do but
recurse?

596
00:46:04,000 --> 00:46:08,000
Well, you should know from your
recurrences that if we did two

597
00:46:08,000 --> 00:46:12,000
recursive calls on problems of
half the size and we have a

598
00:46:12,000 --> 00:46:16,000
linear extra work that's the
mergesort recurrence,

599
00:46:16,000 --> 00:46:20,000
T(n)=2T(n/2)+Theta(n).
You should recite it in your

600
00:46:20,000 --> 00:46:21,000
sleep.
That's n lg n.

601
00:46:21,000 --> 00:46:25,000
So we cannot recurse on two
problems of half the size.

602
00:46:25,000 --> 00:46:30,000
We've got to do better.
Somehow these recursions have

603
00:46:30,000 --> 00:46:32,000
to add up to strictly less than
n.

604
00:46:32,000 --> 00:46:35,000
That's the magic of this
algorithm.

605
00:46:35,000 --> 00:46:39,000
So this will just be called
select instead of rand-select.

606
00:46:39,000 --> 00:46:44,000
And it really depends on an
array, but I will focus on the

607
00:46:44,000 --> 00:46:48,000
i-th element that we want to
select and the size of the array

608
00:46:48,000 --> 00:46:53,000
that we want to select in.
And I am going to write this

609
00:46:53,000 --> 00:46:57,000
algorithm slightly less formally
than randomized-select because

610
00:46:57,000 --> 00:47:02,000
it's a bit higher level of an
algorithm.

611
00:47:22,000 --> 00:47:31,000
And let me draw over here the
picture of the algorithm.

612
00:47:31,000 --> 00:47:36,000
The first step is sort of the
weirdest and it's one of the key

613
00:47:36,000 --> 00:47:38,000
ideas.
You take your elements,

614
00:47:38,000 --> 00:47:43,000
and they are in no particular
order, so instead of drawing

615
00:47:43,000 --> 00:47:47,000
them on a line,
I am going to draw them in a 5

616
00:47:47,000 --> 00:47:49,000
by n over 5 grid.
Why not?

617
00:47:49,000 --> 00:47:54,000
This, unfortunately,
takes a little while to draw,

618
00:47:54,000 --> 00:48:00,000
but it will take you equally
long so I will take my time.

619
00:48:00,000 --> 00:48:02,000
It doesn't really matter what
the width is,

620
00:48:02,000 --> 00:48:06,000
but it should be width n over 5
so make sure you draw your

621
00:48:06,000 --> 00:48:08,000
figure accordingly.
Width n over 5,

622
00:48:08,000 --> 00:48:10,000
but the height should be
exactly 5.

623
00:48:10,000 --> 00:48:13,000
I think I got it right.
I can count that high.

624
00:48:13,000 --> 00:48:15,000
Here is 5.
And this should be,

625
00:48:15,000 --> 00:48:17,000
well, you know,
our number may not be divisible

626
00:48:17,000 --> 00:48:20,000
by 5, so maybe it ends off in
sort of an odd way.

627
00:48:20,000 --> 00:48:24,000
But what I would like is that
these chunks should be floor of

628
00:48:24,000 --> 00:48:26,000
n over 5.
And then we will have,

629
00:48:26,000 --> 00:48:30,000
at most, four elements left
over.

630
00:48:30,000 --> 00:48:33,000
So I am going to ignore those.
They don't really matter.

631
00:48:33,000 --> 00:48:36,000
It's just an additive constant.
Here is my array.

632
00:48:36,000 --> 00:48:39,000
I just happened to write it in
this funny way.

633
00:48:39,000 --> 00:48:42,000
And I will call these vertical
things groups.

634
00:48:42,000 --> 00:48:45,000
I would circle them,
and I did that in my notes,

635
00:48:45,000 --> 00:48:49,000
but things get really messy if
you start circling.

636
00:48:49,000 --> 00:48:53,000
This diagram is going to get
really full, just to warn you.

637
00:48:53,000 --> 00:48:55,000
By the end it will be almost
unintelligible,

638
00:48:55,000 --> 00:49:00,000
but there it is.
If you are really feeling

639
00:49:00,000 --> 00:49:03,000
bored, you can draw this a few
times.

640
00:49:03,000 --> 00:49:06,000
And you should draw how it
grows.

641
00:49:06,000 --> 00:49:10,000
So there are the groups,
vertical groups of five.

642
00:49:10,000 --> 00:49:12,000
Next step.

643
00:49:18,000 --> 00:49:24,000
The second step is to recurse.
This is where things are a bit

644
00:49:24,000 --> 00:49:28,000
unusual, well,
even more unusual.

645
00:49:28,000 --> 00:49:32,000
Oops, sorry.
I really should have had a line

646
00:49:32,000 --> 00:49:37,000
between one and two so I am
going to have to move this down

647
00:49:37,000 --> 00:49:40,000
and insert it here.
I also, in step one,

648
00:49:40,000 --> 00:49:44,000
want to find the median of each
group.

649
00:49:53,000 --> 00:49:56,000
What I would like to do is just
imagine this figure,

650
00:49:56,000 --> 00:49:59,000
each of the five elements in
each group gets reorganized so

651
00:49:59,000 --> 00:50:02,000
that the middle one is the
median.

652
00:50:02,000 --> 00:50:05,000
So I am going to call these the
medians of each group.

653
00:50:05,000 --> 00:50:10,000
I have five elements so the
median is right in the middle.

654
00:50:10,000 --> 00:50:13,000
There are two elements less
than the median,

655
00:50:13,000 --> 00:50:15,000
two elements greater than the
median.

656
00:50:15,000 --> 00:50:19,000
Again, we're assuming all
elements are distinct.

657
00:50:19,000 --> 00:50:21,000
So there they are.
I compute them.

658
00:50:21,000 --> 00:50:24,000
How long does that take me?
N over five groups,

659
00:50:24,000 --> 00:50:30,000
each with five elements,
compute the median of each one?

660
00:50:30,000 --> 00:50:32,000
Sorry?
Yeah, 2 times n over 5.

661
00:50:32,000 --> 00:50:34,000
It's theta n,
that's all I need to know.

662
00:50:34,000 --> 00:50:38,000
I mean, you're counting
comparisons, which is good.

663
00:50:38,000 --> 00:50:42,000
It's definitely Theta(n).
The point is within each group,

664
00:50:42,000 --> 00:50:46,000
I only have to do a constant
number of comparisons because

665
00:50:46,000 --> 00:50:48,000
it's a constant number of
elements.

666
00:50:48,000 --> 00:50:51,000
It doesn't matter.
You could use randomized select

667
00:50:51,000 --> 00:50:54,000
for all I care.
No matter what you do,

668
00:50:54,000 --> 00:50:59,000
it can only take a constant
number of comparisons.

669
00:50:59,000 --> 00:51:03,000
As long as you don't make a
comparison more than once.

670
00:51:03,000 --> 00:51:07,000
So this is easy.
You could sort the five numbers

671
00:51:07,000 --> 00:51:12,000
and then look at the third one,
it doesn't matter because there

672
00:51:12,000 --> 00:51:16,000
are only five of them.
That's one nifty idea.

673
00:51:16,000 --> 00:51:21,000
Already we have some elements
that are sort of vaguely in the

674
00:51:21,000 --> 00:51:25,000
middle but just of the group.
And we've only done linear

675
00:51:25,000 --> 00:51:29,000
work.
So doing well so far.

676
00:51:29,000 --> 00:51:33,000
Now we get to the second step,
which I started to write

677
00:51:33,000 --> 00:51:36,000
before, where we recurse.

678
00:51:58,000 --> 00:52:01,000
So the next idea is,
well, we have these floor of

679
00:52:01,000 --> 00:52:04,000
n over 5 medians.
I am going to compute the

680
00:52:04,000 --> 00:52:07,000
median of those medians.
I am imagining that I

681
00:52:07,000 --> 00:52:09,000
rearranged these.
And, unfortunately,

682
00:52:09,000 --> 00:52:11,000
it's an even number,
there are six of them,

683
00:52:11,000 --> 00:52:15,000
but I will rearrange so that
this guy, I have drawn in a

684
00:52:15,000 --> 00:52:18,000
second box, is the median of
these elements so that these two

685
00:52:18,000 --> 00:52:22,000
elements are strictly less than
this guy, these three elements

686
00:52:22,000 --> 00:52:24,000
are strictly greater than this
guy.

687
00:52:24,000 --> 00:52:27,000
Now, that doesn't directly tell
me anything, it would seem,

688
00:52:27,000 --> 00:52:31,000
about any of the elements out
here.

689
00:52:31,000 --> 00:52:35,000
We will come back to that.
In fact, it does tell us about

690
00:52:35,000 --> 00:52:38,000
some of the elements.
But right now this element is

691
00:52:38,000 --> 00:52:42,000
just the median of these guys.
Each of these guys is a median

692
00:52:42,000 --> 00:52:45,000
of five elements.
That's all we know.

693
00:52:45,000 --> 00:52:49,000
If we do that recursively,
this is going to take T of n

694
00:52:49,000 --> 00:52:51,000
over 5 time.
So far so good.

695
00:52:51,000 --> 00:52:55,000
We can afford a recursion on a
problem of size n over 5 and

696
00:52:55,000 --> 00:52:58,000
linear work.
We know that works out to

697
00:52:58,000 --> 00:53:00,000
linear time.
But there is more.

698
00:53:00,000 --> 00:53:02,000
We're obviously not done yet.

699
00:53:10,000 --> 00:53:12,000
The next step is x is our
partition element.

700
00:53:12,000 --> 00:53:15,000
We partition there.
The rest of the algorithm is

701
00:53:15,000 --> 00:53:19,000
just like randomized partition,
so we're going to define k to

702
00:53:19,000 --> 00:53:21,000
be the rank of x.
And this can be done,

703
00:53:21,000 --> 00:53:25,000
I mean it's n minus r plus 1 or
whatever, but I'm not going to

704
00:53:25,000 --> 00:53:30,000
write out how to do that because
we're at a higher level here.

705
00:53:30,000 --> 00:53:34,000
But it can be done.
And then we have the three-way

706
00:53:34,000 --> 00:53:37,000
branching.
So if i happens to equal k

707
00:53:37,000 --> 00:53:41,000
we're happy.
The pivot element is the

708
00:53:41,000 --> 00:53:46,000
element we're looking for,
but more likely i is either

709
00:53:46,000 --> 00:53:49,000
less than k or it is bigger than
k.

710
00:53:49,000 --> 00:53:53,000
And then we make the
appropriate recursive call,

711
00:53:53,000 --> 00:54:00,000
so here we recursively select
the i-th smallest element --

712
00:54:08,000 --> 00:54:11,000
-- in the lower part of the
array.

713
00:54:11,000 --> 00:54:16,000
Left of the partition element.
Otherwise, we recursively

714
00:54:16,000 --> 00:54:22,000
select the i minus k-th smallest
element in the upper part of the

715
00:54:22,000 --> 00:54:25,000
array.
I am writing this at a high

716
00:54:25,000 --> 00:54:30,000
level because we've already seen
it.

717
00:54:30,000 --> 00:54:36,000
All of this is the same as the
last couple steps of randomized

718
00:54:36,000 --> 00:54:37,000
select.
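The steps just described can be sketched in Python as follows. This is a minimal sketch, not the board pseudocode: it copies sublists instead of partitioning in place, it assumes all elements are distinct (as the lecture does), it uses a rank i starting at 1, and it keeps the leftover group of fewer than five elements rather than ignoring it.

```python
def select(a, i):
    """Return the i-th smallest element of a (i = 1 is the minimum).

    Assumes the elements of a are distinct and 1 <= i <= len(a).
    """
    if len(a) <= 5:
        return sorted(a)[i - 1]

    # Step 1: split into groups of 5 and take the median of each group.
    groups = [a[j:j + 5] for j in range(0, len(a), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]

    # Step 2: recursively select the median x of the group medians.
    x = select(medians, (len(medians) + 1) // 2)

    # Step 3: partition around x; k is the rank of x in a.
    lower = [v for v in a if v < x]
    upper = [v for v in a if v > x]
    k = len(lower) + 1

    # Step 4: three-way branch, as in randomized select.
    if i == k:
        return x
    if i < k:
        return select(lower, i)
    return select(upper, i - k)

# Usage: a fixed permutation of 0..100 (37 is coprime to 101).
data = [(37 * j) % 101 for j in range(101)]
print(select(data, 51))
```

Sorting each group of five is fine here because each group has constant size, so step 1 is still linear work overall.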

719
00:54:45,000 --> 00:54:48,000
That is the algorithm.
The real question is why does

720
00:54:48,000 --> 00:54:50,000
it work?
Why is this linear time?

721
00:54:50,000 --> 00:54:53,000
The first question is what's
the recurrence?

722
00:54:53,000 --> 00:54:56,000
We cannot quite write it down
yet because we don't know how

723
00:54:56,000 --> 00:55:00,000
big these recursive subproblems
could be.

724
00:55:00,000 --> 00:55:03,000
We're going to either recurse
in the lower part or the upper

725
00:55:03,000 --> 00:55:07,000
part, that's just like before.
If we're unlucky and we have a

726
00:55:07,000 --> 00:55:11,000
split of like zero to n minus
one, this is going to be a

727
00:55:11,000 --> 00:55:14,000
quadratic time algorithm.
The claim is that this

728
00:55:14,000 --> 00:55:18,000
partition element is guaranteed
to be pretty good and good

729
00:55:18,000 --> 00:55:21,000
enough.
The running time of this thing

730
00:55:21,000 --> 00:55:24,000
will be T of something times n,
and we don't know what the

731
00:55:24,000 --> 00:55:27,000
something is yet.
How big could it be?

732
00:55:27,000 --> 00:55:32,000
Well, I could ask you.
But we're sort of indirect here

733
00:55:32,000 --> 00:55:34,000
so I will tell you.
We have already a recursive

734
00:55:34,000 --> 00:55:38,000
call of T of n over 5.
It better be that whatever

735
00:55:38,000 --> 00:55:41,000
constant, so it's going to be
something times n,

736
00:55:41,000 --> 00:55:44,000
it better be that that constant
is strictly less than 4/5.

737
00:55:44,000 --> 00:55:48,000
If it's equal to 4/5 then
you're not splitting up the

738
00:55:48,000 --> 00:55:51,000
problem enough, and you get an n lg n
running time.

739
00:55:51,000 --> 00:55:55,000
If it's strictly less than 4/5
then you're reducing the problem

740
00:55:55,000 --> 00:55:59,000
by at least a constant factor.
In the sense if you add up all

741
00:55:59,000 --> 00:56:03,000
the recursive subproblems,
n over 5 and something times n,

742
00:56:03,000 --> 00:56:07,000
you get something that is a
constant strictly less than one

743
00:56:07,000 --> 00:56:09,000
times n.
That forces the work to be

744
00:56:09,000 --> 00:56:12,000
geometric.
If it's geometric you're going

745
00:56:12,000 --> 00:56:15,000
to get linear time.
So this is intuition but it's

746
00:56:15,000 --> 00:56:18,000
the right intuition.
Whenever you're aiming for

747
00:56:18,000 --> 00:56:21,000
linear time keep that in mind.
If you're doing a

748
00:56:21,000 --> 00:56:24,000
divide-and-conquer,
you've got to get the total

749
00:56:24,000 --> 00:56:27,000
subproblem size to be some
constant less than one times n.
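Here is that intuition as a short calculation. The symbols alpha (the still unknown "something"), beta, and a (the constant in the linear work) are my own labels, and floors are ignored, so treat it as a sketch of the recursion-tree argument rather than a proof.

```latex
\begin{align*}
T(n) &\le T\!\left(\tfrac{n}{5}\right) + T(\alpha n) + a\,n,
  \qquad \beta = \tfrac{1}{5} + \alpha < 1 . \\
\intertext{The subproblem sizes at depth $j$ of the recursion tree add up to at most $\beta^{j} n$, so}
T(n) &\le a\,n \sum_{j \ge 0} \beta^{j} = \frac{a\,n}{1 - \beta} = O(n).
\end{align*}
```

If beta were equal to 1, every level would cost about a*n and the sum would no longer be geometric, which is where the n lg n comes from.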

750
00:56:27,000 --> 00:56:32,000
That will work.
OK, so we've got to work out

751
00:56:32,000 --> 00:56:37,000
this constant here.
And we're going to use this

752
00:56:37,000 --> 00:56:42,000
figure, which so far looks
surprisingly uncluttered.

753
00:56:42,000 --> 00:56:48,000
Now we will make it cluttered.
What I would like to do is draw

754
00:56:48,000 --> 00:56:53,000
an arrow between two vertices,
two points, elements,

755
00:56:53,000 --> 00:57:00,000
whatever you want to call them.
Let's call them a and b.

756
00:57:00,000 --> 00:57:04,000
And I want to orient the arrow
so it points to a larger value,

757
00:57:04,000 --> 00:57:06,000
so this means that a is less
than b.

758
00:57:06,000 --> 00:57:09,000
This is notation just for the
diagram.

759
00:57:09,000 --> 00:57:13,000
And so this element,
I am going to write down what I

760
00:57:13,000 --> 00:57:15,000
know.
This element is the median of

761
00:57:15,000 --> 00:57:19,000
these five elements.
I will suppose that it is drawn

762
00:57:19,000 --> 00:57:22,000
so that these elements are
larger than the median,

763
00:57:22,000 --> 00:57:25,000
these elements are smaller than
the median.

764
00:57:25,000 --> 00:57:28,000
Therefore, I have arrows like
this.

765
00:57:28,000 --> 00:57:33,000
Here is where I wish I had some
colored chalk.

766
00:57:33,000 --> 00:57:36,000
This is just stating this guy
is in the middle of those five

767
00:57:36,000 --> 00:57:39,000
elements.
I know that in every single

768
00:57:39,000 --> 00:57:40,000
column.

769
00:57:55,000 --> 00:57:58,000
Here is where the diagram
starts to get messy.

770
00:57:58,000 --> 00:58:01,000
I am not done yet.
Now, we also know that this

771
00:58:01,000 --> 00:58:03,000
element is the median of the
medians.

772
00:58:03,000 --> 00:58:06,000
Of all the squared elements,
this guy is the middle.

773
00:58:06,000 --> 00:58:10,000
And I will draw it so that
these are the ones smaller than

774
00:58:10,000 --> 00:58:13,000
the median, these are the ones
larger than the median.

775
00:58:13,000 --> 00:58:15,000
I mean the algorithm cannot do
this.

776
00:58:15,000 --> 00:58:18,000
It doesn't necessarily know how
all this works.

777
00:58:18,000 --> 00:58:20,000
I guess it could,
but this is just for analysis

778
00:58:20,000 --> 00:58:23,000
purposes.
We know this guy is bigger than

779
00:58:23,000 --> 00:58:25,000
that one and bigger than that
one.

780
00:58:25,000 --> 00:58:29,000
We don't directly know about
the other elements.

781
00:58:29,000 --> 00:58:33,000
We just know that that one is
bigger than both of those and

782
00:58:33,000 --> 00:58:37,000
this guy is smaller than these.
Now, that is as messy as the

783
00:58:37,000 --> 00:58:40,000
figure will get.
Now, the nice thing about less

784
00:58:40,000 --> 00:58:43,000
than is that it's a transitive
relation.

785
00:58:43,000 --> 00:58:47,000
If I have a directed path in
this graph, I know that this

786
00:58:47,000 --> 00:58:51,000
element is strictly less than
that element because this is

787
00:58:51,000 --> 00:58:54,000
less than that one and this is
less than that one.

788
00:58:54,000 --> 00:58:59,000
Even though directly I only
know within a column and within

789
00:58:59,000 --> 00:59:02,000
this middle row,
I actually know that this

790
00:59:02,000 --> 00:59:05,000
element --
This is x, by the way.

791
00:59:05,000 --> 00:59:10,000
This element is larger than all
of these elements because it's

792
00:59:10,000 --> 00:59:15,000
larger than this one and this
one and each of these is larger

793
00:59:15,000 --> 00:59:17,000
than all of those by these
arrows.

794
00:59:17,000 --> 00:59:22,000
I also know that all of these
elements in this rectangle here,

795
00:59:22,000 --> 00:59:27,000
and you don't have to do this
but I will make the background

796
00:59:27,000 --> 00:59:32,000
even more cluttered.
All of these elements in this

797
00:59:32,000 --> 00:59:37,000
rectangle are greater than or
equal to this one and all of the

798
00:59:37,000 --> 00:59:42,000
elements in this rectangle are
less than or equal to x.

799
00:59:42,000 --> 00:59:47,000
Now, how many are there?
Well, this is roughly halfway

800
00:59:47,000 --> 00:59:52,000
along the set of groups and this
is 3/5 of these columns.

801
00:59:52,000 --> 00:59:57,000
So what we get is that there
are at least --

802
00:59:57,000 --> 01:00:03,554
We have n over 5 groups and we
have half of the groups that

803
01:00:03,554 --> 01:00:10,222
we're looking at here roughly,
so let's call that floor of n over 5

804
01:00:10,222 --> 01:00:16,664
over 2, and then within each
group we have three elements.

805
01:00:16,664 --> 01:00:23,219
So we have at least 3 times
floor of floor of n over 5 over

806
01:00:23,219 --> 01:00:30,000
2 elements that are less
than or equal to x.

807
01:00:30,000 --> 01:00:36,222
And we have the same that are
greater than or equal to x.

808
01:00:36,222 --> 01:00:40,444
Let me simplify this a little
bit more.

809
01:00:40,444 --> 01:00:45,222
I can also give you some more
justification,

810
01:00:45,222 --> 01:00:51,222
and we drew the picture,
but just for why this is true.

811
01:00:51,222 --> 01:00:57,777
We have at least n over 5 over
2 group medians that are less

812
01:00:57,777 --> 01:01:02,622
than or equal to x.
This is the argument we use.

813
01:01:02,622 --> 01:01:05,809
We have half of the group
medians are less than or equal

814
01:01:05,809 --> 01:01:08,590
to x because x is the median of
the group medians,

815
01:01:08,590 --> 01:01:11,892
so that is no big surprise.
This is almost an equality but

816
01:01:11,892 --> 01:01:14,905
we're making floors so it's
greater than or equal to.

817
01:01:14,905 --> 01:01:18,034
And then, for each group
median, we know that there are

818
01:01:18,034 --> 01:01:21,568
three elements there that are
less than or equal to that group

819
01:01:21,568 --> 01:01:23,133
median.
So, by transitivity,

820
01:01:23,133 --> 01:01:25,218
they're also less than or equal
to x.

821
01:01:25,218 --> 01:01:30,664
We get this number times three.
This is actually just floor of

822
01:01:30,664 --> 01:01:33,773
n over 10.
I was being unnecessarily

823
01:01:33,773 --> 01:01:38,126
complicated there,
but that is where it came from.

824
01:01:38,126 --> 01:01:43,544
What we know is that this thing
is now at least 3 times n over

825
01:01:43,544 --> 01:01:48,252
10, so roughly 3/10 of the
elements are on one side.

826
01:01:48,252 --> 01:01:53,137
In fact, at least 3/10 of the
elements are in each side.

827
01:01:53,137 --> 01:01:59,000
Therefore, each side has at
most 7/10 of the elements, roughly.

828
01:01:59,000 --> 01:02:01,214
So the number here will be
7/10.
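
The counting argument above can be checked empirically with a small Python sketch. The names `median_of_medians_pivot` and `side_counts` are made up for illustration, and the sketch sorts the group medians directly instead of recursing, which is an assumption made only to keep it short. It uses just the floor of n over 5 full groups, as in the bound.

```python
import random

def median_of_medians_pivot(a):
    """Median of the medians of the full groups of five in a (needs len(a) >= 5)."""
    full = len(a) - len(a) % 5            # ignore a leftover partial group
    groups = [sorted(a[i:i + 5]) for i in range(0, full, 5)]
    medians = sorted(g[2] for g in groups)  # sorted here only for illustration
    return medians[(len(medians) - 1) // 2]

def side_counts(a):
    """Return (#elements <= x, #elements >= x, 3 * floor(n / 10)) for pivot x."""
    x = median_of_medians_pivot(a)
    le = sum(1 for v in a if v <= x)
    ge = sum(1 for v in a if v >= x)
    return le, ge, 3 * (len(a) // 10)

random.seed(1)
sample = [random.randrange(1000) for _ in range(95)]
print(side_counts(sample))
```

Both counts should come out at least as large as the third number, which is the 3 times floor of n over 10 bound from the board.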

829
01:02:01,214 --> 01:02:04,642
And, if I'm lucky,
7/10 plus 1/5 is strictly less

830
01:02:04,642 --> 01:02:06,428
than one.
I believe it is,

831
01:02:06,428 --> 01:02:09,142
but I have trouble working with
tenths.

832
01:02:09,142 --> 01:02:11,357
I can only handle powers of
two.

833
01:02:11,357 --> 01:02:14,857
What we're going to use is a
minor simplification,

834
01:02:14,857 --> 01:02:19,214
which just barely still works,
is a little bit easier to think

835
01:02:19,214 --> 01:02:21,785
about.
It's mainly to get rid of this

836
01:02:21,785 --> 01:02:24,285
floor because the floor is
annoying.

837
01:02:24,285 --> 01:02:28,214
And we don't really have a
sloppiness lemma that applies

838
01:02:28,214 --> 01:02:31,463
here.
It turns out if n is

839
01:02:31,463 --> 01:02:34,975
sufficiently large,
3 times floor of n over 10 is

840
01:02:34,975 --> 01:02:38,707
greater than or equal to n over 4.
Quarters I can handle.
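
What "sufficiently large" means here can be checked by brute force. The sketch below is only a numeric check, not a proof, and the helper name `bound_holds` is hypothetical. It lists the n for which 3 times floor of n over 10 falls below n over 4.

```python
def bound_holds(n):
    """True when 3 * floor(n / 10) >= n / 4, written with integers only."""
    return 12 * (n // 10) >= n

# Every n in a finite test range where the simplification fails.
failing = [n for n in range(1, 1000) if not bound_holds(n)]
print(failing)
```

In this range the failures all sit below 50, which matches the cutoff of 50 used for the base case.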

841
01:02:38,707 --> 01:02:42,365
The claim is that each side
has size at least n over 4,

842
01:02:42,365 --> 01:02:46,609
therefore each side has size
at most 3n over 4 because there's a

843
01:02:46,609 --> 01:02:49,317
quarter on the other side.
This will be 3/4.

844
01:02:49,317 --> 01:02:53,048
And I can definitely tell that
1/5 is less than 1/4.

845
01:02:53,048 --> 01:02:57,292
This is going to add up to
something strictly less than one

846
01:02:57,292 --> 01:03:01,292
and then it will work.
How is my time?

847
01:03:01,292 --> 01:03:02,929
Good.
At this point,

848
01:03:02,929 --> 01:03:05,686
the rest of the analysis is
easy.

849
01:03:05,686 --> 01:03:09,993
How the heck you would come up
with this algorithm,

850
01:03:09,993 --> 01:03:14,818
you realize that this is
clearly a really good choice for

851
01:03:14,818 --> 01:03:19,643
finding a partition element,
just barely good enough that

852
01:03:19,643 --> 01:03:22,830
both recursions add up to linear
time.

853
01:03:22,830 --> 01:03:28,000
Well, that's why it took so
many famous people.

854
01:03:28,000 --> 01:03:30,780
Especially in quizzes,
but I think in general this

855
01:03:30,780 --> 01:03:34,241
class, you won't have to come up
with an algorithm this clever

856
01:03:34,241 --> 01:03:37,531
because you can just use this
algorithm to find the median.

857
01:03:37,531 --> 01:03:40,312
And the median is a really good
partition element.

858
01:03:40,312 --> 01:03:43,375
Now that you know this
algorithm, now that we're beyond

859
01:03:43,375 --> 01:03:45,815
1973, you don't need to know how
to do this.

860
01:03:45,815 --> 01:03:48,482
I mean you should know how this
algorithm works,

861
01:03:48,482 --> 01:03:51,943
but you don't need to do this
in another algorithm because you

862
01:03:51,943 --> 01:03:55,234
can just say run this algorithm,
you will get the median in

863
01:03:55,234 --> 01:03:58,524
linear time, and then you can
partition to the left and the

864
01:03:58,524 --> 01:04:02,225
right.
And then the left and the right

865
01:04:02,225 --> 01:04:04,737
will have exactly equal size.
Great.

866
01:04:04,737 --> 01:04:07,321
This is a really powerful
subroutine.

867
01:04:07,321 --> 01:04:11,700
You could use this all over the
place, and you will on Friday.

868
01:04:11,700 --> 01:04:14,858
Have I analyzed the running
time pretty much?

869
01:04:14,858 --> 01:04:18,806
The first step is linear.
The second step is T of n over

870
01:04:18,806 --> 01:04:20,027
5.
The third step,

871
01:04:20,027 --> 01:04:22,037
I didn't write it,
is linear.

872
01:04:22,037 --> 01:04:25,410
And then the last step is just
a recursive call.

873
01:04:25,410 --> 01:04:29,000
And now we know that this is
3/4.
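
The four steps just listed can be put together in a short Python sketch. This is a sketch under assumptions, not the code from the lecture: the name `select`, the three-way partition for handling duplicates, and the use of list copies instead of in-place partitioning are all choices made here for brevity. The base-case cutoff of 50 echoes the constant used in the analysis.

```python
def select(a, k):
    """Return the k-th smallest element of a (k = 1 gives the minimum)."""
    if len(a) <= 50:                      # small input: sorting costs O(1)
        return sorted(a)[k - 1]
    # Step 1: median of each full group of five, linear time overall.
    full = len(a) - len(a) % 5
    medians = [sorted(a[i:i + 5])[2] for i in range(0, full, 5)]
    # Step 2: recursively find the median of the medians, T(n/5).
    x = select(medians, (len(medians) + 1) // 2)
    # Step 3: partition around x, linear time.
    lower = [v for v in a if v < x]
    equal = [v for v in a if v == x]
    upper = [v for v in a if v > x]
    # Step 4: recurse only into the side that contains rank k.
    if k <= len(lower):
        return select(lower, k)
    if k <= len(lower) + len(equal):
        return x
    return select(upper, k - len(lower) - len(equal))
```

For example, `select(data, (len(data) + 1) // 2)` returns the lower median of a list `data`.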

874
01:04:34,000 --> 01:04:40,000
I get this recurrence.
T of n is, I'll say at most,

875
01:04:40,000 --> 01:04:47,079
T of n over 5 plus T of 3/4 n plus theta n.
You could have also used 7/10.

876
01:04:47,079 --> 01:04:54,400
It would give the same answer,
but you would also need a floor

877
01:04:54,400 --> 01:05:01,000
so we won't do that.
I claim that this is linear.

878
01:05:01,000 --> 01:05:07,000
How should I prove it?
Substitution.

879
01:05:12,000 --> 01:05:15,901
Claim that T of n is at most
again c times n,

880
01:05:15,901 --> 01:05:19,891
that will be enough.
Proof is by substitution.

881
01:05:19,891 --> 01:05:23,704
Again, we assume this is true
for smaller n.

882
01:05:23,704 --> 01:05:28,758
And want to prove it for n.
We have T of n is at most this

883
01:05:28,758 --> 01:05:31,489
thing.
T of n over 5.

884
01:05:31,489 --> 01:05:36,489
And by induction,
because n over 5 is smaller than

885
01:05:36,489 --> 01:05:40,000
n, we know that this is at most
c.

886
01:05:40,000 --> 01:05:43,723
Let me write it as c over 5
times n.

887
01:05:43,723 --> 01:05:47,765
Sure, why not.
Then we have here 3/4cn.

888
01:05:47,765 --> 01:05:53,085
And then we have a linear term.
Now, unfortunately,

889
01:05:53,085 --> 01:06:00,000
I have to deal with things that
are not powers of two.

890
01:06:00,000 --> 01:06:02,447
I will cheat and look at my
notes.

891
01:06:02,447 --> 01:06:06,599
This is also known as 19/20
times c times n plus theta n.

892
01:06:06,599 --> 01:06:10,826
And the point is just that this
is strictly less than one.

893
01:06:10,826 --> 01:06:15,202
Because it's strictly less than
one, I can write this as one

894
01:06:15,202 --> 01:06:19,206
times c of n minus some
constant, here it happens to be

895
01:06:19,206 --> 01:06:22,766
1/20, as long as I have
something left over here,

896
01:06:22,766 --> 01:06:26,622
1/20 times c times n.
Then I have this annoying theta

897
01:06:26,622 --> 01:06:30,923
n term which I want to get rid
of because I want this to be

898
01:06:30,923 --> 01:06:34,783
nonnegative.
But it is nonnegative,

899
01:06:34,783 --> 01:06:38,432
as long as I set c to be
really, really large,

900
01:06:38,432 --> 01:06:41,918
at least 20 times whatever
constant is here.

901
01:06:41,918 --> 01:06:46,216
So this is at most c times n
for c sufficiently large.
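
Written out, the substitution step looks like this. The constant a is an assumption introduced here to stand for the hidden constant in the theta n term.

```latex
\begin{align*}
T(n) &\le T(n/5) + T(3n/4) + \Theta(n) \\
     &\le \tfrac{1}{5}cn + \tfrac{3}{4}cn + an \\
     &= \tfrac{19}{20}cn + an \\
     &= cn - \left(\tfrac{1}{20}cn - an\right) \\
     &\le cn \qquad \text{whenever } c \ge 20a .
\end{align*}
```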

902
01:06:46,216 --> 01:06:50,189
And, oh, by the way,
if n is less than or equal to

903
01:06:50,189 --> 01:06:54,404
50, which we used up here,
then T of n is a constant,

904
01:06:54,404 --> 01:06:59,270
it doesn't really matter what
you do, and T of n is at most c

905
01:06:59,270 --> 01:07:03,000
times n for c sufficiently
large.

906
01:07:03,000 --> 01:07:06,017
That proves this claim.
Of course, the constant here is

907
01:07:06,017 --> 01:07:08,421
pretty damn big.
It depends exactly what the

908
01:07:08,421 --> 01:07:11,606
constants and the running times
are, which depends on your

909
01:07:11,606 --> 01:07:14,960
machine, but practically this
algorithm is not so hot because

910
01:07:14,960 --> 01:07:18,089
the constants are pretty big.
Even though this element is

911
01:07:18,089 --> 01:07:20,772
guaranteed to be somewhere
vaguely in the middle,

912
01:07:20,772 --> 01:07:23,566
and even though these
recursions add up to strictly

913
01:07:23,566 --> 01:07:26,752
less than n and it's geometric,
it's geometric because the

914
01:07:26,752 --> 01:07:31,000
problem is reducing by at least
a factor of 19/20 each time.

915
01:07:31,000 --> 01:07:34,742
So it actually takes a while
for the problem to get really

916
01:07:34,742 --> 01:07:37,106
small.
Practically you probably don't

917
01:07:37,106 --> 01:07:40,782
want to use this algorithm
unless you cannot somehow flip

918
01:07:40,782 --> 01:07:43,146
coins.
The randomized algorithm works

919
01:07:43,146 --> 01:07:46,166
really, really fast.
Theoretically this is your

920
01:07:46,166 --> 01:07:50,237
dream, the best you could hope
for because it's linear time and

921
01:07:50,237 --> 01:07:53,257
you need linear time, and it is
guaranteed linear time.

922
01:07:53,257 --> 01:07:55,161
I will mention,
before we end,

923
01:07:55,161 --> 01:07:57,000
an exercise.

924
01:08:03,000 --> 01:08:06,375
Why did we use groups of five?
Why not groups of three?

925
01:08:06,375 --> 01:08:09,062
As you might guess,
the answer is because it

926
01:08:09,062 --> 01:08:11,125
doesn't work with groups of
three.

927
01:08:11,125 --> 01:08:13,812
But it's quite instructive to
find out why.

928
01:08:13,812 --> 01:08:17,562
If you work through this math
with groups of three instead of

929
01:08:17,562 --> 01:08:20,250
groups of five,
you will find that you don't

930
01:08:20,250 --> 01:08:23,062
quite get the problem reduction
that you need.

931
01:08:23,062 --> 01:08:27,000
Five is the smallest number for
which this works.
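
As a hint for the exercise, here is a rough sketch of the same counting with groups of three, ignoring floors. This is one way to set it up, not a full solution.

```latex
\begin{align*}
\#\{\text{elements} \le x\} &\ge 2\left\lfloor \frac{\lfloor n/3 \rfloor}{2} \right\rfloor \approx \frac{n}{3}, \\
T(n) &\le T(n/3) + T(2n/3) + \Theta(n), \qquad \tfrac{1}{3} + \tfrac{2}{3} = 1 .
\end{align*}
```

Since the two fractions add up to exactly one, the substitution argument has nothing left over to absorb the linear term, and this recurrence only gives an n log n bound.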

932
01:08:27,000 --> 01:08:30,176
It would work with seven,
but theoretically not any

933
01:08:30,176 --> 01:08:32,973
better than a constant factor.
Any questions?

934
01:08:32,973 --> 01:08:35,069
All right.
Then recitation Friday.

935
01:08:35,069 --> 01:08:37,801
Homework lab Sunday.
Problem set due Monday.

936
01:08:37,801 --> 01:08:40,000
Quiz one in two weeks.