1
00:00:00,499 --> 00:00:03,500
PROFESSOR: PageRank is a measure
of the importance of a web

2
00:00:03,500 --> 00:00:04,450
page.

3
00:00:04,450 --> 00:00:08,039
But let me immediately
correct my own confusion

4
00:00:08,039 --> 00:00:09,530
that I suffered
from for some time

5
00:00:09,530 --> 00:00:12,680
until very recently, which
is that even though PageRank

6
00:00:12,680 --> 00:00:15,880
is used for ranking pages,
it's called PageRank

7
00:00:15,880 --> 00:00:20,190
after its discoverer,
developer, Larry Page,

8
00:00:20,190 --> 00:00:25,620
who was one of the co-founders
along with Sergey Brin of Google.

9
00:00:25,620 --> 00:00:30,790
So the motivation is that when
you-- at least before Google,

10
00:00:30,790 --> 00:00:33,520
when you did a standard
retrieval on a web page

11
00:00:33,520 --> 00:00:36,700
using keyword search and
similar kinds of criteria,

12
00:00:36,700 --> 00:00:40,110
you'd get back millions
of hits, most of which

13
00:00:40,110 --> 00:00:42,950
were really low
quality and you weren't

14
00:00:42,950 --> 00:00:45,850
interested in, and with
a few useful pages buried

15
00:00:45,850 --> 00:00:46,880
in the millions.

16
00:00:46,880 --> 00:00:50,110
And the question was,
all of these documents

17
00:00:50,110 --> 00:00:53,660
are indistinguishable in terms
of keyword search and textual

18
00:00:53,660 --> 00:00:56,800
patterns, how do you figure out
which are the important ones?

19
00:00:56,800 --> 00:01:00,240
And the idea that
Page came up with

20
00:01:00,240 --> 00:01:02,670
was to use the web
structure itself,

21
00:01:02,670 --> 00:01:04,550
the structure of
the worldwide web,

22
00:01:04,550 --> 00:01:07,820
to identify the
important documents.

23
00:01:07,820 --> 00:01:13,300
So we can think of the
whole internet as a graph

24
00:01:13,300 --> 00:01:18,470
where a user is on a page, and
we think of a URL as a link

25
00:01:18,470 --> 00:01:20,590
to another page,
as a directed edge.

26
00:01:20,590 --> 00:01:23,890
And users are kind of
randomly traveling around

27
00:01:23,890 --> 00:01:25,010
in the worldwide web.

28
00:01:25,010 --> 00:01:26,930
They're at a page, they
randomly click a link

29
00:01:26,930 --> 00:01:28,810
to get to another
page, and they keep

30
00:01:28,810 --> 00:01:31,274
doing a walk on the web graph.

31
00:01:31,274 --> 00:01:32,690
And every once in
a while, they're

32
00:01:32,690 --> 00:01:35,140
going to find that the
thread that they're on

33
00:01:35,140 --> 00:01:37,560
is kind of losing steam,
or they find themselves

34
00:01:37,560 --> 00:01:41,900
in some kind of a cycle and they
will randomly start over again

35
00:01:41,900 --> 00:01:43,710
at some other page.

36
00:01:43,710 --> 00:01:47,740
And we want to
argue or hypothesize

37
00:01:47,740 --> 00:01:50,310
that a page is
more important when

38
00:01:50,310 --> 00:01:52,210
it's viewed a large
fraction of the time

39
00:01:52,210 --> 00:01:56,350
by these random browsers
and random users.

40
00:01:56,350 --> 00:01:58,330
So to be formal,
we're going to take

41
00:01:58,330 --> 00:02:02,050
the entire worldwide web,
trillions of vertices,

42
00:02:02,050 --> 00:02:04,070
as a digraph.

43
00:02:04,070 --> 00:02:07,670
And there's going to be an
edge from one URL to another,

44
00:02:07,670 --> 00:02:14,800
from V to W, if there's a link
from the page V to the page W,

45
00:02:14,800 --> 00:02:17,160
or the URL W. W might
not even be a page,

46
00:02:17,160 --> 00:02:18,930
it might be a document,
which means it

47
00:02:18,930 --> 00:02:20,220
doesn't have any links on it.

48
00:02:20,220 --> 00:02:24,150
But the real vertices
are the web pages

49
00:02:24,150 --> 00:02:26,630
that have links on them.

50
00:02:26,630 --> 00:02:28,600
OK, that's the model.

51
00:02:28,600 --> 00:02:31,480
And we're going to make it
into a random walk graph

52
00:02:31,480 --> 00:02:36,140
by saying that if you look
at a URL V, at a vertex V,

53
00:02:36,140 --> 00:02:39,010
all of the edges out of
it are equally likely.

54
00:02:39,010 --> 00:02:41,830
It's a simple model, and
it might or might not work.

55
00:02:41,830 --> 00:02:43,850
But in fact, it did
work pretty well.

56
00:02:43,850 --> 00:02:49,650
That is the model of the
worldwide web as a random walk

57
00:02:49,650 --> 00:02:50,760
graph.

58
00:02:50,760 --> 00:02:52,960
So to be more precise,
the probability

59
00:02:52,960 --> 00:02:55,270
of the edge that
goes from V to W

60
00:02:55,270 --> 00:02:58,460
is 1 over the out
degree of V. That

61
00:02:58,460 --> 00:03:01,640
is, all of the out
degree of V edges

62
00:03:01,640 --> 00:03:05,296
leaving vertex V
get equal weight.

63
00:03:05,296 --> 00:03:08,160
Now to model this aspect
that the users start over

64
00:03:08,160 --> 00:03:10,850
again if they get bored
or they get stuck,

65
00:03:10,850 --> 00:03:13,990
we can formally
add to the digraph

66
00:03:13,990 --> 00:03:18,820
a hypothetical super-node,
which-- and with the property

67
00:03:18,820 --> 00:03:22,070
that there's an edge from the
super-node to every other node

68
00:03:22,070 --> 00:03:23,230
with equal likelihood.

69
00:03:23,230 --> 00:03:25,360
So once you hit
the super-node then

70
00:03:25,360 --> 00:03:28,690
following an edge is tantamount
to saying, pick a random page

71
00:03:28,690 --> 00:03:31,430
and start over again.

72
00:03:31,430 --> 00:03:35,090
To get to the
super-node, we have

73
00:03:35,090 --> 00:03:38,570
edges back from other
nodes in the graph

74
00:03:38,570 --> 00:03:40,470
back to the super-node.

75
00:03:40,470 --> 00:03:43,580
In the reading, we
said that we were

76
00:03:43,580 --> 00:03:47,990
going to have edges back
from terminal nodes that

77
00:03:47,990 --> 00:03:49,450
had no edges out.

78
00:03:49,450 --> 00:03:51,790
For example, a document
or something like that.

79
00:03:51,790 --> 00:03:56,020
That's actually not sufficient,
because-- for the PageRank

80
00:03:56,020 --> 00:03:58,200
to work in the theoretical
way that we want it

81
00:03:58,200 --> 00:04:02,490
to, because even if
there are no dead nodes,

82
00:04:02,490 --> 00:04:06,012
you might be in a clump of nodes
which you can't get out of.

83
00:04:06,012 --> 00:04:08,470
And you'd want to be able to--
and even though none of them

84
00:04:08,470 --> 00:04:10,678
was a dead end, because they
all had arrows going out

85
00:04:10,678 --> 00:04:11,310
to each other.

86
00:04:11,310 --> 00:04:13,970
And so you'd really want
a node from a-- an edge

87
00:04:13,970 --> 00:04:16,269
from a clump like that
back to the super-node

88
00:04:16,269 --> 00:04:18,290
to model starting over there.

89
00:04:18,290 --> 00:04:20,010
The simplest way
to do it really is

90
00:04:20,010 --> 00:04:22,940
to simply say that there's
an edge to the super-node

91
00:04:22,940 --> 00:04:24,600
from every vertex.

92
00:04:24,600 --> 00:04:28,800
So wherever you are, you can
randomly decide to start over.

93
00:04:28,800 --> 00:04:32,900
And Page and Brin
and their co-authors

94
00:04:32,900 --> 00:04:35,490
in the original
paper on PageRank

95
00:04:35,490 --> 00:04:38,740
suggested that the
edge back from a vertex

96
00:04:38,740 --> 00:04:42,370
to the super vertex might
get a special probability.

97
00:04:42,370 --> 00:04:44,650
It might be
customized, as opposed

98
00:04:44,650 --> 00:04:49,180
to being equally likely with
all of the other edges leaving

99
00:04:49,180 --> 00:04:50,170
a vertex.

100
00:04:50,170 --> 00:04:52,190
In fact, I think they
decided that there should

101
00:04:52,190 --> 00:04:57,350
be a 0.15 probability from
each vertex of jumping

102
00:04:57,350 --> 00:05:01,140
at random to the super-node.
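
As a rough illustration of the model just described, here is a minimal Python sketch of the transition probabilities with a super-node. The function name `transition_matrix`, the `links` dictionary, and the `restart` parameter are hypothetical names chosen for this example. It also assumes that the remaining 0.85 is split equally among a page's outgoing links and that a page with no links goes to the super-node with probability 1; the lecture does not spell out either detail.

```python
def transition_matrix(links, restart=0.15):
    """Build a row-stochastic matrix for a web graph plus one super-node.

    links: dict mapping each page to the list of pages it links to.
           Every linked page must itself be a key (assumption of this sketch).
    The super-node is the last row and column of the returned matrix.
    """
    pages = sorted(links)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    sup = n  # index of the super-node
    M = [[0.0] * (n + 1) for _ in range(n + 1)]

    for page, outs in links.items():
        i = index[page]
        if not outs:
            # Dead end (for example, a document): always start over.
            M[i][sup] = 1.0
            continue
        # Jump to the super-node with probability `restart` ...
        M[i][sup] = restart
        # ... otherwise follow one of the outgoing links, all equally likely.
        for target in outs:
            M[i][index[target]] += (1.0 - restart) / len(outs)

    # From the super-node, every ordinary page is equally likely.
    for j in range(n):
        M[sup][j] = 1.0 / n
    return pages, M


pages, M = transition_matrix({"a": ["b", "c"], "b": ["c"], "c": []})
for name, row in zip(pages + ["super"], M):
    print(name, [round(x, 3) for x in row])
```

Each row sums to 1, so every row is a probability distribution over where the walker goes next.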

103
00:05:01,140 --> 00:05:03,360
OK.

104
00:05:03,360 --> 00:05:05,600
Let's just illustrate
this with an example.

105
00:05:05,600 --> 00:05:08,830
This is a random walk graph that
we've seen before modeling coin

106
00:05:08,830 --> 00:05:09,330
flipping.

107
00:05:09,330 --> 00:05:10,913
And when I add the
super-node, there's

108
00:05:10,913 --> 00:05:15,470
this one new vertex
super, and there's

109
00:05:15,470 --> 00:05:19,040
an edge from the super
vertex to every other one

110
00:05:19,040 --> 00:05:20,860
of the vertices in the graph.

111
00:05:20,860 --> 00:05:23,070
And from each
vertex in the graph,

112
00:05:23,070 --> 00:05:24,670
there is an edge going back.

113
00:05:24,670 --> 00:05:28,520
I've illustrated that
with two-way arrows.

114
00:05:28,520 --> 00:05:32,050
So this is really an
arrow with two arrowheads.

115
00:05:32,050 --> 00:05:34,410
It represents an arrow
in each direction.

116
00:05:34,410 --> 00:05:37,750
Now in the original
paper, actually, Page

117
00:05:37,750 --> 00:05:39,560
didn't talk about
a super vertex.

118
00:05:39,560 --> 00:05:43,660
Instead, he talked about
each vertex randomly jumping

119
00:05:43,660 --> 00:05:45,040
to another vertex.

120
00:05:45,040 --> 00:05:47,740
But that would just get the
whole state diagram completely

121
00:05:47,740 --> 00:05:49,510
clogged up with
edges, so it's more

122
00:05:49,510 --> 00:05:53,660
economical to have everybody
jump to the super vertex

123
00:05:53,660 --> 00:05:56,570
and the super vertex
jump back to everybody.

124
00:05:56,570 --> 00:06:02,140
And that saves a
significant number of edges.

125
00:06:02,140 --> 00:06:06,190
So PageRank, then, is
obtained by computing

126
00:06:06,190 --> 00:06:10,670
a stationary distribution
for the worldwide web.

127
00:06:10,670 --> 00:06:14,090
So s bar is a vector
of length trillions

128
00:06:14,090 --> 00:06:18,550
whose coordinates are
indexed by the web pages.

129
00:06:18,550 --> 00:06:21,080
And we want to calculate
the stable distribution.

130
00:06:21,080 --> 00:06:25,360
And then we'll simply define
the PageRank of a page v

131
00:06:25,360 --> 00:06:28,950
as its probability
of being there

132
00:06:28,950 --> 00:06:30,520
in the stationary
distribution, the v

133
00:06:30,520 --> 00:06:35,670
component of the stable--
stationary distribution, s.

134
00:06:35,670 --> 00:06:37,520
And of course,
we'll rank v above w

135
00:06:37,520 --> 00:06:40,045
when the probability
of being in v

136
00:06:40,045 --> 00:06:45,680
is higher than the
probability of being in w.

137
00:06:45,680 --> 00:06:50,050
By the way, I don't
have the latest figures,

138
00:06:50,050 --> 00:06:55,245
but there were-- I guess
I've heard people who've

139
00:06:55,245 --> 00:06:58,070
worked for Google say, and in
some of the Wikipedia articles,

140
00:06:58,070 --> 00:07:02,010
that it takes a few
weeks for the crawlers

141
00:07:02,010 --> 00:07:07,550
to create a new map of the
web, to create the new graph.

142
00:07:07,550 --> 00:07:11,280
And then it takes
some number of hours,

143
00:07:11,280 --> 00:07:15,540
I think under days, to calculate
the stationary distribution

144
00:07:15,540 --> 00:07:21,540
on the graph, doing a lot
of parallel computation.

145
00:07:21,540 --> 00:07:26,620
So a useful feature about using
the stationary distribution

146
00:07:26,620 --> 00:07:32,780
is that ways to hack the
links in the worldwide web

147
00:07:32,780 --> 00:07:36,440
to make a page look
important are-- will not work

148
00:07:36,440 --> 00:07:37,840
very well against PageRank.

149
00:07:37,840 --> 00:07:40,450
So for example, one way
to look more important

150
00:07:40,450 --> 00:07:42,160
is to create a lot
of nodes pointing

151
00:07:42,160 --> 00:07:44,040
to yourself, fake nodes.

152
00:07:44,040 --> 00:07:46,650
But that's not going to matter,
because the fake nodes are not

153
00:07:46,650 --> 00:07:48,483
going to have much
weight since they're fake

154
00:07:48,483 --> 00:07:49,890
and nobody's pointing to them.

155
00:07:49,890 --> 00:07:53,960
So even though a large number
of fake nodes point to you,

156
00:07:53,960 --> 00:07:56,700
their cumulative weight is low,
and they're not adding a lot

157
00:07:56,700 --> 00:07:58,790
to your own probability.

158
00:07:58,790 --> 00:08:04,199
Likewise, you could try adding
links to important pages

159
00:08:04,199 --> 00:08:06,240
and try to make yourself
look important that way,

160
00:08:06,240 --> 00:08:08,080
but PageRank won't
make you look important

161
00:08:08,080 --> 00:08:10,320
at all if none of
those important nodes

162
00:08:10,320 --> 00:08:11,850
are pointing back.

163
00:08:11,850 --> 00:08:14,450
So both of these
simple-minded ways

164
00:08:14,450 --> 00:08:17,910
to try to look important
by manipulating links

165
00:08:17,910 --> 00:08:21,420
won't improve your PageRank.

166
00:08:21,420 --> 00:08:24,510
The super-node is
playing a technical role

167
00:08:24,510 --> 00:08:28,900
in making sure that the
stationary distribution exists.

168
00:08:28,900 --> 00:08:33,539
So it guarantees that there's a
unique stationary distribution,

169
00:08:33,539 --> 00:08:34,039
s bar.

170
00:08:34,039 --> 00:08:35,770
By the way, I sometimes
use the word stable

171
00:08:35,770 --> 00:08:36,820
and sometimes stationary.

172
00:08:36,820 --> 00:08:38,881
They're kind of
synonyms, although I

173
00:08:38,881 --> 00:08:40,714
think officially we
should stick to the word

174
00:08:40,714 --> 00:08:43,909
stationary distribution.

175
00:08:43,909 --> 00:08:48,490
As I've mentioned before, when
a digraph is strongly connected,

176
00:08:48,490 --> 00:08:50,950
that is a sufficient
condition for there

177
00:08:50,950 --> 00:08:54,130
to be a unique
stable distribution.

178
00:08:54,130 --> 00:08:57,210
That's actually proved in one
of the exercises in the text

179
00:08:57,210 --> 00:09:00,620
at the end of the chapter.

180
00:09:00,620 --> 00:09:03,770
The super-node
mechanism also ensures

181
00:09:03,770 --> 00:09:07,570
something even stronger, that
every initial distribution

182
00:09:07,570 --> 00:09:12,780
p converges to the
stationary distribution,

183
00:09:12,780 --> 00:09:14,530
to that unique
stationary distribution.

184
00:09:14,530 --> 00:09:18,420
Stated precisely
mathematically, if you start off

185
00:09:18,420 --> 00:09:21,780
at an arbitrary distribution
of probabilities

186
00:09:21,780 --> 00:09:24,350
of being in different
states, p, and you

187
00:09:24,350 --> 00:09:28,360
look at what happens to p
after t steps-- remember,

188
00:09:28,360 --> 00:09:33,470
that you get by multiplying the
vector p by the matrix M raised

189
00:09:33,470 --> 00:09:36,460
to the power t-- and you take
the limit as t approaches

190
00:09:36,460 --> 00:09:40,080
infinity, that is to
say, what distribution

191
00:09:40,080 --> 00:09:44,580
do you approach as you
do more and more updates.

192
00:09:44,580 --> 00:09:46,520
And it turns out that
that limit exists,

193
00:09:46,520 --> 00:09:48,399
and it is that
stationary distribution.

194
00:09:48,399 --> 00:09:49,940
So it doesn't matter
where you start,

195
00:09:49,940 --> 00:09:51,577
you're going to wind up stable.

196
00:09:51,577 --> 00:09:53,660
And as a matter of fact,
the convergence is rapid.

197
00:09:53,660 --> 00:09:55,980
What that means is
that you can actually

198
00:09:55,980 --> 00:09:58,630
calculate the stable
distribution reasonably

199
00:09:58,630 --> 00:10:01,590
quickly, because you don't
need a very large t in order

200
00:10:01,590 --> 00:10:03,780
to arrive at a very
good approximation

201
00:10:03,780 --> 00:10:06,390
to the stable distribution.
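
The update-until-it-settles idea can be sketched in a few lines of Python. This is only an illustrative sketch, not how Google computes it: the names `step` and `stationary`, the tolerance, and the tiny four-state matrix (pages a, b, c plus a super-node, with a 0.15 restart probability) are all assumptions made for the example.

```python
def step(p, M):
    """One update: the row vector p multiplied by the matrix M."""
    n = len(M)
    return [sum(p[i] * M[i][j] for i in range(n)) for j in range(n)]


def stationary(M, p=None, tol=1e-12, max_steps=10_000):
    """Repeat p <- p M until successive distributions differ by less than tol.

    Returns the approximate stationary distribution and the number of steps.
    """
    n = len(M)
    if p is None:
        p = [1.0 / n] * n  # start from the uniform distribution
    for t in range(1, max_steps + 1):
        q = step(p, M)
        if max(abs(a - b) for a, b in zip(p, q)) < tol:
            return q, t
        p = q
    return p, max_steps


# Hypothetical example: a -> b, b -> a, c -> a, and the last state is the
# super-node, reached from every page with probability 0.15.
M = [
    [0.00, 0.85, 0.00, 0.15],  # a
    [0.85, 0.00, 0.00, 0.15],  # b
    [0.85, 0.00, 0.00, 0.15],  # c
    [1 / 3, 1 / 3, 1 / 3, 0.00],  # super-node
]

s, steps = stationary(M)
print("steps:", steps)
print("stationary:", [round(x, 4) for x in s])

# A different starting point ends up at the same distribution.
s2, _ = stationary(M, p=[0.0, 0.0, 1.0, 0.0])
print("same limit:", all(abs(x - y) < 1e-9 for x, y in zip(s, s2)))
```

Sorting the pages by their entries in `s` gives the ranking: a page v ranks above a page w when its entry is larger.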

202
00:10:06,390 --> 00:10:09,480
Now the actual Google
rank and ranking

203
00:10:09,480 --> 00:10:11,900
is more complicated
than just PageRank.

204
00:10:11,900 --> 00:10:15,120
PageRank was the original idea
that got a lot of attention.

205
00:10:15,120 --> 00:10:17,200
And in fact, the latest
information from Google

206
00:10:17,200 --> 00:10:20,190
is that they think it
gets too much attention today

207
00:10:20,190 --> 00:10:23,840
in the modern world by too many
commentators and people trying

208
00:10:23,840 --> 00:10:26,420
to simulate ranking.

209
00:10:26,420 --> 00:10:30,070
So the actual rank rules are
a closely-held trade secret

210
00:10:30,070 --> 00:10:32,070
for Google-- by Google.

211
00:10:32,070 --> 00:10:34,790
They use text, they use
location, they use payments,

212
00:10:34,790 --> 00:10:37,450
because advertisers can
pay to have their search

213
00:10:37,450 --> 00:10:41,730
results listed more prominently,
and lots of other criteria

214
00:10:41,730 --> 00:10:43,750
that have evolved over 15 years.

215
00:10:43,750 --> 00:10:44,950
And they continue to evolve.

216
00:10:44,950 --> 00:10:47,670
As people find ways to
manipulate the ranking,

217
00:10:47,670 --> 00:10:51,730
Google revises its ranking
criteria and algorithms.

218
00:10:51,730 --> 00:10:54,140
But nevertheless,
PageRank continues

219
00:10:54,140 --> 00:10:58,250
to play a significant
role in the whole story.