1 00:00:00,499 --> 00:00:03,500 PROFESSOR: PageRank is a measure of the importance of a web 2 00:00:03,500 --> 00:00:04,450 page. 3 00:00:04,450 --> 00:00:08,039 But let me immediately correct my own confusion 4 00:00:08,039 --> 00:00:09,530 that I suffered from for some time 5 00:00:09,530 --> 00:00:12,680 until very recently, which is that even though PageRank 6 00:00:12,680 --> 00:00:15,880 is used for ranking pages, it's called PageRank 7 00:00:15,880 --> 00:00:20,190 after its discoverer, developer, Larry Page, 8 00:00:20,190 --> 00:00:25,620 who was one of the co-founders along with Sergey Brin of Google. 9 00:00:25,620 --> 00:00:30,790 So the motivation is that when you-- at least before Google, 10 00:00:30,790 --> 00:00:33,520 when you did a standard retrieval on a web page 11 00:00:33,520 --> 00:00:36,700 using keyword search and similar kinds of criteria, 12 00:00:36,700 --> 00:00:40,110 you'd get back millions of hits, most of which 13 00:00:40,110 --> 00:00:42,950 were really low quality and you weren't 14 00:00:42,950 --> 00:00:45,850 interested in, and with a few useful pages buried 15 00:00:45,850 --> 00:00:46,880 in the millions. 16 00:00:46,880 --> 00:00:50,110 And the question was, all of these documents 17 00:00:50,110 --> 00:00:53,660 are indistinguishable in terms of keyword search and textual 18 00:00:53,660 --> 00:00:56,800 patterns, how do you figure out which are the important ones. 19 00:00:56,800 --> 00:01:00,240 And the idea that Page came up with 20 00:01:00,240 --> 00:01:02,670 was to use the web structure itself, 21 00:01:02,670 --> 00:01:04,550 the structure of the worldwide web, 22 00:01:04,550 --> 00:01:07,820 to identify the important documents. 23 00:01:07,820 --> 00:01:13,300 So we can think of the whole internet as a graph 24 00:01:13,300 --> 00:01:18,470 where a user is on a page, and we think of a URL as a link 25 00:01:18,470 --> 00:01:20,590 to another page, as a directed edge. 
26 00:01:20,590 --> 00:01:23,890 And users are kind of randomly traveling around 27 00:01:23,890 --> 00:01:25,010 in the worldwide web. 28 00:01:25,010 --> 00:01:26,930 They're at a page, they randomly click a link 29 00:01:26,930 --> 00:01:28,810 to get to another page, and they keep 30 00:01:28,810 --> 00:01:31,274 doing a walk on the web graph. 31 00:01:31,274 --> 00:01:32,690 And every once in a while, they're 32 00:01:32,690 --> 00:01:35,140 going to find that the thread that they're on 33 00:01:35,140 --> 00:01:37,560 is kind of losing steam, where they find themselves 34 00:01:37,560 --> 00:01:41,900 in some kind of a cycle and they will randomly start over again 35 00:01:41,900 --> 00:01:43,710 at some other page. 36 00:01:43,710 --> 00:01:47,740 And we want to argue or hypothesize 37 00:01:47,740 --> 00:01:50,310 that a page is more important when 38 00:01:50,310 --> 00:01:52,210 it's viewed a large fraction of the time 39 00:01:52,210 --> 00:01:56,350 by these random browsers and random users. 40 00:01:56,350 --> 00:01:58,330 So to be formal, we're going to take 41 00:01:58,330 --> 00:02:02,050 the entire worldwide web, trillions of vertices, 42 00:02:02,050 --> 00:02:04,070 as a digraph. 43 00:02:04,070 --> 00:02:07,670 And there's going to be an edge from one URL to another, 44 00:02:07,670 --> 00:02:14,800 from V to W, if there's a link from the page V to the page W, 45 00:02:14,800 --> 00:02:17,160 or the URL W. W might not even be a page, 46 00:02:17,160 --> 00:02:18,930 it might be a document, which means it 47 00:02:18,930 --> 00:02:20,220 doesn't have any links on it. 48 00:02:20,220 --> 00:02:24,150 But the real vertices are the web pages 49 00:02:24,150 --> 00:02:26,630 that have links on them. 50 00:02:26,630 --> 00:02:28,600 OK, that's the model. 
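A minimal sketch of this model, assuming a tiny made-up web. The page names, the `links` dictionary, and the helper functions are hypothetical and only illustrate the idea: vertices are URLs, a link from V to W is a directed edge, and a document with no links on it is a vertex with no outgoing edges.

```python
# Hypothetical toy web: each key is a URL (vertex) and each value lists the
# URLs it links to (directed edges). All names are illustrative only.
links = {
    "home": ["news", "blog"],
    "news": ["home", "paper.pdf"],
    "blog": ["home"],
    "paper.pdf": [],  # a document: it has no links on it
}


def out_degree(graph, v):
    """Number of edges leaving vertex v."""
    return len(graph[v])


def edges(graph):
    """All directed edges (v, w) such that page v links to w."""
    return [(v, w) for v, targets in graph.items() for w in targets]


# Vertices with no outgoing edges, such as plain documents.
dead_ends = [v for v in links if out_degree(links, v) == 0]
```

An adjacency-list dictionary is used here only because it keeps the out-degree of each vertex easy to read off, which is what the random walk described next depends on.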
51 00:02:28,600 --> 00:02:31,480 And we're going to make it into a random walk graph 52 00:02:31,480 --> 00:02:36,140 by saying that if you look at a URL V, at a vertex V, 53 00:02:36,140 --> 00:02:39,010 all of the edges out of it are equally likely. 54 00:02:39,010 --> 00:02:41,830 It's a simple model, and it might or might not work. 55 00:02:41,830 --> 00:02:43,850 But in fact, it did work pretty well. 56 00:02:43,850 --> 00:02:49,650 That is the model of the worldwide web as a random walk 57 00:02:49,650 --> 00:02:50,760 graph. 58 00:02:50,760 --> 00:02:52,960 So to be more precise, the probability 59 00:02:52,960 --> 00:02:55,270 of the edge that goes from V to W 60 00:02:55,270 --> 00:02:58,460 is 1 over the out degree of V. That 61 00:02:58,460 --> 00:03:01,640 is, all of the out degree of V edges 62 00:03:01,640 --> 00:03:05,296 leaving vertex V get equal weight. 63 00:03:05,296 --> 00:03:08,160 Now to model this aspect that the users start over 64 00:03:08,160 --> 00:03:10,850 again if they get bored or they get stuck, 65 00:03:10,850 --> 00:03:13,990 we can formally add to the digraph 66 00:03:13,990 --> 00:03:18,820 a hypothetical super-node, which-- and with the property 67 00:03:18,820 --> 00:03:22,070 that there's an edge from the super-node to every other node 68 00:03:22,070 --> 00:03:23,230 with equal likelihood. 69 00:03:23,230 --> 00:03:25,360 So once you hit the super-node then 70 00:03:25,360 --> 00:03:28,690 following an edge is tantamount to saying, pick a random page 71 00:03:28,690 --> 00:03:31,430 and start over again. 72 00:03:31,430 --> 00:03:35,090 To get to the super-node, we have 73 00:03:35,090 --> 00:03:38,570 edges back from other nodes in the graph 74 00:03:38,570 --> 00:03:40,470 back to the super-node. 75 00:03:40,470 --> 00:03:43,580 In the reading, we said that we were 76 00:03:43,580 --> 00:03:47,990 going to have edges back from terminal nodes that 77 00:03:47,990 --> 00:03:49,450 had no edges out. 
78 00:03:49,450 --> 00:03:51,790 For example, a document or something like that. 79 00:03:51,790 --> 00:03:56,020 That's actually not sufficient, because-- for the PageRank 80 00:03:56,020 --> 00:03:58,200 to work in the theoretical way that we want it 81 00:03:58,200 --> 00:04:02,490 to because even if there are no dead nodes, 82 00:04:02,490 --> 00:04:06,012 you might be in a clump of nodes which you can't get out of. 83 00:04:06,012 --> 00:04:08,470 And you'd want to be able to-- and even though none of them 84 00:04:08,470 --> 00:04:10,678 was a dead end, because they all had arrows going out 85 00:04:10,678 --> 00:04:11,310 to each other. 86 00:04:11,310 --> 00:04:13,970 And so you'd really want a node from a-- an edge 87 00:04:13,970 --> 00:04:16,269 from a clump like that back to the super-node 88 00:04:16,269 --> 00:04:18,290 to model starting over there. 89 00:04:18,290 --> 00:04:20,010 The simplest way to do it really is 90 00:04:20,010 --> 00:04:22,940 to simply say that there's an edge to the super-node 91 00:04:22,940 --> 00:04:24,600 from every vertex. 92 00:04:24,600 --> 00:04:28,800 So wherever you are, you can randomly decide to start over. 93 00:04:28,800 --> 00:04:32,900 And Page and Brin and their co-authors 94 00:04:32,900 --> 00:04:35,490 in the original paper on PageRank 95 00:04:35,490 --> 00:04:38,740 suggested that the edge back from a vertex 96 00:04:38,740 --> 00:04:42,370 to the super vertex might get a special probability. 97 00:04:42,370 --> 00:04:44,650 It might be customized, as opposed 98 00:04:44,650 --> 00:04:49,180 to being equally likely with all of the other edges leaving 99 00:04:49,180 --> 00:04:50,170 a vertex. 100 00:04:50,170 --> 00:04:52,190 In fact, I think they decided that there should 101 00:04:52,190 --> 00:04:57,350 be a 0.15 probability from each vertex of jumping 102 00:04:57,350 --> 00:05:01,140 at random to the super-node. 103 00:05:01,140 --> 00:05:03,360 OK. 
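The edge probabilities just described can be sketched as follows. This is an illustration under stated assumptions, not the method of the original paper: the function name `transition_probabilities`, the label `"SUPER"`, and the toy graph are made up, each vertex is assumed to list distinct out-neighbors, and a dead end is assumed to go to the super-node with probability 1 (the lecture does not say how dead ends are weighted).

```python
def transition_probabilities(graph, restart=0.15, super_node="SUPER"):
    """Return {v: {w: probability}} for the random walk with a super-node.

    Assumptions for this sketch: `graph` maps each vertex to a list of
    distinct out-neighbors, `super_node` is not already a vertex name, and
    a vertex with no out-edges moves to the super-node with probability 1.
    """
    pages = list(graph)
    probs = {}
    for v, targets in graph.items():
        if targets:
            # Jump to the super-node with probability `restart`; the rest
            # is split equally among the out-degree(v) edges leaving v.
            share = (1.0 - restart) / len(targets)
            row = {w: share for w in targets}
            row[super_node] = restart
        else:
            row = {super_node: 1.0}
        probs[v] = row
    # From the super-node, every page is equally likely.
    probs[super_node] = {v: 1.0 / len(pages) for v in pages}
    return probs


# Hypothetical three-vertex web; "c" is a dead end.
toy = {"a": ["b", "c"], "b": ["a"], "c": []}
probs = transition_probabilities(toy)
```

With `restart=0.15`, vertex `a` sends 0.15 to the super-node and splits the remaining 0.85 equally between its two out-edges, so every row of probabilities still sums to 1.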
104 00:05:03,360 --> 00:05:05,600 Let's just illustrate this with an example. 105 00:05:05,600 --> 00:05:08,830 This is a random walk graph that we've seen before modeling coin 106 00:05:08,830 --> 00:05:09,330 flipping. 107 00:05:09,330 --> 00:05:10,913 And when I add the super-node, there's 108 00:05:10,913 --> 00:05:15,470 this one new vertex super, and there's 109 00:05:15,470 --> 00:05:19,040 an edge from the super vertex to every other one 110 00:05:19,040 --> 00:05:20,860 of the vertices in the graph. 111 00:05:20,860 --> 00:05:23,070 And from each vertex in the graph, 112 00:05:23,070 --> 00:05:24,670 there is an edge going back. 113 00:05:24,670 --> 00:05:28,520 I've illustrated that with two-way arrows. 114 00:05:28,520 --> 00:05:32,050 So this is really an arrow with two arrowheads. 115 00:05:32,050 --> 00:05:34,410 It represents an arrow in each direction. 116 00:05:34,410 --> 00:05:37,750 Now in the original paper, actually, Page 117 00:05:37,750 --> 00:05:39,560 didn't talk about a super vertex. 118 00:05:39,560 --> 00:05:43,660 Instead, he talked about each vertex randomly jumping 119 00:05:43,660 --> 00:05:45,040 to another vertex. 120 00:05:45,040 --> 00:05:47,740 But that would just get the whole state diagram completely 121 00:05:47,740 --> 00:05:49,510 clogged up with edges, so it's more 122 00:05:49,510 --> 00:05:53,660 economical to have everybody jump to the super vertex 123 00:05:53,660 --> 00:05:56,570 and the super vertex jump back to everybody. 124 00:05:56,570 --> 00:06:02,140 And that saves a significant number of edges. 125 00:06:02,140 --> 00:06:06,190 So PageRank, then, is obtained by computing 126 00:06:06,190 --> 00:06:10,670 a stationary distribution for the worldwide web. 127 00:06:10,670 --> 00:06:14,090 So s bar is a vector of length trillions 128 00:06:14,090 --> 00:06:18,550 whose coordinates are indexed by the web pages. 129 00:06:18,550 --> 00:06:21,080 And we want to calculate the stable distribution. 
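For reference, the condition that makes s bar stationary can be written out as below. The notation is an assumption for this note: M is taken to be the edge-probability matrix of the random walk graph, with entry M_{v,w} equal to the probability of the edge from v to w, and distributions are row vectors, matching the "vector p times matrix M" convention used later in the lecture.

```latex
% \bar{s} is stationary when one step of the walk leaves it unchanged:
\bar{s}\,M = \bar{s},
\qquad\text{that is,}\qquad
s_w = \sum_{v} s_v \, M_{v,w} \quad \text{for every vertex } w,
% and it is a probability distribution over the vertices:
\qquad
s_v \ge 0, \qquad \sum_{v} s_v = 1 .
```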
130 00:06:21,080 --> 00:06:25,360 And then we'll simply define the page rank of a page 131 00:06:25,360 --> 00:06:28,950 as its probability of being there 132 00:06:28,950 --> 00:06:30,520 in the stationary distribution, the v 133 00:06:30,520 --> 00:06:35,670 component of the stable-- stationary distribution, s. 134 00:06:35,670 --> 00:06:37,520 And of course, we'll rank v above w 135 00:06:37,520 --> 00:06:40,045 when the probability of being in v 136 00:06:40,045 --> 00:06:45,680 is higher than the probability of being in w. 137 00:06:45,680 --> 00:06:50,050 By the way, I don't have the latest figures, 138 00:06:50,050 --> 00:06:55,245 but there were-- I guess I've heard people who've 139 00:06:55,245 --> 00:06:58,070 worked for Google say, and in some of the Wikipedia articles, 140 00:06:58,070 --> 00:07:02,010 that it takes a few weeks for the crawlers 141 00:07:02,010 --> 00:07:07,550 to create a new map of the web, to create the new graph. 142 00:07:07,550 --> 00:07:11,280 And then it takes some number of hours, 143 00:07:11,280 --> 00:07:15,540 I think under days, to calculate the stationary distribution 144 00:07:15,540 --> 00:07:21,540 on the graph, doing a lot of parallel computation. 145 00:07:21,540 --> 00:07:26,620 So a useful feature about using the stationary distribution 146 00:07:26,620 --> 00:07:32,780 is that ways to hack the links in the worldwide web 147 00:07:32,780 --> 00:07:36,440 to make a page look important are-- will not work 148 00:07:36,440 --> 00:07:37,840 very well against PageRank. 149 00:07:37,840 --> 00:07:40,450 So for example, one way to look more important 150 00:07:40,450 --> 00:07:42,160 is to create a lot of nodes pointing 151 00:07:42,160 --> 00:07:44,040 to yourself, fake nodes. 152 00:07:44,040 --> 00:07:46,650 But that's not going to matter, because the fake nodes are not 153 00:07:46,650 --> 00:07:48,483 going to have much weight since they're fake 154 00:07:48,483 --> 00:07:49,890 and nobody's pointing to them. 
155 00:07:49,890 --> 00:07:53,960 So even though a large number of fake nodes point to you, 156 00:07:53,960 --> 00:07:56,700 their cumulative weight is low, and they're not adding a lot 157 00:07:56,700 --> 00:07:58,790 to your own probability. 158 00:07:58,790 --> 00:08:04,199 Likewise, you could try taking links to important pages 159 00:08:04,199 --> 00:08:06,240 and try to make yourself look important that way, 160 00:08:06,240 --> 00:08:08,080 but PageRank won't make you look important 161 00:08:08,080 --> 00:08:10,320 at all if none of those important nodes 162 00:08:10,320 --> 00:08:11,850 are pointing back. 163 00:08:11,850 --> 00:08:14,450 So both of these simple-minded ways 164 00:08:14,450 --> 00:08:17,910 to try to look important by manipulating links 165 00:08:17,910 --> 00:08:21,420 won't improve your page rank. 166 00:08:21,420 --> 00:08:24,510 The super-node is playing a technical role 167 00:08:24,510 --> 00:08:28,900 in making sure that the stationary distribution exists. 168 00:08:28,900 --> 00:08:33,539 So it guarantees that there's a unique stationary distribution, 169 00:08:33,539 --> 00:08:34,039 s bar. 170 00:08:34,039 --> 00:08:35,770 By the way, I sometimes use the word stable 171 00:08:35,770 --> 00:08:36,820 and sometimes stationary. 172 00:08:36,820 --> 00:08:38,881 They're kind of synonyms, although I 173 00:08:38,881 --> 00:08:40,714 think officially we should stick to the word 174 00:08:40,714 --> 00:08:43,909 stationary distribution. 175 00:08:43,909 --> 00:08:48,490 As I've mentioned before, when a digraph is strongly connected, 176 00:08:48,490 --> 00:08:50,950 that is a sufficient condition for there 177 00:08:50,950 --> 00:08:54,130 to be a unique stable distribution. 178 00:08:54,130 --> 00:08:57,210 That's actually proved in one of the exercises in the text 179 00:08:57,210 --> 00:09:00,620 at the end of the chapter. 
180 00:09:00,620 --> 00:09:03,770 The super-node mechanism also ensures 181 00:09:03,770 --> 00:09:07,570 something even stronger, that every initial distribution 182 00:09:07,570 --> 00:09:12,780 p converges to the stationary distribution, 183 00:09:12,780 --> 00:09:14,530 to that unique stationary distribution. 184 00:09:14,530 --> 00:09:18,420 Stated precisely mathematically, if you start off 185 00:09:18,420 --> 00:09:21,780 at an arbitrary distribution of probabilities 186 00:09:21,780 --> 00:09:24,350 of being in different states, p, and you 187 00:09:24,350 --> 00:09:28,360 look at what happens to p after t steps-- remember, 188 00:09:28,360 --> 00:09:33,470 that you get by multiplying the vector p by the matrix M raised 189 00:09:33,470 --> 00:09:36,460 to the power t-- and you take the limit as t approaches 190 00:09:36,460 --> 00:09:40,080 infinity, that is to say, what distribution 191 00:09:40,080 --> 00:09:44,580 do you approach as you do more and more updates. 192 00:09:44,580 --> 00:09:46,520 And it turns out that that limit exists, 193 00:09:46,520 --> 00:09:48,399 and it is that stationary distribution. 194 00:09:48,399 --> 00:09:49,940 So it doesn't matter where you start, 195 00:09:49,940 --> 00:09:51,577 you're going to wind up stable. 196 00:09:51,577 --> 00:09:53,660 And as a matter of fact, the convergence is rapid. 197 00:09:53,660 --> 00:09:55,980 What that means is that you can actually 198 00:09:55,980 --> 00:09:58,630 calculate the stable distribution reasonably 199 00:09:58,630 --> 00:10:01,590 quickly, because you don't need a very large t in order 200 00:10:01,590 --> 00:10:03,780 to arrive at a very good approximation 201 00:10:03,780 --> 00:10:06,390 to the stable distribution. 202 00:10:06,390 --> 00:10:09,480 Now the actual Google rank and ranking 203 00:10:09,480 --> 00:10:11,900 is more complicated than just PageRank. 204 00:10:11,900 --> 00:10:15,120 PageRank was the original idea that got a lot of attention. 
205 00:10:15,120 --> 00:10:17,200 And in fact, the latest information from Google 206 00:10:17,200 --> 00:10:20,190 is that they think it gets overattention today 207 00:10:20,190 --> 00:10:23,840 in the modern world by too many commentators and people trying 208 00:10:23,840 --> 00:10:26,420 to simulate ranking. 209 00:10:26,420 --> 00:10:30,070 So the actual rank rules are a closely-held trade secret 210 00:10:30,070 --> 00:10:32,070 for Google-- by Google. 211 00:10:32,070 --> 00:10:34,790 They use text, they use location, they use payments, 212 00:10:34,790 --> 00:10:37,450 because advertisers can pay to have their search 213 00:10:37,450 --> 00:10:41,730 results listed more prominently, and lots of other criteria 214 00:10:41,730 --> 00:10:43,750 that have evolved over 15 years. 215 00:10:43,750 --> 00:10:44,950 And they continue to evolve. 216 00:10:44,950 --> 00:10:47,670 As people find ways to manipulate the ranking, 217 00:10:47,670 --> 00:10:51,730 Google revises its ranking criteria and algorithms. 218 00:10:51,730 --> 00:10:54,140 But nevertheless, PageRank continues 219 00:10:54,140 --> 00:10:58,250 to play a significant role in the whole story.
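Going back to the convergence claim made earlier in the lecture (start from any distribution p, keep applying the update, and you approach the stationary distribution), here is a small sketch of that computation. It is an illustration under assumptions, not Google's implementation: the three-page chain, the `"SUPER"` label, and the names `step` and `stationary` are all made up, and the edge probabilities are written out by hand using the 0.15 restart probability.

```python
# Hypothetical chain with a super-node: each row maps a vertex to the
# probabilities of its outgoing edges, and each row sums to 1.
probs = {
    "a": {"b": 0.85, "SUPER": 0.15},
    "b": {"a": 0.425, "c": 0.425, "SUPER": 0.15},
    "c": {"a": 0.85, "SUPER": 0.15},
    "SUPER": {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3},
}


def step(dist, probs):
    """One update of the walk: new[w] = sum over v of dist[v] * probs[v][w]."""
    new = {v: 0.0 for v in probs}
    for v, mass in dist.items():
        for w, p in probs[v].items():
            new[w] += mass * p
    return new


def stationary(probs, start, tol=1e-12, max_steps=10_000):
    """Repeat the update until the distribution stops changing (within tol).

    Returns the approximate stationary distribution and the step count.
    """
    dist = dict(start)
    for t in range(1, max_steps + 1):
        nxt = step(dist, probs)
        if max(abs(nxt[v] - dist[v]) for v in probs) < tol:
            return nxt, t
        dist = nxt
    raise RuntimeError("did not converge within max_steps")


# Two very different starting distributions.
start_at_a = {"a": 1.0, "b": 0.0, "c": 0.0, "SUPER": 0.0}
start_at_c = {"a": 0.0, "b": 0.0, "c": 1.0, "SUPER": 0.0}
s_from_a, steps_a = stationary(probs, start_at_a)
s_from_c, steps_c = stationary(probs, start_at_c)

# Rank the real pages by their stationary probability, highest first.
ranking = sorted(("a", "b", "c"), key=lambda v: s_from_a[v], reverse=True)
```

Both runs should end at essentially the same distribution. Note that the super-node gets some probability too, so in this sketch only the real pages are compared when ranking.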