PageRank
PageRank is a link analysis algorithm, named after Larry Page and used by the Google Internet search engine, that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is referred to as the PageRank of E and denoted by PR(E).
The name "PageRank" is a trademark of Google, and the PageRank process has been patented. However, the patent is assigned to Stanford University and not to Google. Google has exclusive license rights on the patent from Stanford University. The university received 1.8 million shares of Google in exchange for use of the patent; the shares were sold in 2005 for US$336 million.
Description
A PageRank results from a mathematical algorithm based on the webgraph, created by all World Wide Web pages as nodes and hyperlinks as edges, taking into consideration authority hubs such as cnn.com or usa.gov. The rank value indicates the importance of a particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page, there is no support for that page.
Numerous academic papers concerning PageRank have been published since Page and Brin's original paper. In practice, the PageRank concept has proven to be vulnerable to manipulation, and extensive research has been devoted to identifying falsely inflated PageRank and ways to ignore links from documents with falsely inflated PageRank.
Other link-based ranking algorithms for Web pages include the HITS algorithm invented by Jon Kleinberg (used by Teoma and now Ask.com), the IBM CLEVER project, and the TrustRank algorithm.
History
PageRank was developed at Stanford University by Larry Page (hence the name PageRank) and Sergey Brin as part of a research project about a new kind of search engine. Sergey Brin had the idea that information on the web could be ordered in a hierarchy by "link popularity": a page ranks higher the more links there are to it. The first paper about the project, describing PageRank and the initial prototype of the Google search engine, was co-authored by Rajeev Motwani and Terry Winograd and published in 1998; shortly after, Page and Brin founded Google Inc., the company behind the Google search engine. While just one of many factors that determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools.
PageRank has been influenced by citation analysis, developed in the 1950s by Eugene Garfield at the University of Pennsylvania, and by Hyper Search, developed by Massimo Marchiori at the University of Padua. In the same year PageRank was introduced (1998), Jon Kleinberg published his important work on HITS. Google's founders cite Garfield, Marchiori, and Kleinberg in their original paper.
A small search engine called "RankDex" from IDD Information Services, designed by Robin Li, had already been exploring a similar strategy for site scoring and page ranking since 1996. The technology in RankDex was patented by 1999 and used later when Li founded Baidu in China. Li's work is referenced by some of Larry Page's U.S. patents for his Google search methods.
Algorithm
PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called "iterations", through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.
Simplified algorithm
Assume a small universe of four web pages: A, B, C and D. The initial approximation of PageRank would be evenly divided between these four documents, so each document would begin with an estimated PageRank of 0.25. (In the original form of PageRank, initial values were simply 1, which meant that the sum over all pages was the total number of pages on the web at that time. Later versions of PageRank, shown in the formulas below, assume a probability distribution between 0 and 1; hence the initial value of 0.25 here.)
If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A. All PageRank PR in this simplistic system would thus gather to A because all links would be pointing to A.
That is, PR(A) = PR(B) + PR(C) + PR(D) = 0.75.
Suppose that page B has a link to page C as well as to page A, while page D has links to all three pages. The value of the link votes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D's PageRank is counted for A's PageRank (approximately 0.083).
In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the normalized number of outbound links L (it is assumed that links to specific URLs only count once per document).
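This vote splitting can be sketched in a few lines of Python. The link structure is the one assumed in the running example (with C linking only to A); page names are illustrative:

```python
# One pass of the simplified vote splitting described above.
# Assumed link structure: B -> A, C;  C -> A;  D -> A, B, C;  A has no outlinks.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pr = {page: 0.25 for page in links}   # every page starts at 0.25

votes = {page: 0.0 for page in links}
for page, outbound in links.items():
    for target in outbound:
        # each outbound link carries an equal share of the page's PageRank
        votes[target] += pr[page] / len(outbound)

print(round(votes["A"], 3))  # 0.125 + 0.25 + 0.083 -> 0.458
```

Page A collects half of B's score, all of C's, and a third of D's, matching the arithmetic in the text.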
In the general case, the PageRank value for any page u can be expressed as:

PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L(v)},

i.e. the PageRank value for a page u is dependent on the PageRank values for each page v in the set B_u (the set of all pages linking to page u), divided by the number L(v) of links from page v.
Damping factor
The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor will be set around 0.85.
The damping factor is subtracted from 1 (and in some variations of the algorithm, the result is divided by the number of documents N in the collection), and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores. That is,

PR(A) = \frac{1 - d}{N} + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right)
So any page's PageRank is derived in large part from the PageRanks of other pages. The damping factor adjusts the derived value downward. The original paper, however, gave the following formula, which has led to some confusion:

PR(A) = 1 - d + d \left( \frac{PR(B)}{L(B)} + \frac{PR(C)}{L(C)} + \frac{PR(D)}{L(D)} + \cdots \right)
The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank gets multiplied by N and the sum becomes N. A statement in Page and Brin's paper that "the sum of all PageRanks is one" and claims by other Google employees support the first variant of the formula above.
To be more specific, in the latter formula the probability of the random surfer reaching a page is weighted by the total number of web pages. In this version PageRank is an expected value for the random surfer visiting a page, when he restarts the procedure as many times as the web has pages: if the web had 100 pages and a page had a PageRank value of 2, the random surfer would reach that page on average twice if he restarted 100 times. Fundamentally, the two formulas do not differ: a PageRank calculated with the former formula must be multiplied by the total number of web pages to obtain the corresponding PageRank from the latter formula. Even Page and Brin mixed up the two formulas in their most popular paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine", where they claim the latter formula forms a probability distribution over web pages with the sum of all pages' PageRanks being one.
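The factor-of-N relationship between the two formulas can be checked numerically. The small sink-free graph below is purely an illustrative assumption:

```python
# Comparing the two damping-factor formulas on a small sink-free graph.
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["C"]}
d = 0.85
N = len(links)

def iterate(pr, base):
    # base is (1 - d)/N for the probability form, (1 - d) for the paper's form
    return {
        page: base + d * sum(pr[v] / len(links[v])
                             for v in links if page in links[v])
        for page in links
    }

prob = {p: 1.0 / N for p in links}   # initial values sum to 1
scaled = {p: 1.0 for p in links}     # initial values sum to N
for _ in range(100):
    prob = iterate(prob, (1 - d) / N)
    scaled = iterate(scaled, 1 - d)

# The two fixed points differ by exactly a factor of N.
print({p: round(scaled[p] / prob[p]) for p in links})  # {'A': 4, 'B': 4, 'C': 4, 'D': 4}
```

Because the update rule is linear, the two runs stay in exact ratio N at every iteration, not just at convergence.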
Google recalculates PageRank scores each time it crawls the Web and rebuilds its index. As Google increases the number of documents in its collection, the initial approximation of PageRank decreases for all documents.
The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain in which the states are pages, and the transitions, which are all equally probable, are the links between pages.
If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, he picks another URL at random and continues surfing again.
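The random-surfer model, including the random jump out of sinks, can be simulated directly. The graph and the choice d = 0.85 below are illustrative assumptions:

```python
import random

# Monte Carlo sketch of the random surfer: follow an outbound link with
# probability d, otherwise (or whenever a sink is reached) jump to a
# uniformly random page.
random.seed(0)
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pages = list(links)
d = 0.85

steps = 100_000
visits = {p: 0 for p in pages}
page = "A"
for _ in range(steps):
    visits[page] += 1
    if links[page] and random.random() < d:
        page = random.choice(links[page])   # click a link on the current page
    else:
        page = random.choice(pages)         # bored surfer or sink: random jump

# Visit frequencies approximate the PageRank values.
freq = {p: visits[p] / steps for p in pages}
```

With enough steps, page A (which every other page links to) dominates the visit counts, as the theory predicts.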
When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection. Their PageRank scores are therefore divided evenly among all other pages. In other words, to be fair with pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability usually set to d = 0.85, estimated from the frequency with which an average surfer uses his or her browser's bookmark feature.
So, the equation is as follows:

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j)}{L(p_j)}

where p_1, p_2, \ldots, p_N are the pages under consideration, M(p_i) is the set of pages that link to p_i, L(p_j) is the number of outbound links on page p_j, and N is the total number of pages.
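A minimal sketch of this equation in Python, with sink pages spreading their score evenly over all pages as described above (the graph and page names are illustrative assumptions):

```python
# Iterative PageRank with damping; sinks distribute their mass uniformly.
def pagerank(links, d=0.85, iterations=100):
    pages = list(links)
    N = len(pages)
    pr = {p: 1.0 / N for p in pages}    # start from the uniform distribution
    for _ in range(iterations):
        # PageRank mass held by sink pages is spread evenly over all pages.
        sink_mass = sum(pr[p] for p in pages if not links[p])
        pr = {
            p: (1 - d) / N
               + d * (sum(pr[q] / len(links[q]) for q in pages if p in links[q])
                      + sink_mass / N)
            for p in pages
        }
    return pr

# Illustrative graph: B links to A and C, C to A, D to A, B and C; A is a sink.
ranks = pagerank({"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]})
```

Because the sink mass is redistributed, the scores remain a probability distribution (they sum to 1) at every iteration.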
The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix. This makes PageRank a particularly elegant metric: the eigenvector is

R = \begin{bmatrix} PR(p_1) \\ PR(p_2) \\ \vdots \\ PR(p_N) \end{bmatrix}

where R is the solution of the equation

R = \frac{1 - d}{N} \mathbf{1} + d \, \ell R

where \mathbf{1} is the column vector of length N containing only ones and \ell is the matrix whose entries are given by the adjacency function \ell(p_i, p_j), which is 0 if page p_j does not link to p_i, and normalized such that, for each j,

\sum_{i=1}^{N} \ell(p_i, p_j) = 1,

i.e. the elements of each column sum up to 1, so the matrix is a stochastic matrix (for more details see the computation section below). Thus this is a variant of the eigenvector centrality measure used commonly in network analysis.
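The stochastic-matrix view can be sketched in plain Python. The graph is the illustrative one used above; the sink page A is treated as linking to every page so that each column of the matrix sums to 1:

```python
# Build the column-stochastic matrix l and power-iterate on
# d*l + (1-d)/N to obtain the PageRank eigenvector.
pages = ["A", "B", "C", "D"]
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
N, d = len(pages), 0.85

# l[i][j] = 1/L(p_j) if p_j links to p_i (sinks link to every page).
l = [[0.0] * N for _ in range(N)]
for j, source in enumerate(pages):
    out = links[source] or pages
    for target in out:
        l[pages.index(target)][j] = 1.0 / len(out)

# Each column sums to 1, so l is a stochastic matrix.
for j in range(N):
    assert abs(sum(l[i][j] for i in range(N)) - 1.0) < 1e-12

# Power iteration converges to the dominant eigenvector R.
R = [1.0 / N] * N
for _ in range(100):
    R = [sum((d * l[i][j] + (1 - d) / N) * R[j] for j in range(N))
         for i in range(N)]
```

The modified matrix has column sums of exactly 1, so the iteration preserves the total probability mass and R remains a distribution.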
Because of the large eigengap of the modified adjacency matrix above, the values of the PageRank eigenvector are fast to approximate (only a few iterations are needed).
As a result of Markov theory, it can be shown that the PageRank of a page is the probability of being at that page after a large number of clicks. This happens to equal t^{-1}, where t is the expectation of the number of clicks (or random jumps) required to get from the page back to itself.
The main disadvantage is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages, such as Wikipedia). The Google Directory (itself a derivative of the Open Directory Project) allows users to see results sorted by PageRank within categories. The Google Directory is the only service offered by Google where PageRank directly determines display order. In Google's other search services (such as its primary Web search), PageRank is used to weight the relevance scores of pages shown in search results.
Several strategies have been proposed to accelerate the computation of PageRank.
Various strategies to manipulate PageRank have been employed in concerted efforts to improve search result rankings and monetize advertising links. These strategies have severely impacted the reliability of the PageRank concept, which seeks to determine which documents are actually highly valued by the Web community.
Google is known to penalize link farms and other schemes designed to artificially inflate PageRank. In December 2007 Google started actively penalizing sites selling paid text links. How Google identifies link farms and other PageRank manipulation tools is among Google's trade secrets.
Computation
PageRank can be computed iteratively; the iterative method can be viewed as the power iteration method, or power method. The basic mathematical operations performed in the iterative method and the power method are identical.
At each time step, the computation, as detailed above, yields

PR(p_i; t+1) = \frac{1 - d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j; t)}{L(p_j)},

or in matrix notation,

R(t+1) = d \, \mathcal{M} R(t) + \frac{1 - d}{N} \mathbf{1}, \qquad (*)

where R_i(t) = PR(p_i; t) and \mathbf{1} is the column vector of length N containing only ones. The matrix \mathcal{M} is defined by \mathcal{M}_{ij} = 1/L(p_j) if page p_j links to page p_i, and \mathcal{M}_{ij} = 0 otherwise.
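The update (*) can be sketched as follows; the small sink-free graph is an illustrative assumption:

```python
# The matrix update R(t+1) = d*M*R(t) + ((1 - d)/N)*1 in plain Python.
# M[i][j] = 1/L(p_j) if page p_j links to page p_i, else 0.
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["C"]}
pages = list(links)
N, d = len(pages), 0.85

M = [[0.0] * N for _ in range(N)]
for j, source in enumerate(pages):
    for target in links[source]:
        M[pages.index(target)][j] = 1.0 / len(links[source])

R = [1.0 / N] * N                      # R(0): the uniform starting vector
for _ in range(100):
    R = [d * sum(M[i][j] * R[j] for j in range(N)) + (1 - d) / N
         for i in range(N)]
```

With no sink pages, every column of M sums to 1, so each step preserves the total probability mass; page D, which nothing links to, settles at exactly (1 - d)/N.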
Larry Page
Lawrence "Larry" Page is an American computer scientist and internet entrepreneur who, with Sergey Brin, is best known as the cofounder of Google. As of April 4, 2011, he is also the chief executive of Google, as announced on January 20, 2011...
and used by the Google
Google
Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internetbased services and products, and generates profit primarily from advertising through its AdWords program...
Internet search engine
Search engine
A search engine is an information retrieval system designed to help find information stored on a computer system. The search results are usually presented in a list and are commonly called hits. Search engines help to minimize the time required to find information and the amount of information...
, that assigns a numerical weighting to each element of a hyperlink
Hyperlink
In computing, a hyperlink is a reference to data that the reader can directly follow, or that is followed automatically. A hyperlink points to a whole document or to a specific element within a document. Hypertext is text with hyperlinks...
ed set
Set (computer science)
In computer science, a set is an abstract data structure that can store certain values, without any particular order, and no repeated values. It is a computer implementation of the mathematical concept of a finite set...
of documents, such as the World Wide Web
World Wide Web
The World Wide Web is a system of interlinked hypertext documents accessed via the Internet...
, with the purpose of "measuring" its relative importance within the set. The algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of welldefined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
may be applied to any collection of entities with reciprocal
Reciprocal
In mathematics:*Multiplicative inverse, in mathematics, the number 1/x, which multiplied by x gives the product 1, also known as a reciprocal*Reciprocal rule, a technique in calculus for calculating derivatives of reciprocal functions...
quotations and references. The numerical weight that it assigns to any given element E is referred to as the PageRank of E and denoted by
The name "PageRank" is a trademark
Trademark
A trademark, trade mark, or trademark is a distinctive sign or indicator used by an individual, business organization, or other legal entity to identify that the products or services to consumers with which the trademark appears originate from a unique source, and to distinguish its products or...
of Google, and the PageRank process has been patent
Patent
A patent is a form of intellectual property. It consists of a set of exclusive rights granted by a sovereign state to an inventor or their assignee for a limited period of time in exchange for the public disclosure of an invention....
ed . However, the patent is assigned to Stanford University
Stanford University
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is a private research university on an campus located near Palo Alto, California. It is situated in the northwestern Santa Clara Valley on the San Francisco Peninsula, approximately northwest of San...
and not to Google. Google has exclusive license rights on the patent from Stanford University. The university received 1.8 million shares of Google in exchange for use of the patent; the shares were sold in 2005 for $
United States dollar
The United States dollar , also referred to as the American dollar, is the official currency of the United States of America. It is divided into 100 smaller units called cents or pennies....
336 million.
Description
A PageRank results from a mathematical algorithm based on the graphGraph (mathematics)
In mathematics, a graph is an abstract representation of a set of objects where some pairs of the objects are connected by links. The interconnected objects are represented by mathematical abstractions called vertices, and the links that connect some pairs of vertices are called edges...
, the webgraph
Webgraph
The webgraph describes the directed links between pages of the World Wide Web. A graph, in general, consists of several vertices, some pairs connected by edges. In a directed graph, edges are directed lines or arcs...
, created by all World Wide Web pages as nodes and hyperlink
Hyperlink
In computing, a hyperlink is a reference to data that the reader can directly follow, or that is followed automatically. A hyperlink points to a whole document or to a specific element within a document. Hypertext is text with hyperlinks...
s as edges, taking into consideration authority hubs such as cnn.com or usa.gov
USA.gov
USA.gov is the official web portal of the United States government. It is designed to improve the public’s interaction with the U.S. government by quickly directing website visitors to the services or information they are seeking, and by inviting the public to share ideas to improve government...
. The rank value indicates an importance of a particular page. A hyperlink to a page counts as a vote of support. The PageRank of a page is defined recursively
Recursion
Recursion is the process of repeating items in a selfsimilar way. For instance, when the surfaces of two mirrors are exactly parallel with each other the nested images that occur are a form of infinite recursion. The term has a variety of meanings specific to a variety of disciplines ranging from...
and depends on the number and PageRank metric of all pages that link to it ("incoming links"). A page that is linked to by many pages with high PageRank receives a high rank itself. If there are no links to a web page there is no support for that page.
Numerous academic papers concerning PageRank have been published since Page and Brin's original paper. In practice, the PageRank concept has proven to be vulnerable to manipulation, and extensive research has been devoted to identifying falsely inflated PageRank and ways to ignore links from documents with falsely inflated PageRank.
Other linkbased ranking algorithms for Web pages include the HITS algorithm
HITS algorithm
HyperlinkInduced Topic Search is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. It was a precursor to PageRank...
invented by Jon Kleinberg
Jon Kleinberg
External links:**** Stephen Ibaraki*Yury Lifshits,...
(used by Teoma
Teoma
Teoma, pronounced chawmuh , was an Internet search engine founded in 2000 by Professor Apostolos Gerasoulis and his colleagues at Rutgers University in New Jersey. Professor Tao Yang from the University of California, Santa Barbara coled technology R&D. Their research grew out of the 1998 DiscoWeb...
and now Ask.com
Ask.com
Ask is a Q&A focused search engine founded in 1996 by Garrett Gruener and David Warthen in Berkeley, California. The original software was implemented by Gary Chevsky from his own design. Warthen, Chevsky, Justin Grant, and others built the early AskJeeves.com website around that core engine...
), the IBM CLEVER project
CLEVER project
The CLEVER project was a research project in Web search led by Jon Kleinberg at IBM's Almaden Research Center.Techniques developed in CLEVER included various forms of link analysis, including the HITS algorithm....
, and the TrustRank
TrustRank
TrustRank is a link analysis technique described in a paper by Stanford University and Yahoo! researchers for semiautomatically separating useful webpages from spam.Many Web spam pages are created only with the intention of misleading search engines...
algorithm.
History
PageRank was developed at Stanford UniversityStanford University
The Leland Stanford Junior University, commonly referred to as Stanford University or Stanford, is a private research university on an campus located near Palo Alto, California. It is situated in the northwestern Santa Clara Valley on the San Francisco Peninsula, approximately northwest of San...
by Larry Page
Larry Page
Lawrence "Larry" Page is an American computer scientist and internet entrepreneur who, with Sergey Brin, is best known as the cofounder of Google. As of April 4, 2011, he is also the chief executive of Google, as announced on January 20, 2011...
(hence the name PageRank) and Sergey Brin
Sergey Brin
Sergey Mikhaylovich Brin is a Russianborn American computer scientist and internet entrepreneur who, with Larry Page, cofounded Google, one of the largest internet companies. , his personal wealth is estimated to be $16.7 billion....
as part of a research project about a new kind of search engine. Sergey Brin had the idea that information on the web could be ordered in a hierarchy by "link popularity": a page is ranked higher as there are more links to it. It was coauthored by Rajeev Motwani
Rajeev Motwani
Rajeev Motwani was a professor of Computer Science at Stanford University whose research focused on theoretical computer science. He was an early advisor and supporter of companies including Google and PayPal, and a special advisor to Sequoia Capital. He was a winner of the Gödel Prize in...
and Terry Winograd
Terry Winograd
Terry Allen Winograd is an American professor of computer science at Stanford University, and codirector of the Stanford HumanComputer Interaction Group...
. The first paper about the project, describing PageRank and the initial prototype of the Google search
Google search
Google or Google Web Search is a web search engine owned by Google Inc. Google Search is the mostused search engine on the World Wide Web, receiving several hundred million queries each day through its various services....
engine, was published in 1998: shortly after, Page and Brin founded Google Inc., the company behind the Google search engine. While just one of many factors that determine the ranking of Google search results, PageRank continues to provide the basis for all of Google's web search tools.
PageRank has been influenced by citation analysis
Citation analysis
Citation analysis is the examination of the frequency, patterns, and graphs of citations in articles and books. It uses citations in scholarly works to establish links to other works or other researchers. Citation analysis is one of the most widely used methods of bibliometrics...
, early developed by Eugene Garfield
Eugene Garfield
Eugene "Gene" Garfield is an American scientist, one of the founders of bibliometrics and scientometrics. He received a PhD in Structural Linguistics from the University of Pennsylvania in 1961. Dr. Garfield was the founder of the Institute for Scientific Information , which was located in...
in the 1950s at the University of Pennsylvania, and by Hyper Search
Hyper Search
Hyper Search has been the first published technique to introduce link analysis for search engines, opening the way for the secondgeneration of search engines...
, developed by Massimo Marchiori
Massimo Marchiori
Massimo Marchiori is an Italian computer scientist who made major contributions to the development of the World Wide Web.Biography:...
at the University of Padua. In the same year PageRank was introduced (1998), Jon Kleinberg
Jon Kleinberg
External links:**** Stephen Ibaraki*Yury Lifshits,...
published his important work on HITS
HITS algorithm
HyperlinkInduced Topic Search is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. It was a precursor to PageRank...
. Google's founders cite Garfield, Marchiori, and Kleinberg in their original paper.
A small search engine called "RankDex" from IDD Information Services designed by Robin Li
Robin Li
Robin Li is a Chinese entrepreneur, cofounder of China's most popular search engine Baidu.Li studied information management at Peking University and the State University of New York, Buffalo. In 2000 he founded Baidu with Eric Xu...
was, since 1996, already exploring a similar strategy for sitescoring and page ranking. The technology in RankDex would be patented by 1999 and used later when Li founded Baidu
Baidu
Baidu, Inc. , simply known as Baidu and incorporated on January 18, 2000, is a Chinese web services company headquartered in the Baidu Campus in Haidian District, Beijing, People's Republic of China....
in China. Li's work would be referenced by some of Larry Page's U.S. patents for his Google search methods.
Algorithm
PageRank is a probability distributionProbability distribution
In probability theory, a probability mass, probability density, or probability distribution is a function that describes the probability of a random variable taking certain values....
used to represent the likelihood that a person randomly clicking on links will arrive at any particular page. PageRank can be calculated for collections of documents of any size. It is assumed in several research papers that the distribution is evenly divided among all documents in the collection at the beginning of the computational process. The PageRank computations require several passes, called "iterations", through the collection to adjust approximate PageRank values to more closely reflect the theoretical true value.
A probability is expressed as a numeric value between 0 and 1. A 0.5 probability is commonly expressed as a "50% chance" of something happening. Hence, a PageRank of 0.5 means there is a 50% chance that a person clicking on a random link will be directed to the document with the 0.5 PageRank.
Simplified algorithm
Assume a small universe of four web pages: A, B, C and D. The initial approximation of PageRank would be evenly divided between these four documents. Hence, each document would begin with an estimated PageRank of 0.25.In the original form of PageRank initial values were simply 1. This meant that the sum of all pages was the total number of pages on the web at that time. Later versions of PageRank (see the formulas below) would assume a probability distribution between 0 and 1. Here a simple probability distribution will be used—hence the initial value of 0.25.
If pages B, C, and D each only link to A, they would each confer 0.25 PageRank to A. All PageRank PR in this simplistic system would thus gather to A because all links would be pointing to A.
This is 0.75.
Suppose that page B has a link to page C as well as to page A, while page D has links to all three pages. The value of the linkvotes is divided among all the outbound links on a page. Thus, page B gives a vote worth 0.125 to page A and a vote worth 0.125 to page C. Only one third of D
In other words, the PageRank conferred by an outbound link is equal to the document's own PageRank score divided by the normalized number of outbound links L (it is assumed that links to specific URLs only count once per document).
In the general case, the PageRank value for any page u can be expressed as:
,
i.e. the PageRank value for a page u is dependent on the PageRank values for each page v out of the set B_{u} (this set contains all pages linking to page u), divided by the number L(v) of links from page v.
Damping factor
The PageRank theory holds that even an imaginary surfer who is randomly clicking on links will eventually stop clicking. The probability, at any step, that the person will continue is a damping factor d. Various studies have tested different damping factors, but it is generally assumed that the damping factor will be set around 0.85.The damping factor is subtracted from 1 (and in some variations of the algorithm, the result is divided by the number of documents (N) in the collection) and this term is then added to the product of the damping factor and the sum of the incoming PageRank scores. That is,
So any page's PageRank is derived in large part from the PageRanks of other pages. The damping factor adjusts the derived value downward. The original paper, however, gave the following formula, which has led to some confusion:
The difference between them is that the PageRank values in the first formula sum to one, while in the second formula each PageRank gets multiplied by N and the sum becomes N. A statement in Page and Brin's paper that "the sum of all PageRanks is one" and claims by other Google employees support the first variant of the formula above.
To be more specific, in the latter formula, the probability for the random surfer reaching a page is weighted by the total number of web pages. So, in this version PageRank is an expected value for the random surfer visiting a page, when he restarts this procedure as often as the web has pages. If the web had 100 pages and a page had a PageRank value of 2, the random surfer would reach that page in an average twice if he restarts 100 times. Basically, the two formulas do not differ fundamentally from each other. A PageRank that has been calculated by using the former formula has to be multiplied by the total number of web pages to get the according PageRank that would have been calculated by using the latter formula. Even Page and Brin mixed up the two formulas in their most popular paper "The Anatomy of a LargeScale Hypertextual Web Search Engine", where they claim the latter formula to form a probability distribution over web pages with the sum of all pages' PageRanks being one.
Google recalculates PageRank scores each time it crawls the Web and rebuilds its index. As Google increases the number of documents in its collection, the initial approximation of PageRank decreases for all documents.
The formula uses a model of a random surfer who gets bored after several clicks and switches to a random page. The PageRank value of a page reflects the chance that the random surfer will land on that page by clicking on a link. It can be understood as a Markov chain
Markov chain
A Markov chain, named after Andrey Markov, is a mathematical system that undergoes transitions from one state to another, between a finite or countable number of possible states. It is a random process characterized as memoryless: the next state depends only on the current state and not on the...
in which the states are pages, and the transitions are all equally probable and are the links between pages.
If a page has no links to other pages, it becomes a sink and therefore terminates the random surfing process. If the random surfer arrives at a sink page, it picks another URL
Uniform Resource Locator
In computing, a uniform resource locator or universal resource locator is a specific character string that constitutes a reference to an Internet resource....
at random and continues surfing again.
When calculating PageRank, pages with no outbound links are assumed to link out to all other pages in the collection. Their PageRank scores are therefore divided evenly among all other pages. In other words, to be fair with pages that are not sinks, these random transitions are added to all nodes in the Web, with a residual probability of usually d = 0.85, estimated from the frequency that an average surfer uses his or her browser's bookmark feature.
So, the equation is as follows:
where are the pages under consideration, is the set of pages that link to , is the number of outbound links on page , and N is the total number of pages.
The PageRank values are the entries of the dominant eigenvector of the modified adjacency matrix
Adjacency matrix
In mathematics and computer science, an adjacency matrix is a means of representing which vertices of a graph are adjacent to which other vertices...
. This makes PageRank a particularly elegant metric: the eigenvector is
where R is the solution of the equation
where the adjacency function is 0 if page does not link to , and normalized such that, for each j
,
i.e. the elements of each column sum up to 1, so the matrix is a stochastic matrix
Stochastic matrix
In mathematics, a stochastic matrix is a matrix used to describe the transitions of a Markov chain. It has found use in probability theory, statistics and linear algebra, as well as computer science...
(for more details see the computation section below). Thus this is a variant of the eigenvector centrality measure commonly used in network analysis.
Because of the large eigengap of the modified adjacency matrix above, the values of the PageRank eigenvector are fast to approximate (only a few iterations are needed).
As a result of Markov theory, it can be shown that the PageRank of a page is the probability of being at that page after a large number of clicks. This happens to equal t^{-1}, where t is the expectation of the number of clicks (or random jumps) required to get from the page back to itself.
The main disadvantage is that it favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing site (a site being a densely connected set of pages, such as Wikipedia
). The Google Directory (itself a derivative of the Open Directory Project
) allows users to see results sorted by PageRank within categories. The Google Directory is the only service offered by Google where PageRank directly determines display order. In Google's other search services (such as its primary Web search) PageRank is used to weight the relevance scores of pages shown in search results.
Several strategies have been proposed to accelerate the computation of PageRank.
Various strategies to manipulate PageRank have been employed in concerted efforts to improve search results rankings and monetize advertising links. These strategies have severely impacted the reliability of the PageRank concept, which seeks to determine which documents are actually highly valued by the Web community.
Google is known to penalize link farms and other schemes designed to artificially inflate PageRank. In December 2007 Google started actively penalizing sites selling paid text links. How Google identifies link farms and other PageRank manipulation tools is among Google's trade secrets.
Computation
To summarize, PageRank can be computed either iteratively or algebraically. The iterative method can be viewed as the power iteration method, or power method. The basic mathematical operations performed in the iterative method and the power method are identical.
Iterative
In the former case, at t = 0, an initial probability distribution is assumed, usually

PR(p_i; 0) = \frac{1}{N}.

At each time step, the computation, as detailed above, yields

PR(p_i; t+1) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{PR(p_j; t)}{L(p_j)},

or in matrix notation

\mathbf{R}(t+1) = d \mathcal{M} \mathbf{R}(t) + \frac{1-d}{N} \mathbf{1}, \qquad (*)

where \mathbf{R}_i(t) = PR(p_i; t) and \mathbf{1} is the column vector of length N containing only ones.

The matrix \mathcal{M} is defined as

\mathcal{M}_{ij} = \begin{cases} 1/L(p_j), & \text{if } p_j \text{ links to } p_i, \\ 0, & \text{otherwise,} \end{cases}

i.e.,

\mathcal{M} := (K^{-1} A)^T,

where A denotes the adjacency matrix of the graph and K is the diagonal matrix with the outdegrees in the diagonal.

The computation ends when for some small \epsilon

|\mathbf{R}(t+1) - \mathbf{R}(t)| < \epsilon,

i.e., when convergence is assumed.
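This iterative scheme might be sketched as follows, assuming a hypothetical four-page link structure with no dangling pages; the values d = 0.85 and ε = 10⁻⁸ are illustrative choices:

```python
import numpy as np

d, eps, N = 0.85, 1e-8, 4
# Hypothetical link structure, stored transposed: A[i, j] = 1 iff page j
# links to page i (so each column j lists the out-links of page j).
A = np.array([[0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
K = A.sum(axis=0)               # out-degrees L(p_j); no zero columns here
M = A / K                       # column-stochastic matrix M_ij = 1/L(p_j)

R = np.full(N, 1.0 / N)         # PR(p_i; 0) = 1/N
while True:
    R_next = d * M @ R + (1 - d) / N   # R(t+1) = d M R(t) + ((1-d)/N) 1
    done = np.abs(R_next - R).sum() < eps
    R = R_next
    if done:                    # |R(t+1) - R(t)| < eps: convergence
        break
```

Because M is column-stochastic, each update preserves the total probability mass, so the converged vector sums to 1.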
Algebraic
In the latter case, for t \to \infty (i.e., in the steady state), the above equation (*) reads

\mathbf{R} = d \mathcal{M} \mathbf{R} + \frac{1-d}{N} \mathbf{1}. \qquad (**)

The solution is given by

\mathbf{R} = (\mathbf{I} - d \mathcal{M})^{-1} \frac{1-d}{N} \mathbf{1},

with the identity matrix \mathbf{I}.

The solution exists and is unique for 0 < d < 1. This can be seen by noting that \mathcal{M} is by construction a stochastic matrix and hence has an eigenvalue equal to one because of the Perron–Frobenius theorem.
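For a small graph the algebraic solution can be obtained directly with a linear solver; a sketch, assuming a hypothetical column-stochastic matrix M for four pages and d = 0.85:

```python
import numpy as np

d, N = 0.85, 4
# Hypothetical four-page link structure as a column-stochastic matrix:
# M[i, j] = 1/L(p_j) if page j links to page i, 0 otherwise.
M = np.array([[0.0, 0.0, 0.5, 0.0],
              [0.5, 0.0, 0.0, 0.0],
              [0.5, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.5, 0.0]])

# R = (I - d M)^{-1} ((1-d)/N) 1, solved as a linear system rather than
# by forming the explicit matrix inverse.
R = np.linalg.solve(np.eye(N) - d * M, np.full(N, (1 - d) / N))
```

Solving the system is numerically preferable to computing the inverse, though either way the direct approach only scales to modest N; the iterative and power methods are what make web-scale computation feasible.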
Power Method
If the matrix \mathcal{M} is a transition probability, i.e., column-stochastic with no columns consisting of just zeros, and \mathbf{R} is a probability distribution (i.e., |\mathbf{R}| = 1, \mathbf{E}\mathbf{R} = \mathbf{1}, where \mathbf{E} is the matrix of all ones), then Eq. (**) is equivalent to

\mathbf{R} = \left( d \mathcal{M} + \frac{1-d}{N} \mathbf{E} \right) \mathbf{R} =: \widehat{\mathcal{M}} \mathbf{R}. \qquad (***)

Hence PageRank \mathbf{R} is the principal eigenvector of \widehat{\mathcal{M}}. A fast and easy way to compute this is using the power method: starting with an arbitrary vector \mathbf{x}(0), the operator \widehat{\mathcal{M}} is applied in succession, i.e.,

\mathbf{x}(t+1) = \widehat{\mathcal{M}} \mathbf{x}(t),

until

|\mathbf{x}(t+1) - \mathbf{x}(t)| < \epsilon.

Note that in Eq. (***) the matrix on the right-hand side in the parenthesis can be interpreted as

\frac{1-d}{N} \mathbf{E} = (1-d) \mathbf{P} \mathbf{1}^t,

where \mathbf{P} is an initial probability distribution. In the current case

\mathbf{P} := \frac{1}{N} \mathbf{1}.

Finally, if \mathcal{M} has columns with only zero values, they should be replaced with the initial probability vector \mathbf{P}. In other words,

\mathcal{M}^\prime := \mathcal{M} + \mathcal{D},

where the matrix \mathcal{D} is defined as

\mathcal{D} := \mathbf{P} \mathbf{D}^t, \quad \text{with} \quad \mathbf{D}_i = \begin{cases} 1, & \text{if } L(p_i) = 0, \\ 0, & \text{otherwise.} \end{cases}

In this case, the above two computations using \mathcal{M}^\prime only give the same PageRank if their results are normalized:

\mathbf{R}_{\text{power}} = \frac{\mathbf{R}_{\text{iterative}}}{|\mathbf{R}_{\text{iterative}}|} = \frac{\mathbf{R}_{\text{algebraic}}}{|\mathbf{R}_{\text{algebraic}}|}.
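A sketch of the power method with this dangling-column fix, assuming a hypothetical four-page graph whose last page has no outbound links (d = 0.85 and the tolerance are illustrative choices):

```python
import numpy as np

d, eps, N = 0.85, 1e-10, 4
# Hypothetical graph in which page 3 is a sink (its column is all zeros).
A = np.array([[0, 0, 1, 0],     # A[i, j] = 1 iff page j links to page i
              [1, 0, 0, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 0]], dtype=float)
K = A.sum(axis=0)                        # out-degrees; K[3] == 0
P = np.full(N, 1.0 / N)                  # initial distribution P = (1/N) 1

# M' := M + D: divide each nonzero column by its out-degree, then
# replace each zero column by the vector P (the outer product P D^t).
M = np.divide(A, K, out=np.zeros_like(A), where=K > 0)
M[:, K == 0] = P[:, None]

M_hat = d * M + (1 - d) / N              # \hat{M} = d M' + ((1-d)/N) E
rng = np.random.default_rng(0)
x = rng.random(N)
x /= x.sum()                             # arbitrary starting distribution
while True:
    x_next = M_hat @ x                   # x(t+1) = \hat{M} x(t)
    done = np.abs(x_next - x).sum() < eps
    x = x_next
    if done:                             # |x(t+1) - x(t)| < eps
        break
```

Since M' and hence \hat{M} are column-stochastic, every iterate remains a probability distribution, and the large eigengap noted earlier means only a few dozen iterations are typically needed at this tolerance.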
PageRank MATLAB/Octave implementation