HITS algorithm
Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. It was a precursor to PageRank.
The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information they held, but were used as compilations of a broad catalog of information that led users directly to other authoritative pages. In other words, a good hub represented a page that pointed to many other pages, and a good authority represented a page that was linked to by many different hubs.

The scheme therefore assigns two scores for each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.

In Journals

Formerly, many methods were used for ranking the importance of scientific journals. One such method was Garfield's impact factor.
However, journals such as Science and Nature are filled with numerous citations, giving them very high impact factors. Thus, when comparing two more obscure journals that have received roughly the same number of citations, but one of which has received many citations from Science and Nature, that journal should be ranked higher. In other words, it is better to receive citations from an important journal than from an unimportant one.

On the Web

This phenomenon also occurs on the Internet. Counting the number of links to a page can give us a general estimate of its prominence on the Web, but a page with very few incoming links may also be prominent, if two of these links come from the home pages of Yahoo!, Google, or MSN.
Thus, because these sites are of very high importance but are also search engines, a page they link to may rank highly while being very irrelevant to the query.

Algorithm

In the HITS algorithm, the first step is to retrieve the set of results for the search query. The computation is performed only on this result set, not across all Web pages.
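
As a sketch of this step (hypothetical; the graph representation and the matches predicate are illustrative assumptions, not part of the article), one might keep only the pages that match the query, along with the links among them, before running HITS:

# Hypothetical sketch: build the query-specific subgraph for HITS.
# `pages` maps each page to the set of pages it links to; `matches`
# is any text-relevance predicate -- both are assumptions here.
def result_set(pages, query, matches):
    keep = {p for p in pages if matches(p, query)}
    # Restrict every page's outgoing links to pages in the result set.
    return {p: {q for q in pages[p] if q in keep} for p in keep}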

Authority and hub values are defined in terms of one another in a mutual recursion.
An authority value is computed as the sum of the scaled hub values of the pages that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked pages.

The algorithm performs a series of iterations, each consisting of two basic steps:
  • Authority Update: Update each node's Authority score to be equal to the sum of the Hub Scores of each node that points to it. That is, a node is given a high authority score by being linked to by pages that are recognized as Hubs for information.
  • Hub Update: Update each node's Hub Score to be equal to the sum of the Authority Scores of each node that it points to. That is, a node is given a high hub score by linking to nodes that are considered to be authorities on the subject.


The Hub score and Authority score for a node are calculated with the following algorithm:
  • Start with each node having a hub score and authority score of 1.
  • Run the Authority Update Rule
  • Run the Hub Update Rule
  • Normalize the values by dividing each Hub score by the square root of the sum of the squares of all Hub scores, and dividing each Authority score by the square root of the sum of the squares of all Authority scores.
  • Repeat from the second step as necessary.
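
As a worked illustration of these steps (the toy graph here is our own, not from the original article), consider three pages A, B, and C with links A→C, B→A, and B→C, and all scores initialized to 1. The Authority Update gives auth(A) = hub(B) = 1, auth(B) = 0, and auth(C) = hub(A) + hub(B) = 2. The Hub Update then gives hub(A) = auth(C) = 2, hub(B) = auth(A) + auth(C) = 3, and hub(C) = 0. Normalizing divides the authority scores by √(1² + 0² + 2²) = √5 and the hub scores by √(2² + 3² + 0²) = √13, after which B (which links to both other pages) is the strongest hub and C (linked to by both other pages) is the strongest authority.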


HITS, like Page and Brin's PageRank, is an iterative algorithm based on the linkage of the documents on the web. However, it does have some major differences:
  • It is executed at query time, not at indexing time, with the associated hit on performance that accompanies query-time processing. Thus, the hub and authority scores assigned to a page are query-specific.
  • It is not commonly used by search engines. (Though a similar algorithm was said to be used by Teoma, which was acquired by Ask.com.)
  • It computes two scores per document, hub and authority, as opposed to a single score.
  • It is processed on a small subset of ‘relevant’ documents, not on all documents as is the case with PageRank.

In detail

To begin the ranking, we set $\mathrm{auth}(p) = 1$ and $\mathrm{hub}(p) = 1$ for each page $p$. We consider two types of updates: the Authority Update Rule and the Hub Update Rule. To calculate the hub/authority scores of each node, repeated iterations of the Authority Update Rule and the Hub Update Rule are applied. A k-step application of the Hub-Authority algorithm entails applying, k times, first the Authority Update Rule and then the Hub Update Rule.

Authority Update Rule

For each page $p$, we update $\mathrm{auth}(p)$ to be:

$$\mathrm{auth}(p) = \sum_{i=1}^{n} \mathrm{hub}(i)$$

where $n$ is the total number of pages that link to $p$ and $i$ is such a page. That is, the Authority score of a page is the sum of the Hub scores of all the pages that point to it.

Hub Update Rule

For each page $p$, we update $\mathrm{hub}(p)$ to be:

$$\mathrm{hub}(p) = \sum_{i=1}^{n} \mathrm{auth}(i)$$

where $n$ is the total number of pages $p$ links to and $i$ is such a page. Thus a page's Hub score is the sum of the Authority scores of all the pages it links to.

Normalization

The final hub-authority scores of nodes are determined after infinitely many repetitions of the algorithm. Since directly and iteratively applying the Hub Update Rule and the Authority Update Rule leads to diverging values, it is necessary to normalize the scores after every iteration; the values obtained from this process then converge.
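
A standard linear-algebra reading of this (not spelled out in the article, but a well-known property of HITS): if $A$ is the adjacency matrix of the subgraph, with $A_{ij} = 1$ when page $i$ links to page $j$, then the update rules can be written as $a \leftarrow A^{\mathsf{T}} h$ and $h \leftarrow A a$, so a full iteration performs $a \leftarrow A^{\mathsf{T}} A\, a$ and $h \leftarrow A A^{\mathsf{T}} h$. The normalized iteration is therefore power iteration, and the authority and hub vectors converge to the principal eigenvectors of $A^{\mathsf{T}} A$ and $A A^{\mathsf{T}}$ respectively.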

Pseudocode

G := set of pages
for each page p in G do
    p.auth = 1 // p.auth is the authority score of the page p
    p.hub = 1 // p.hub is the hub score of the page p

function HubsAndAuthorities(G)
    for step from 1 to k do // run the algorithm for k steps
        norm = 0
        for each page p in G do // update all authority values first
            p.auth = 0
            for each page q in p.incomingNeighbors do // p.incomingNeighbors is the set of pages that link to p
                p.auth += q.hub
            norm += square(p.auth) // calculate the sum of the squared auth values to normalise
        norm = sqrt(norm)
        for each page p in G do
            p.auth = p.auth / norm // normalise the auth values
        norm = 0
        for each page p in G do // then update all hub values
            p.hub = 0
            for each page r in p.outgoingNeighbors do // p.outgoingNeighbors is the set of pages that p links to
                p.hub += r.auth
            norm += square(p.hub) // calculate the sum of the squared hub values to normalise
        norm = sqrt(norm)
        for each page p in G do
            p.hub = p.hub / norm // normalise the hub values

The hub and authority values converge in the pseudocode above.
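
For concreteness, the following is a minimal Python translation of the converging pseudocode above (a sketch under assumptions: the graph is represented as a dict mapping each page to the set of pages it links to, and the helper name hits is ours, not from the article):

from math import sqrt

def hits(graph, k=20):
    """Run k normalized HITS iterations on `graph`, a dict mapping
    each page to the set of pages it links to."""
    pages = list(graph)
    # Derive each page's incoming neighbors once from the outgoing links.
    incoming = {p: set() for p in pages}
    for p, outs in graph.items():
        for q in outs:
            incoming[q].add(p)
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(k):  # run the algorithm for k steps
        # Authority update: sum the hub scores of pages linking in.
        auth = {p: sum(hub[q] for q in incoming[p]) for p in pages}
        norm = sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {p: a / norm for p, a in auth.items()}  # normalise
        # Hub update: sum the authority scores of pages linked to.
        hub = {p: sum(auth[r] for r in graph[p]) for p in pages}
        norm = sqrt(sum(h * h for h in hub.values())) or 1.0
        hub = {p: h / norm for p, h in hub.items()}  # normalise
    return auth, hub

# Toy graph from the worked illustration in the Algorithm section.
graph = {"A": {"C"}, "B": {"A", "C"}, "C": set()}
auth, hub = hits(graph)
# C receives the highest authority score; B the highest hub score.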

The pseudocode below does not converge: because it omits normalization, the hub and authority values grow without bound, and it is therefore necessary to limit the number of steps the algorithm runs for. The way around this is to normalize the hub and authority values after each step, dividing each authority value by the square root of the sum of the squares of all authority values, and each hub value by the square root of the sum of the squares of all hub values; this is what the pseudocode above does.

Non-converging pseudocode

G := set of pages
for each page p in G do
    p.auth = 1 // p.auth is the authority score of the page p
    p.hub = 1 // p.hub is the hub score of the page p

function HubsAndAuthorities(G)
    for step from 1 to k do // run the algorithm for k steps
        for each page p in G do // update all authority values first
            p.auth = 0
            for each page q in p.incomingNeighbors do // p.incomingNeighbors is the set of pages that link to p
                p.auth += q.hub
        for each page p in G do // then update all hub values
            p.hub = 0
            for each page r in p.outgoingNeighbors do // p.outgoingNeighbors is the set of pages that p links to
                p.hub += r.auth
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.