Spam mass
Spam mass is defined as "the measure of the impact of link spamming on a page's ranking." The concept was developed by Zoltán Gyöngyi and Hector Garcia-Molina of Stanford University in association with Pavel Berkhin and Jan Pedersen of Yahoo!. The paper introducing the concept expands upon their proposed TrustRank methodology.

The researchers developed a good core and a bad core of selected Web documents, from which they measured spam mass across a collection of documents. Two types of measurement, absolute mass and relative mass, are used to compare groups of documents. The higher the mass measurements, the more likely the documents are to be spam.
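
The two measurements can be illustrated with a short Python sketch. The article does not give formulas, so the definitions here are assumptions: absolute mass is taken as the part of a document's PageRank that is not attributed to the good core, and relative mass as that part divided by the document's total PageRank. The function name, document names, and scores are hypothetical.

```python
def spam_mass(pagerank, good_contribution):
    """Return {doc: (absolute_mass, relative_mass)}.

    Assumed definitions (not stated in the article):
      absolute mass = PageRank minus the share attributed to the good core
      relative mass = absolute mass / PageRank
    """
    masses = {}
    for doc, p in pagerank.items():
        good = good_contribution.get(doc, 0.0)
        absolute = p - good
        # Guard against division by zero for documents with no PageRank.
        relative = absolute / p if p > 0 else 0.0
        masses[doc] = (absolute, relative)
    return masses


# Hypothetical scores: both documents have the same PageRank,
# but very different shares coming from the good core.
pagerank = {"a.example": 8.0, "b.example": 8.0}
good_contribution = {"a.example": 6.0, "b.example": 1.0}

masses = spam_mass(pagerank, good_contribution)
for doc, (absolute, relative) in masses.items():
    print(doc, absolute, relative)
```

With these made-up numbers, the two documents have identical PageRank, yet the second receives most of its score from outside the good core and so has the higher absolute and relative mass.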

Thresholds

A threshold value is used to identify groups of documents as spam: if their relative mass exceeds the threshold, the documents are considered to be spam. A second threshold is applied to the PageRank values of the selected documents, so that only documents with high PageRank are labelled as spam.
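
The two-threshold rule can be sketched in Python as follows. The threshold values, the function name, and the sample documents are illustrative assumptions and are not taken from the article. Treating the PageRank cut-off as inclusive is also an assumption.

```python
def label_spam(docs, mass_threshold, pagerank_threshold):
    """Return the sorted names of documents flagged as spam.

    docs maps a document name to (pagerank, relative_mass).
    A document is flagged only if its PageRank is high enough AND
    its relative mass exceeds the mass threshold.
    """
    return sorted(
        name
        for name, (pr, rel_mass) in docs.items()
        if pr >= pagerank_threshold and rel_mass > mass_threshold
    )


# Hypothetical documents: (PageRank, relative mass)
docs = {
    "a.example": (8.0, 0.25),   # high PageRank, low relative mass
    "b.example": (8.0, 0.875),  # high PageRank, high relative mass
    "c.example": (0.5, 0.9),    # high relative mass, but low PageRank
}

flagged = label_spam(docs, mass_threshold=0.5, pagerank_threshold=1.0)
print(flagged)
```

In this example only the second document is flagged. The third has a high relative mass but falls below the PageRank threshold, so it is not labelled.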

The purpose of the methodology is to identify spam documents with artificially inflated PageRank values.
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 