Document clustering
Document clustering is closely related to the concept of data clustering. It is a more specific technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering.
A web search engine often returns thousands of pages in response to a broad query, making it difficult for users to browse or to identify relevant information. Clustering methods can be used to automatically group the retrieved documents into a list of meaningful categories, as is done by enterprise search engines such as Northern Light and Vivisimo, consumer search engines such as PolyMeta and Helioid, and open source software such as Carrot2.
Example:
FirstGov.gov, the official Web portal for the U.S. government, uses document clustering to automatically organize its search results into categories. For example, if a user submits the query “immigration”, they will see, next to the list of results, categories such as “Immigration Reform”, “Citizenship and Immigration Services”, “Employment”, “Department of Homeland Security”, and more. Probabilistic Latent Semantic Analysis (PLSA) can also be used to perform document clustering.
Document clustering involves the use of descriptors and descriptor extraction. Descriptors are sets of words that describe the contents of a cluster. Document clustering is generally considered to be a centralized process. A typical example is the clustering of web documents for search users.
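Descriptor extraction can be illustrated with a small sketch. The scoring scheme below (term frequency within a cluster, weighted by how few clusters contain the term) is an assumption chosen for illustration, not a method prescribed by any particular clustering engine; the function names `tokenize` and `cluster_descriptors` and the sample documents are likewise hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cluster_descriptors(clusters, top_n=3):
    """Return the top_n highest-scoring words for each cluster.

    clusters: list of clusters, each a list of document strings.
    Score (an illustrative assumption): term count inside the cluster
    times log(number of clusters / number of clusters containing the term),
    so a word that appears in every cluster scores zero.
    """
    # Word counts per cluster, pooled over all documents in that cluster.
    counts = [Counter(t for doc in docs for t in tokenize(doc))
              for docs in clusters]
    # In how many clusters does each word occur at least once?
    cluster_freq = Counter(term for c in counts for term in c)
    n = len(clusters)

    descriptors = []
    for c in counts:
        scores = {t: tf * math.log(n / cluster_freq[t]) for t, tf in c.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        descriptors.append(ranked[:top_n])
    return descriptors


# Hypothetical clusters of short documents.
clusters = [
    ["visa application forms", "visa renewal and visa fees"],
    ["tax return forms", "income tax deadlines and tax credits"],
]
print(cluster_descriptors(clusters, top_n=2))
# → [['visa', 'application'], ['tax', 'credits']]
```

Words shared by both clusters, such as "forms" and "and", receive a score of zero and so rank last, which is why the descriptors tend to be the terms that distinguish one cluster from the others.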
Applications of document clustering fall into two types: online and offline. Online applications are usually more constrained by efficiency requirements than offline applications.
In general, there are two common families of algorithms. The first is hierarchical clustering, which includes single linkage, complete linkage, group average, and Ward's method. By repeatedly merging (agglomerative) or splitting (divisive) clusters, documents are organized into a hierarchical structure, which is suitable for browsing. However, such algorithms usually suffer from efficiency problems. The second family is based on the K-means algorithm and its variants. These algorithms are usually more efficient, but are generally considered less accurate than hierarchical algorithms.
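The two families can be contrasted with a minimal sketch. It assumes documents have already been converted to term-count vectors, uses cosine distance for the agglomerative variant and Euclidean distance for K-means, and seeds K-means with the first k vectors purely for reproducibility; all of these are illustrative choices, and Ward's method is not shown. The function names and the toy vectors are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def agglomerative(vectors, k, linkage="average"):
    """Merge the two closest clusters until k remain.

    linkage: "single" (closest pair), "complete" (farthest pair) or
    "average" (group average). Returns clusters as sorted index lists.
    This naive version recomputes all pairwise distances on every merge,
    which illustrates why hierarchical methods are costly.
    """
    combine = {"single": min, "complete": max,
               "average": lambda ds: sum(ds) / len(ds)}[linkage]
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [cosine_distance(vectors[p], vectors[q])
                         for p in clusters[i] for q in clusters[j]]
                d = combine(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


def kmeans(vectors, k, iterations=20):
    """Lloyd's algorithm; returns one cluster label per vector.

    Assumption: the first k vectors are used as initial centroids.
    Real implementations usually choose seeds more carefully.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical term counts over the vocabulary (visa, passport, tax, income).
docs = [
    [2, 1, 0, 0],  # document 0: immigration
    [0, 0, 2, 1],  # document 1: taxation
    [1, 2, 0, 0],  # document 2: immigration
    [0, 0, 1, 2],  # document 3: taxation
]
print(agglomerative(docs, k=2, linkage="single"))  # → [[0, 2], [1, 3]]
print(kmeans(docs, k=2))                           # → [0, 1, 0, 1]
```

On this toy input both approaches recover the same two groups. The agglomerative version also yields the intermediate merges, which is what makes the hierarchy browsable, while K-means needs only a few passes over the data.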
Other algorithms include graph-based clustering, ontology-supported clustering, and order-sensitive clustering.
Further reading
Publications:
- Nicholas O. Andrews and Edward A. Fox. Recent Developments in Document Clustering. October 16, 2007. http://eprints.cs.vt.edu/archive/00001000/01/docclust.pdf
- Claudio Carpineto, Stanislaw Osiński, Giovanni Romano, Dawid Weiss. A survey of Web clustering engines. ACM Computing Surveys (CSUR), Volume 41, Issue 3 (July 2009), Article No. 17, ISSN:0360-0300