Multi-document summarization
Multi-document summarization is an automatic procedure aimed at extracting information from multiple texts written about the same topic. The resulting summary report allows individual users, as well as professional information consumers, to quickly familiarize themselves with the information contained in a large cluster of documents. In this way, multi-document summarization systems complement news aggregators, performing the next step in coping with information overload.
Key benefits
Multi-document summarization creates information reports that are both concise and comprehensive. With different opinions put together and outlined, every topic is described from multiple perspectives within a single document.
While the goal of a brief summary is to simplify information search and save time by pointing to the most relevant source documents, a comprehensive multi-document summary should itself contain the required information, limiting the need to access the original files to cases where refinement is required.
Automatic summaries present information extracted from multiple sources algorithmically, without any editorial touch or subjective human intervention, which is meant to keep them free of editorial bias.
Technological challenges
The multi-document summarization task has turned out to be much more complex than summarizing a single document, even a very large one. This difficulty arises from the inevitable thematic diversity within a large set of documents. A good summarization technology aims to combine the main themes with completeness, readability, and conciseness. The Document Understanding Conferences, conducted annually by NIST, have developed sophisticated evaluation criteria for techniques accepting the multi-document summarization challenge.
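As a rough illustration of the task, the following minimal sketch scores sentences by how many source documents share their content words and keeps the top-ranked ones. It is a toy frequency-based extractive approach, not the method of any system or evaluation named in this article; the function names, the stopword list, and the sample documents are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "is", "are",
             "was", "for", "by", "with"}


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    # Lowercased word tokens with stopwords removed.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(documents, max_sentences=2):
    # Pool the sentences of all documents into one candidate list.
    sentences = [s for doc in documents for s in split_sentences(doc)]

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(content_words(doc)))

    def score(sentence):
        # Average document frequency of the sentence's content words.
        words = content_words(sentence)
        return sum(doc_freq[w] for w in words) / len(words) if words else 0.0

    # Highest-scoring sentences first; ties keep their original order.
    return sorted(sentences, key=score, reverse=True)[:max_sentences]


docs = [
    "The storm hit the coast on Monday. Local shops closed early.",
    "A storm hit the coast. Thousands lost power.",
    "The coast was hit by a storm. Schools will reopen soon.",
]

if __name__ == "__main__":
    for sentence in summarize(docs):
        print(sentence)
```

On this toy input the two highest-scoring sentences restate the same event, which hints at why covering diverse themes without repetition is harder across many documents than within one.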
An ideal multi-document summarization system does not simply shorten the source texts but presents information organized around the key aspects, so as to represent a wider diversity of views on the topic. When such quality is achieved, an automatic multi-document summary is perceived more like an overview of a given topic. This implies that such text compilations should also meet the other basic requirements for an overview text compiled by a human. The multi-document summary quality criteria are as follows:
- clear structure, including an outline of the main content, from which it is easy to navigate to the full text sections
- text within sections is divided into meaningful paragraphs
- gradual transition from more general to more specific thematic aspects
- good readability
The last point deserves an additional note: special care is taken to ensure that the automatic overview shows:
- no unrelated "information noise" from the respective documents (e.g., web pages)
- no dangling references to what is not mentioned or explained in the overview
- no text breaks across a sentence
- no semantic redundancy.
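The redundancy requirement can be sketched in a few lines. The example below drops candidate sentences whose word overlap (Jaccard similarity) with an already kept sentence reaches a threshold. Lexical overlap is only a crude stand-in for semantic redundancy, and the function names and the 0.6 threshold are assumptions chosen for illustration, not part of any system described in this article.

```python
import re


def word_set(sentence):
    # Lowercased set of word tokens in the sentence.
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    # Size of the intersection divided by size of the union.
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_redundant(sentences, threshold=0.6):
    # Keep a sentence only if it is sufficiently different from
    # every sentence kept so far (greedy, order-preserving).
    kept = []
    kept_sets = []
    for sentence in sentences:
        words = word_set(sentence)
        if all(jaccard(words, other) < threshold for other in kept_sets):
            kept.append(sentence)
            kept_sets.append(words)
    return kept


candidates = [
    "A storm hit the coast on Monday.",
    "On Monday a storm hit the coast.",
    "Thousands of homes lost power.",
]

if __name__ == "__main__":
    for sentence in drop_redundant(candidates):
        print(sentence)
```

Because the filter is greedy, the first of two near-duplicate sentences wins; a real system would likely also weigh which variant reads better in context.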
Real-life systems
The multi-document summarization technology is now coming of age, a view supported by a choice of advanced web-based systems that are currently available.
- Ultimate Research Assistant - The Ultimate Research Assistant performs text mining on Internet search results to help summarize and organize them and make it easier for the user to perform online research. Specific text mining techniques used by the tool include concept extraction, text summarization, hierarchical concept clustering (e.g., automated taxonomy generation), and various visualization techniques, including tag clouds and mind maps. To use this tool, the user types in the name of a topic, and the tool will search the web for highly relevant resources, and organize the search results into a rich, easy-to-understand research report.
- iResearch Reporter - A commercial text extraction and text summarization system. Its free demo site accepts a user-entered query, passes it on to the Google search engine, retrieves multiple relevant documents, and produces categorized, easily readable natural-language summary reports covering multiple documents in the retrieved set, with all extracts linked to the original documents on the Web. Its tool set covers post-processing, entity extraction, event and relationship extraction, text extraction, extract clustering, linguistic analysis, categorization rules, and text summary construction.
- Newsblaster is a system that helps users find the news that is of the most interest to them. The system automatically collects, clusters, categorizes, and summarizes news from several sites on the web (CNN, Reuters, Fox News, etc.) on a daily basis, and it provides users a user-friendly interface to browse the results.
- NewsInEssence may be used to retrieve and summarize a cluster of articles from the web. It can start from a URL and retrieve documents that are similar, or it can retrieve documents that match a given set of keywords. NewsInEssence also downloads hundreds of news articles daily and produces news clusters from them.
- NewsFeed Researcher is a news portal performing continuous automatic summarization of documents initially clustered by the news aggregators (e.g., Google News). NewsFeed Researcher is backed by the free online engine covering major events related to business, technology, U.S. and international news. This tool is also available in the on-demand mode allowing a user to build a summary on any selected topic.
- Shablast is a universal search engine that produces multi-document summaries from the top 50 results returned by Microsoft's Bing search engine for a set of keywords.
As quality multi-document summaries come to resemble overviews written by a human, one cannot exclude that their use of extracted text snippets may one day face copyright issues. Such a case should be considered in light of the fair use copyright concept.