Text Retrieval Conference
The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity (part of the office of the Director of National Intelligence), and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology.
Each track has a challenge wherein NIST provides participating groups with data sets and test problems. Depending on the track, test problems might be questions, topics, or target extractable features. Uniform scoring is performed so the systems can be fairly evaluated. After evaluation of the results, a workshop provides a place for participants to collect their thoughts and ideas and to present current and future research work.
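As an illustration of what uniform scoring can look like, the sketch below scores two hypothetical systems against the same relevance judgments. The text does not say which measure a given track uses; average precision (and its mean over topics) is assumed here as one common choice in IR evaluation. The topic IDs, document IDs, and function names are all made up for the example, and each ranked list is assumed to contain no duplicate documents.

```python
from typing import Dict, List, Set


def average_precision(ranked_docs: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked list against one topic's relevant set."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_docs, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant doc appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(
    run: Dict[str, List[str]], judgments: Dict[str, Set[str]]
) -> float:
    """Mean of average precision over every judged topic.

    A topic missing from the run is scored as an empty ranking, so every
    system is averaged over the same set of topics.
    """
    if not judgments:
        return 0.0
    total = sum(
        average_precision(run.get(topic, []), relevant)
        for topic, relevant in judgments.items()
    )
    return total / len(judgments)


# Hypothetical judgments and two hypothetical system runs.
judgments = {"t1": {"d1", "d3"}, "t2": {"d5"}}
system_a = {"t1": ["d1", "d2", "d3"], "t2": ["d4", "d5"]}
system_b = {"t1": ["d2", "d1", "d3"], "t2": ["d5"]}

score_a = mean_average_precision(system_a, judgments)
score_b = mean_average_precision(system_b, judgments)
print(f"system A: {score_a:.4f}")
print(f"system B: {score_b:.4f}")
```

Because both runs are scored by the same function over the same topics and judgments, the two numbers are directly comparable, which is the point of scoring all participants uniformly.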
Current Tracks
New tracks are added as new research needs are identified; this list is current for TREC 2011.
- Chemical Track - Goal: to develop and evaluate technology for large scale search in chemistry-related documents, including academic papers and patents, to better meet the needs of professional searchers, and specifically patent searchers and chemists.
- Crowdsourcing Track - Goal: to provide a collaborative venue for exploring crowdsourcing methods both for evaluating search and for performing search tasks. New for 2011.
- Entity Track - Goal: to perform entity-related search on Web data. These search tasks (such as finding entities and properties of entities) address common information needs that are not that well modeled as ad hoc document search.
- Legal Track - Goal: to develop search technology that meets the needs of lawyers to engage in effective discovery in digital document collections.
- Medical Records Track - Goal: to explore methods for searching unstructured information found in patient medical records. New for 2011.
- Microblog Track - Goal: to explore information seeking behavior in microblogs. New for 2011.
- Session Track - Goal: to develop methods for measuring multiple-query sessions where information needs drift or get more or less specific over the session.
- Web Track - Goal: to explore information seeking behaviors common in general web search.
Past Tracks
- Genomics Track - Goal: to study the retrieval of genomic data, not just gene sequences but also supporting documentation such as research papers, lab reports, etc. Last ran in TREC 2007.
- Enterprise Track - Goal: to study search over the data of an organization to complete some task. Last ran in TREC 2008.
- Cross-Language Track - Goal: to investigate the ability of retrieval systems to find documents topically regardless of source language.
- Filtering Track - Goal: to binarily decide retrieval of new incoming documents given a stable information need.
- HARD Track - Goal: to achieve High Accuracy Retrieval from Documents by leveraging additional information about the searcher and/or the search context.
- Interactive Track - Goal: to study user interaction with text retrieval systems.
- Novelty Track - Goal: to investigate systems' abilities to locate new (i.e., non-redundant) information.
- Question Answering Track - Goal: to achieve more information retrieval than just document retrieval by answering factoid, list and definition-style questions.
- Robust Retrieval Track - Goal: to focus on individual topic effectiveness.
- Relevance Feedback Track - Goal: to further deep evaluation of relevance feedback processes.
- Spam Track - Goal: to provide a standard evaluation of current and proposed spam filtering approaches.
- Terabyte Track - Goal: to investigate whether/how the IR community can scale traditional IR test-collection-based evaluation to significantly large collections.
- Video Track - Goal: to conduct research in automatic segmentation, indexing, and content-based retrieval of digital video.
  - In 2003, this track became its own independent evaluation named TRECVID.
Related Events
In 1997, a Japanese counterpart of TREC was launched (first workshop in 1999), called NTCIR (NII Test Collection for IR Systems), and in 2000, a European counterpart was launched, called CLEF (Cross Language Evaluation Forum).
Conference Contributions
TREC claims that within the first six years of the workshops, the effectiveness of retrieval systems approximately doubled. The conference was also the first to hold large-scale evaluations of non-English documents, speech, video and retrieval across languages. Additionally, the challenges have inspired a large body of publications. Technology first developed in TREC is now included in many of the world's commercial search engines. An independent report by RTI International found that "about one-third of the improvement in web search engines from 1999 to 2009 is attributable to TREC. Those enhancements likely saved up to 3 billion hours of time using web search engines. ... Additionally, the report showed that for every $1 that NIST and its partners invested in TREC, at least $3.35 to $5.07 in benefits were accrued to U.S. information retrieval researchers in both the private sector and academia."
While one study suggests that the state of the art for "ad-hoc" search has not advanced substantially in the past decade, that study refers only to search for topically relevant documents in small news and web collections of a few gigabytes. There have been advances in other types of ad-hoc search in the past decade. For example, test collections were created for known-item web search, which found improvements from the use of anchor text, title weighting and URL length, techniques that were not useful on the older ad-hoc test collections. In 2009, a new billion-page web collection was introduced, and spam filtering was found to be a useful technique for ad-hoc web search, unlike in past test collections.
The test collections developed at TREC are useful not just for (potentially) helping researchers advance the state of the art, but also for allowing developers of new (commercial) retrieval products to evaluate their effectiveness on standard tests. In the past decade, TREC has created new tests for enterprise e-mail search, genomics search, spam filtering, e-Discovery, and several other retrieval domains.
TREC systems often provide a baseline for further research. Examples include:
- Hal Varian, Chief Economist at Google, says "Better data makes for better science. The history of information retrieval illustrates this principle well," and describes TREC's contribution.
- TREC's Legal track has influenced the e-Discovery community both in research and in evaluation of commercial vendors.
- The IBM research team building IBM Watson (aka DeepQA), which recently beat the world's best Jeopardy! players, used data and systems from TREC's QA Track as baseline performance measurements.