General Architecture for Text Engineering
Encyclopedia
General Architecture for Text Engineering or GATE is a Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

 suite of tools originally developed at the University of Sheffield
University of Sheffield
The University of Sheffield is a research university based in the city of Sheffield in South Yorkshire, England. It is one of the original 'red brick' universities and is a member of the Russell Group of leading research intensive universities...

 beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for all sorts of natural language processing
Natural language processing
Natural language processing is a field of computer science and linguistics concerned with the interactions between computers and human languages; it began as a branch of artificial intelligence....

 tasks, including information extraction
Information extraction
Information extraction is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language...

 in many languages.

GATE has been compared to NLTK, R
R (programming language)
R is a programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians for developing statistical software, and R is widely used for statistical software development and data analysis....

 and RapidMiner. As well as being widely used in its own right, it forms the basis of the KIM semantic platform.

GATE community and research has been involved in several European research projects including TAO, SEKT, NeOn, Media-Campaign, Musing, Service-Finder, LIRICS and KnowledgeWeb, as well as many other projects.

As of May 28, 2011, 881 people are on the gate-users mailing list at SourceForge.net, and 111,932 downloads from SourceForge
SourceForge
SourceForge Enterprise Edition is a collaborative revision control and software development management system. It provides a front-end to a range of software development lifecycle services and integrates with a number of free software / open source software applications .While originally itself...

 are recorded since the project moved to SourceForge in 2005. The paper "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications" has received over 800 citations in the seven years since publication (according to Google Scholar). Books covering the use of GATE, in addition to the GATE User Guide, include "Building Search Applications: Lucene, LingPipe, and Gate", by Manu Konchady, and "Introduction to Linguistic Annotation and Text Analytics", by Graham Wilcock.

Features

GATE includes an information extraction
Information extraction
Information extraction is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language...

 system called ANNIE (A Nearly-New Information Extraction System) which is a set of modules comprising a tokenizer
Lexical analysis
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner...

, a gazetteer
Gazetteer
A gazetteer is a geographical dictionary or directory, an important reference for information about places and place names , used in conjunction with a map or a full atlas. It typically contains information concerning the geographical makeup of a country, region, or continent as well as the social...

, a sentence splitter
Sentence boundary disambiguation
Sentence boundary disambiguation , also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary...

, a part of speech tagger
Part-of-speech tagging
In corpus linguistics, part-of-speech tagging , also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e...

, a named entities
Named entity recognition
Named-entity recognition is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.Most research on NER...

 transducer and a coreference
Coreference
In linguistics, co-reference occurs when multiple expressions in a sentence or document refer to the same thing; or in linguistic jargon, they have the same "referent."...

 tagger. ANNIE can be used as-is to provide basic information extraction
Information extraction
Information extraction is a type of information retrieval whose goal is to automatically extract structured information from unstructured and/or semi-structured machine-readable documents. In most of the cases this activity concerns processing human language texts by means of natural language...

 functionality, or provide a starting point for more specific tasks.

Languages currently handled in GATE include English
English language
English is a West Germanic language that arose in the Anglo-Saxon kingdoms of England and spread into what was to become south-east Scotland under the influence of the Anglian medieval kingdom of Northumbria...

, Spanish
Spanish language
Spanish , also known as Castilian , is a Romance language in the Ibero-Romance group that evolved from several languages and dialects in central-northern Iberia around the 9th century and gradually spread with the expansion of the Kingdom of Castile into central and southern Iberia during the...

, Chinese, Arabic, Bulgarian
Bulgarian language
Bulgarian is an Indo-European language, a member of the Slavic linguistic group.Bulgarian, along with the closely related Macedonian language, demonstrates several linguistic characteristics that set it apart from all other Slavic languages such as the elimination of case declension, the...

, French
French language
French is a Romance language spoken as a first language in France, the Romandy region in Switzerland, Wallonia and Brussels in Belgium, Monaco, the regions of Quebec and Acadia in Canada, and by various communities elsewhere. Second-language speakers of French are distributed throughout many parts...

, German
German language
German is a West Germanic language, related to and classified alongside English and Dutch. With an estimated 90 – 98 million native speakers, German is one of the world's major languages and is the most widely-spoken first language in the European Union....

, Hindi
Hindi
Standard Hindi, or more precisely Modern Standard Hindi, also known as Manak Hindi , High Hindi, Nagari Hindi, and Literary Hindi, is a standardized and sanskritized register of the Hindustani language derived from the Khariboli dialect of Delhi...

, Italian
Italian language
Italian is a Romance language spoken mainly in Europe: Italy, Switzerland, San Marino, Vatican City, by minorities in Malta, Monaco, Croatia, Slovenia, France, Libya, Eritrea, and Somalia, and by immigrant communities in the Americas and Australia...

, Cebuano
Cebuano language
Cebuano, referred to by most of its speakers as Bisaya , is an Austronesian language spoken in the Philippines by about 20 million people mostly in the Central Visayas. It is the most widely spoken of the languages within the so-named Bisayan subgroup and is closely related to other Filipino...

, Romanian
Romanian language
Romanian Romanian Romanian (or Daco-Romanian; obsolete spellings Rumanian, Roumanian; self-designation: română, limba română ("the Romanian language") or românește (lit. "in Romanian") is a Romance language spoken by around 24 to 28 million people, primarily in Romania and Moldova...

, Russian
Russian language
Russian is a Slavic language used primarily in Russia, Belarus, Uzbekistan, Kazakhstan, Tajikistan and Kyrgyzstan. It is an unofficial but widely spoken language in Ukraine, Moldova, Latvia, Turkmenistan and Estonia and, to a lesser extent, the other countries that were once constituent republics...

.

Plugins are included for machine learning
Machine learning
Machine learning, a branch of artificial intelligence, is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as from sensor data or databases...

 with Weka
Weka (machine learning)
Weka is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand...

, RASP, MAXENT, SVM Light, as well as a fast LibSVM integration and an in-house perceptron
Perceptron
The perceptron is a type of artificial neural network invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. It can be seen as the simplest kind of feedforward neural network: a linear classifier.- Definition :...

 implementation, for managing Ontology (information science) like WordNet
WordNet
WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synonym sets...

, for querying search engines like Google
Google
Google Inc. is an American multinational public corporation invested in Internet search, cloud computing, and advertising technologies. Google hosts and develops a number of Internet-based services and products, and generates profit primarily from advertising through its AdWords program...

 or Yahoo, for part of speech tagging with Brill
Brill Tagger
The Brill tagger is a method for doing part-of-speech tagging. It was described by Eric Brill in his 1993 PhD thesis . It can be summarized as an "error-driven transformation-based tagger". It is...

 or TreeTagger, and many more.

GATE accepts input in various formats, such as TXT
Text file
A text file is a kind of computer file that is structured as a sequence of lines of electronic text. A text file exists within a computer file system...

, HTML
HTML
HyperText Markup Language is the predominant markup language for web pages. HTML elements are the basic building-blocks of webpages....

, XML
XML
Extensible Markup Language is a set of rules for encoding documents in machine-readable form. It is defined in the XML 1.0 Specification produced by the W3C, and several other related specifications, all gratis open standards....

, Doc
DOC (computing)
In computing, DOC or doc is a filename extension for word processing documents; most commonly for Microsoft Word. Historically, the extension was used for documentation in plain-text format, particularly of programs or computer hardware, on a wide range of operating systems...

, PDF documents, and Java Serial
Serialization
In computer science, in the context of data storage and transmission, serialization is the process of converting a data structure or object state into a format that can be stored and "resurrected" later in the same or another computer environment...

, PostgreSQL
PostgreSQL
PostgreSQL, often simply Postgres, is an object-relational database management system available for many platforms including Linux, FreeBSD, Solaris, MS Windows and Mac OS X. It is released under the PostgreSQL License, which is an MIT-style license, and is thus free and open source software...

, Lucene
Lucene
Apache Lucene is a free/open source information retrieval software library, originally created in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License....

, Oracle
Oracle database
The Oracle Database is an object-relational database management system produced and marketed by Oracle Corporation....

 Databases with help of RDBMS storage over JDBC.

JAPE transducers are used within GATE to manipulate annotations on text. Documentation is provided in the GATE User Guide. A tutorial has also been written by Press Association Images.

GATE Developer

The screenshot shows the document viewer used to display a document and its annotations. In pink are hyperlink annotations from an HTML file. The right list is the annotation sets list, and the bottom table is the annotation list. In the center is the annotation editor window.

GATE Mímir

GATE based applications often generate vast quantities of information including; natural language text, semantic annotations, and ontological information. Sometimes the data itself is the end product of an application but often the information would be more useful if it could be efficiently searched. GATE Mimir provides support for indexing and searching the linguistic and semantic information generated by such applications and allows for querying the information using arbitrary combinations of text, structural information, and
SPARQL
SPARQL
SPARQL is an RDF query language; its name is an acronym that stands for SPARQL Protocol and RDF Query Language. It was made a standard by the RDF Data Access Working Group of the World Wide Web Consortium, and considered as one of the key technologies of semantic web...

.

See also

  • Unstructured Information Management Architecture (UIMA)
  • OpenNLP
    OpenNLP
    The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution. These tasks...

  • Natural language processing toolkits
    Natural language processing toolkits
    The following natural language processing toolkits are popular collections of natural language processing software. They are suites of libraries, frameworks, and applications for symbolic, statistical natural language and speech processing...

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK