EXCLAIM
The EXtensible Cross-Linguistic Automatic Information Machine (EXCLAIM) is an integrated tool for cross-language information retrieval (CLIR), created at the University of California, Santa Cruz in early 2006. It is currently in a beta stage of development, with some support for more than a dozen languages. The lead developers are Justin Nuger and Jesse Saba Kirchner.

Early work on CLIR depended on manually constructed parallel corpora for each pair of languages. Building corpora by hand is labor-intensive compared with creating them automatically. A more efficient way of finding data to train a CLIR system is to use matching pages on the web that are written in different languages.

EXCLAIM capitalizes on the idea of latent parallel corpora on the web by automating the alignment of such corpora in various domains. The most significant of these is Wikipedia itself, which includes articles in 250 languages. The role of EXCLAIM is to use semantic and linguistic analytic tools to align the information in these Wikipedias so that they can be treated as parallel corpora. EXCLAIM is also extensible to incorporate information from many other sources, such as the Chinese Community Health Resource Center (CCHRC).
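
The idea of treating different language editions as a latent parallel corpus can be illustrated with a minimal Python sketch. It assumes that counterpart articles are already linked by title; the `INTERLANGUAGE_LINKS` table and the `align_titles` function are hypothetical names with illustrative toy data, and nothing here is taken from EXCLAIM's actual alignment method.

```python
from typing import Dict, List, Tuple

# Hypothetical toy data (not from a real dump): each English article title
# maps to the titles of its counterparts, keyed by language code.
INTERLANGUAGE_LINKS: Dict[str, Dict[str, str]] = {
    "Water": {"cy": "Dŵr", "sw": "Maji", "tr": "Su"},
    "Moon": {"cy": "Lleuad", "sw": "Mwezi"},
    "Language": {"tr": "Dil"},
}


def align_titles(
    links: Dict[str, Dict[str, str]], target_lang: str
) -> List[Tuple[str, str]]:
    """Return (English title, target title) pairs for one target language."""
    pairs = []
    # Sort by English title so the output order is deterministic.
    for en_title, counterparts in sorted(links.items()):
        if target_lang in counterparts:
            pairs.append((en_title, counterparts[target_lang]))
    return pairs


if __name__ == "__main__":
    for pair in align_titles(INTERLANGUAGE_LINKS, "sw"):
        print(pair)
```

Articles without a counterpart in the target language are simply skipped, which is one reason coverage differs from language to language.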

One of the main goals of the EXCLAIM project is to provide minority and endangered languages with the kind of computational and CLIR tools that are often available only for powerful or prosperous majority languages.

Current status

EXCLAIM is in a beta state, with varying degrees of functionality for different languages. Support for CLIR using the Wikipedia dataset and the most current version of EXCLAIM (v.0.5), including full UTF-8 support and Porter stemming for the English component, is available for the following twenty-three languages:
Albanian
Amharic
Bengali
Gothic
Greek
Icelandic
Indonesian
Irish
Javanese
Latvian
Malagasy
Mandarin Chinese
Nahuatl
Navajo
Quechua
Sardinian
Swahili
Tagalog
Tibetan
Turkish
Welsh
Wolof
Yiddish


Support using the Wikipedia dataset and an earlier version of EXCLAIM (v.0.3) is available for the following languages:
Dutch
Spanish


Significant developments in the most recent version of EXCLAIM include support for Mandarin Chinese. By developing support for this language, EXCLAIM has added solutions to segmentation and encoding problems, which will allow the system to be extended to many other languages written with non-European orthographic conventions. This support is supplied through the Trimming And Reformatting Modular System (TARMS) toolkit.
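
To make the segmentation problem concrete, here is a minimal Python sketch of forward maximum matching, a common dictionary-based baseline for text written without spaces between words. It is an assumption for illustration only: the article does not say which technique TARMS uses, and the `max_match` function and the toy `LEXICON` are hypothetical.

```python
from typing import List, Set

# Hypothetical toy lexicon; a real system would load a large word list.
LEXICON: Set[str] = {"中文", "信息", "检索", "信息检索"}


def max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest lexicon entry."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    # Decode bytes to str first so that slicing works on characters, not bytes.
    raw = "中文信息检索".encode("utf-8")
    print(max_match(raw.decode("utf-8"), LEXICON))
```

Decoding to Unicode text before segmenting matters because a single Chinese character occupies several bytes in UTF-8, so slicing the raw bytes would split characters apart.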

Future versions of EXCLAIM will extend the system to additional languages. Other goals include incorporation of available latent datasets in addition to the Wikipedia dataset.

The EXCLAIM development plan calls for EXCLAIM 1.0 to be an integrated CLIR instrument that can search from English for information in any of the supported languages, or from any of the supported languages for information in English. Future versions will allow searching from any supported language into any other, and searching from and into multiple languages.

Further applications

EXCLAIM has been incorporated into several projects which rely on cross-language query expansion as part of their backends. One such project is a cross-linguistic readability software generation framework, detailed in work presented at ACL 2009.
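
Cross-language query expansion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TRANSLATIONS` table stands in for whatever term mappings an aligned corpus would provide, and `expand_query` is a hypothetical helper, not part of EXCLAIM or of the projects that use it.

```python
from typing import Dict, List

# Hypothetical translation table, e.g. derived from aligned article titles.
TRANSLATIONS: Dict[str, Dict[str, List[str]]] = {
    "water": {"sw": ["maji"], "tr": ["su"]},
    "moon": {"sw": ["mwezi"]},
}


def expand_query(
    query: str, target_lang: str, table: Dict[str, Dict[str, List[str]]]
) -> List[str]:
    """Map each query term to its target-language candidates.

    Terms with no known translation are kept unchanged.
    """
    expanded: List[str] = []
    for term in query.lower().split():
        expanded.extend(table.get(term, {}).get(target_lang, [term]))
    return expanded


if __name__ == "__main__":
    print(expand_query("Water Moon", "sw", TRANSLATIONS))
```

Keeping untranslated terms in the query is a simple fallback; it helps with proper nouns and loanwords that are often spelled the same way across languages.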

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.