Computational lexicology
Computational lexicology is the branch of computational linguistics concerned with the use of computers in the study of the lexicon. It has been more narrowly described by some scholars (Amsler, 1980) as the use of computers in the study of machine-readable dictionaries. It is distinguished from computational lexicography, which more properly refers to the use of computers in the construction of dictionaries, though some researchers have used the two terms synonymously.

History

Computational lexicology emerged as a separate discipline within computational linguistics with the appearance of machine-readable dictionaries, starting with the creation of the machine-readable tapes of the Merriam-Webster Seventh Collegiate Dictionary and the Merriam-Webster New Pocket Dictionary in the 1960s by John Olney et al. at System Development Corporation. Today, computational lexicology is best known through the creation and applications of WordNet.

Study of lexicon

Computational lexicology has contributed to the understanding of the content and limitations of print dictionaries for computational purposes; in particular, it made clear that earlier lexicographic work was not sufficient for the needs of computational linguistics. Through the work of computational lexicologists, almost every portion of a print dictionary entry has been studied, including:
  1. what constitutes a headword - used to generate spelling correction lists;
  2. what variants and inflections the headword forms - used to empirically understand morphology;
  3. how the headword is delimited into syllables;
  4. how the headword is pronounced - used in speech generation systems;
  5. the parts of speech the headword takes on - used for POS taggers;
  6. any special subject or usage codes assigned to the headword - used to identify text document subject matter;
  7. the headword's definitions and their syntax - used as an aid to disambiguation of words in context;
  8. the etymology of the headword - used to characterize text vocabulary by its languages of origin;
  9. the example sentences;
  10. the run-ons (additional words and multi-word expressions that are formed from the headword); and
  11. related words such as synonyms and antonyms.
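The portions of an entry listed above map naturally onto a record type. The following is a minimal sketch, not the layout of any actual machine-readable dictionary: the class names, field names, helper functions, and the sample entry for "happy" are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Sense:
    """One numbered definition of a headword (hypothetical layout)."""
    definition: str
    subject_codes: list[str] = field(default_factory=list)
    examples: list[str] = field(default_factory=list)


@dataclass
class Entry:
    """A print-dictionary entry broken into the portions studied above."""
    headword: str
    syllables: list[str]
    pronunciation: str
    parts_of_speech: list[str]
    inflections: list[str] = field(default_factory=list)
    senses: list[Sense] = field(default_factory=list)
    etymology: str = ""
    run_ons: list[str] = field(default_factory=list)
    synonyms: list[str] = field(default_factory=list)
    antonyms: list[str] = field(default_factory=list)


def spelling_list(entries: list[Entry]) -> list[str]:
    """Collect headwords, inflections and run-ons as a spelling word list."""
    words: set[str] = set()
    for entry in entries:
        words.add(entry.headword)
        words.update(entry.inflections)
        words.update(entry.run_ons)
    return sorted(words)


def pos_lexicon(entries: list[Entry]) -> dict[str, set[str]]:
    """Map each headword and inflected form to its possible parts of speech."""
    lexicon: dict[str, set[str]] = {}
    for entry in entries:
        for form in [entry.headword, *entry.inflections]:
            lexicon.setdefault(form, set()).update(entry.parts_of_speech)
    return lexicon


# Illustrative entry; the content is made up for the sketch.
EXAMPLE = Entry(
    headword="happy",
    syllables=["hap", "py"],
    pronunciation="ˈhæpi",
    parts_of_speech=["adjective"],
    inflections=["happier", "happiest"],
    senses=[Sense("feeling or showing pleasure", examples=["a happy child"])],
    run_ons=["happily", "happiness"],
    synonyms=["glad"],
    antonyms=["sad"],
)
```

Calling spelling_list([EXAMPLE]) yields the headword together with its inflections and run-ons, and pos_lexicon([EXAMPLE]) gives the kind of word-to-tag table a POS tagger could start from.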


Many computational linguists became disenchanted with print dictionaries as a resource for computational linguistics because they lacked sufficient syntactic and semantic information for computer programs. Work on computational lexicology quickly led to efforts in two additional directions.

Successors to computational lexicology

First, collaborative activities between computational linguists and lexicographers led to an understanding of the role that corpora played in creating dictionaries. Most computational lexicologists moved on to build large corpora to gather the basic data that lexicographers had used to create dictionaries. The ACL/DCI (Data Collection Initiative) and the LDC (Linguistic Data Consortium) went down this path. The advent of markup languages led to the creation of tagged corpora that could be more easily analyzed to create computational linguistic systems. Part-of-speech tagged corpora and semantically tagged corpora were created in order to test and develop POS taggers and word sense disambiguation technology.
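How a part-of-speech tagged corpus supports tagger development can be sketched with a most-frequent-tag baseline. This is an illustration only: the word/TAG slash notation, the tag names, and the four sample lines are assumptions, not data from any of the corpora mentioned above.

```python
from collections import Counter, defaultdict

# Hypothetical tagged corpus, one sentence per line, in word/TAG notation.
TAGGED = [
    "the/DET dog/NOUN barks/VERB",
    "the/DET old/ADJ man/NOUN",
    "they/PRON man/VERB the/DET boats/NOUN",
    "the/DET man/NOUN walks/VERB",
]


def parse_line(line):
    """Split a tagged line into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last slash, so words containing '/' survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word.lower(), tag))
    return pairs


def train_baseline(lines):
    """Learn, for every word, the tag it carries most often in the corpus."""
    counts = defaultdict(Counter)
    for line in lines:
        for word, tag in parse_line(line):
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NOUN"):
    """Tag words with the learned model; unknown words get the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


MODEL = train_baseline(TAGGED)
```

In this sample "man" occurs twice as a noun and once as a verb, so the baseline always tags it as a noun. Errors of that kind are what a held-out tagged corpus makes measurable.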

The second direction was toward the creation of Lexical Knowledge Bases (LKBs). A Lexical Knowledge Base was deemed to be what a dictionary should be for computational linguistic purposes, especially for computational lexical semantic purposes. It was to have the same information as a print dictionary, but made fully explicit as to the meanings of the words and the appropriate links between senses. Many researchers began creating the resources they wished dictionaries had been, had they been created for use in computational analysis. WordNet can be considered such a development, as can newer efforts at describing syntactic and semantic information such as the FrameNet work of Fillmore. Outside of computational linguistics, the ontology work of artificial intelligence can be seen as an evolutionary effort to build a lexical knowledge base for AI applications.
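The idea of senses with explicit links between them can be illustrated with a toy knowledge base. The sketch below is loosely modeled on the synonym-set idea behind WordNet but is not WordNet's data format or API; the identifiers, glosses, and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A set of synonymous words sharing one meaning (toy structure)."""
    ident: str
    words: list
    gloss: str
    hypernyms: list = field(default_factory=list)  # ids of more general synsets


# Hypothetical three-node knowledge base.
LKB = {
    "car.n": Synset("car.n", ["car", "automobile"],
                    "a motor vehicle with four wheels", ["vehicle.n"]),
    "vehicle.n": Synset("vehicle.n", ["vehicle"],
                        "a conveyance that transports people or objects",
                        ["artifact.n"]),
    "artifact.n": Synset("artifact.n", ["artifact"], "a man-made object"),
}


def senses_of(word, lkb):
    """Return the ids of every synset that lists the word."""
    return sorted(s.ident for s in lkb.values() if word in s.words)


def are_synonyms(a, b, lkb):
    """Two words are synonyms if at least one synset contains both."""
    return any(a in s.words and b in s.words for s in lkb.values())


def hypernym_chain(ident, lkb):
    """Follow the first hypernym link upward until no link remains."""
    chain, seen = [], {ident}
    current = lkb[ident]
    while current.hypernyms:
        parent = current.hypernyms[0]
        if parent in seen:  # guard against accidental cycles in the data
            break
        chain.append(parent)
        seen.add(parent)
        current = lkb[parent]
    return chain
```

Unlike a print entry, where the relation between "car" and "vehicle" is buried in definition text, here the link is stored directly and can be traversed by a program.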

Standardization

Optimizing the production, maintenance and extension of computational lexicons is one of the crucial aspects impacting natural language processing (NLP). The main problem is interoperability: various lexicons are frequently incompatible. The most frequent situation is the need to merge two lexicons, or fragments of lexicons. A secondary problem is that a lexicon is usually tailored to a specific NLP program and is difficult to use within other NLP programs or applications.
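The merge problem can be made concrete with two tiny lexicons that use different part-of-speech labels. This is a sketch under stated assumptions: the lexicons, the tag inventories, and the mapping table are invented for illustration and do not follow any particular standard.

```python
# Two hypothetical lexicons mapping each word to its set of POS tags.
# LEX_A uses short codes; LEX_B spells the categories out.
LEX_A = {"run": {"V", "N"}, "dog": {"N"}}
LEX_B = {"run": {"verb"}, "quick": {"adjective"}}

# Assumed mapping from LEX_B's tag names onto LEX_A's inventory.
B_TO_A = {"verb": "V", "noun": "N", "adjective": "ADJ"}


def normalize(lexicon, tag_map):
    """Rewrite every tag through tag_map, keeping tags the map does not know."""
    return {
        word: {tag_map.get(tag, tag) for tag in tags}
        for word, tags in lexicon.items()
    }


def merge(a, b):
    """Union two lexicons word by word without modifying either input."""
    merged = {word: set(tags) for word, tags in a.items()}
    for word, tags in b.items():
        merged.setdefault(word, set()).update(tags)
    return merged


NAIVE = merge(LEX_A, LEX_B)                      # tags left incompatible
MERGED = merge(LEX_A, normalize(LEX_B, B_TO_A))  # tags aligned first
```

In NAIVE the entry for "run" ends up with both "V" and "verb", two labels for the same category, while MERGED carries a single consistent inventory. Real lexicons differ in far more than tag names, which is why a shared data model is needed.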

In this respect, the various data models of computational lexicons have been studied by ISO/TC37 since 2003 within the Lexical Markup Framework project, leading to an ISO standard (ISO 24613:2008) in 2008.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.