Earley parser
Encyclopedia
In computer science
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...

, the Earley parser is an algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...

 for parsing
Parsing
In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens , to determine its grammatical structure with respect to a given formal grammar...

 strings
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

 that belong to a given context-free language
Context-free language
In formal language theory, a context-free language is a language generated by some context-free grammar. The set of all context-free languages is identical to the set of languages accepted by pushdown automata.-Examples:...

, named after its inventor, Jay Earley
Jay Earley
Jay Earley is an American computer scientist and psychologist. He invented the Earley parser in his early career in computer science. Later he became a clinical psychologist specializing in group therapy and Internal Family Systems Therapy ....

. The algorithm is a chart parser
Chart parser
A chart parser is a type of parser suitable for ambiguous grammars . It uses the dynamic programming approach—partial hypothesized results are stored in a structure called a chart and can be re-used...

 that uses dynamic programming
Dynamic programming
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...

, and is mainly used for parsing in computational linguistics
Computational linguistics
Computational linguistics is an interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective....

.

Earley parsers are appealing because they can parse all context-free languages, unlike LR parser
LR parser
In computer science, an LR parser is a parser that reads input from Left to right and produces a Rightmost derivation. The term LR parser is also used; where the k refers to the number of unconsumed "look ahead" input symbols that are used in making parsing decisions...

s and LL parser
LL parser
An LL parser is a top-down parser for a subset of the context-free grammars. It parses the input from Left to right, and constructs a Leftmost derivation of the sentence...

s which are more typically used in compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...

s but which can only handle restricted classes of languages. The Earley parser executes in cubic time in the general case , where n is the length of the parsed string, quadratic time for unambiguous grammars , and linear time for almost all LR(k) grammars
LR parser
In computer science, an LR parser is a parser that reads input from Left to right and produces a Rightmost derivation. The term LR parser is also used; where the k refers to the number of unconsumed "look ahead" input symbols that are used in making parsing decisions...

. It performs particularly well when the rules are written left-recursively
Left recursion
In computer science, left recursion is a special case of recursion.In terms of context-free grammar, a non-terminal r is left-recursive if the left-most symbol in any of r’s ‘alternatives’ either immediately or through some other non-terminal definitions rewrites to r again.- Definition :"A...

.

The algorithm

In the following descriptions, α, β, and γ represent any string
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....

 of terminals/nonterminals
Terminal and nonterminal symbols
In computer science, terminal and nonterminal symbols are the lexical elements used in specifying the production rules that constitute a formal grammar...

 (including the empty string
Empty string
In computer science and formal language theory, the empty string is the unique string of length zero. It is denoted with λ or sometimes Λ or ε....

), X and Y represent single nonterminals, and a represents a terminal symbol.

Earley's algorithm is a top-down dynamic programming
Dynamic programming
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...

 algorithm. In the following, we use Earley's dot notation: given a production X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.

For every input position (which represents a position between tokens
Lexical analysis
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner...

), the parser generates an ordered state set. Each state is a tuple
Tuple
In mathematics and computer science, a tuple is an ordered list of elements. In set theory, an n-tuple is a sequence of n elements, where n is a positive integer. There is also one 0-tuple, an empty sequence. An n-tuple is defined inductively using the construction of an ordered pair...

 (X → α • β, i), consisting of
  • the production currently being matched (X → α β)
  • our current position in that production (represented by the dot)
  • the position i in the input at which the matching of this production began: the origin position


(Earley's original algorithm included a look-ahead in the state; later research showed this to have little practical effect on the parsing efficiency, and it has subsequently been dropped from most implementations.)

The state set at input position k is called S(k). The parser is seeded with S(0) consisting of only the top-level rule. The parser then iteratively operates in three stages: prediction, scanning, and completion.
  • Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, k) to S(k) for every production in the grammar with Y on the left-hand side (Y → γ).

  • Scanning: If a is the next symbol in the input stream, for every state in S(k) of the form (X → α • a β, j), add (X → α a • β, j) to S(k+1).

  • Completion: For every state in S(k) of the form (X → γ •, j), find states in S(j) of the form (Y → α • X β, i) and add (Y → α X • β, i) to S(k).


These steps are repeated until no more states can be added to the set. The set is generally implemented as a queue of states to process (though a given state must appear in one place only), and performing the corresponding operation depending on what kind of state it is.

Pseudocode

(Adapted from http://books.google.co.uk/books?id=fZmj5UNK8AQC&q=earley#search_anchor)


function EARLEY-PARSE(words, grammar)
ENQUEUE((γ → •S, 0), chart[0])
for i ← from 0 to LENGTH(words) do
for each state in chart[i] do
if INCOMPLETE?(state) then
if NEXT-CAT(state) is a nonterminal then
PREDICTOR(state, i, grammar) // non-terminal
else do
SCANNER(state, i) // terminal
else do
COMPLETER(state, i)
end
end
return chart

procedure PREDICTOR((A → α•B, i), j, grammar),
for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
ADD-TO-SET((B → •γ, j), chart[ j])
end

procedure SCANNER((A → α•B, i), j),
if B ⊂ PARTS-OF-SPEECH(word[j]) then
ADD-TO-SET((B → word[j], i), chart[j + 1])
end

procedure COMPLETER((B → γ•, j), k),
for each (A → α•Bβ, i) in chart[j] do
ADD-TO-SET((A → αB•β, i), chart[k])
end

Example

Consider the following simple grammar for arithmetic expressions:

::= S # the start rule
::= "+" |
::= "*" |
::= "1" | "2" | "3" | "4"

With the input:
2 + 3 * 4

This is the sequence of state sets:

(state no.) Production (Origin) # Comment
-----------------------------------------

S(0): • 2 + 3 * 4

(1) P → • S (0) # start rule
(2) S → • S + M (0) # predict from (1)
(3) S → • M (0) # predict from (1)
(4) M → • M * T (0) # predict from (3)
(5) M → • T (0) # predict from (3)
(6) T → • number (0) # predict from (5)

S(1): 2 • + 3 * 4

(1) T → number • (0) # scan from S(0)(6)
(2) M → T • (0) # complete from (1) and S(0)(5)
(3) M → M • * T (0) # complete from (2) and S(0)(4)
(4) S → M • (0) # complete from (2) and S(0)(3)
(5) S → S • + M (0) # complete from (4) and S(0)(2)
(6) P → S • (0) # complete from (4) and S(0)(1)

S(2): 2 + • 3 * 4

(1) S → S + • M (0) # scan from S(1)(5)
(2) M → • M * T (2) # predict from (1)
(3) M → • T (2) # predict from (1)
(4) T → • number (2) # predict from (3)

S(3): 2 + 3 • * 4

(1) T → number • (2) # scan from S(2)(4)
(2) M → T • (2) # complete from (1) and S(2)(3)
(3) M → M • * T (2) # complete from (2) and S(2)(2)
(4) S → S + M • (0) # complete from (2) and S(2)(1)
(5) S → S • + M (0) # complete from (4) and S(0)(2)
(6) P → S • (0) # complete from (4) and S(0)(1)

S(4): 2 + 3 * • 4

(1) M → M * • T (2) # scan from S(3)(3)
(2) T → • number (4) # predict from (1)

S(5): 2 + 3 * 4 •

(1) T → number • (4) # scan from S(4)(2)
(2) M → M * T • (2) # complete from (1) and S(4)(1)
(3) M → M • * T (2) # complete from (2) and S(2)(2)
(4) S → S + M • (0) # complete from (2) and S(2)(1)
(5) S → S • + M (0) # complete from (4) and S(0)(2)
(6) P → S • (0) # complete from (4) and S(0)(1)

The state (P → S •, 0) represents a completed parse. This state also appears in S(3) and S(1), which are complete sentences.

See also

  • CYK algorithm
    CYK algorithm
    The Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars. It employs bottom-up parsing and dynamic programming....

  • Context-free grammar
    Context-free grammar
    In formal language theory, a context-free grammar is a formal grammar in which every production rule is of the formwhere V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals ....

  • Parsing Algorithms

C Implementations

  • 'early' An Earley parser C
    C (programming language)
    C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

     -library.
  • 'C Earley Parser' An Earley parser C
    C (programming language)
    C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

    .

Java Implementations


Perl Implementations

  • Marpa::XS and Marpa::PP, Perl
    Perl
    Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...

     modules, incorporating improvements made to the Earley algorithm by Joop Leo, and by Aycock and Horspool.
  • Parse::Earley An Perl
    Perl
    Perl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...

     module that implements Jay Earley's original algorithm.

Python Implementations

  • Charty a Python
    Python (programming language)
    Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

     implementation of an Earley parser.
  • NLTK a Python
    Python (programming language)
    Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

     toolkit that has an Earley parser.
  • Spark an Object Oriented "little language framework" for Python
    Python (programming language)
    Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

     that implements an Earley parser.

Common Lisp Implementations


Scheme/Racket Implementations

  • Charty-Racket A Scheme / Racket implementation of an Earley parser.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK