Earley parser
Encyclopedia
In computer science
, the Earley parser is an algorithm
for parsing
strings
that belong to a given context-free language
, named after its inventor, Jay Earley
. The algorithm is a chart parser
that uses dynamic programming
, and is mainly used for parsing in computational linguistics
.
Earley parsers are appealing because they can parse all context-free languages, unlike LR parser
s and LL parser
s which are more typically used in compiler
s but which can only handle restricted classes of languages. The Earley parser executes in cubic time in the general case , where n is the length of the parsed string, quadratic time for unambiguous grammars , and linear time for almost all LR(k) grammars
. It performs particularly well when the rules are written left-recursively
.
of terminals/nonterminals
(including the empty string
), X and Y represent single nonterminals, and a represents a terminal symbol.
Earley's algorithm is a top-down dynamic programming
algorithm. In the following, we use Earley's dot notation: given a production X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.
For every input position (which represents a position between tokens
), the parser generates an ordered state set. Each state is a tuple
(X → α • β, i), consisting of
(Earley's original algorithm included a look-ahead in the state; later research showed this to have little practical effect on the parsing efficiency, and it has subsequently been dropped from most implementations.)
The state set at input position k is called S(k). The parser is seeded with S(0) consisting of only the top-level rule. The parser then iteratively operates in three stages: prediction, scanning, and completion.
These steps are repeated until no more states can be added to the set. The set is generally implemented as a queue of states to process (though a given state must appear in one place only), and performing the corresponding operation depending on what kind of state it is.
function EARLEY-PARSE(words, grammar)
ENQUEUE((γ → •S, 0), chart[0])
for i ← from 0 to LENGTH(words) do
for each state in chart[i] do
if INCOMPLETE?(state) then
if NEXT-CAT(state) is a nonterminal then
PREDICTOR(state, i, grammar) // non-terminal
else do
SCANNER(state, i) // terminal
else do
COMPLETER(state, i)
end
end
return chart
procedure PREDICTOR((A → α•B, i), j, grammar),
for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
ADD-TO-SET((B → •γ, j), chart[ j])
end
procedure SCANNER((A → α•B, i), j),
if B ⊂ PARTS-OF-SPEECH(word[j]) then
ADD-TO-SET((B → word[j], i), chart[j + 1])
end
procedure COMPLETER((B → γ•, j), k),
for each (A → α•Bβ, i) in chart[j] do
ADD-TO-SET((A → αB•β, i), chart[k])
end
With the input:
2 + 3 * 4
This is the sequence of state sets:
(state no.) Production (Origin) # Comment
-----------------------------------------
(2) S → • S + M (0) # predict from (1)
(3) S → • M (0) # predict from (1)
(4) M → • M * T (0) # predict from (3)
(5) M → • T (0) # predict from (3)
(6) T → • number (0) # predict from (5)
(2) M → T • (0) # complete from (1) and S(0)(5)
(3) M → M • * T (0) # complete from (2) and S(0)(4)
(4) S → M • (0) # complete from (2) and S(0)(3)
(5) S → S • + M (0) # complete from (4) and S(0)(2)
(6) P → S • (0) # complete from (4) and S(0)(1)
(2) M → • M * T (2) # predict from (1)
(3) M → • T (2) # predict from (1)
(4) T → • number (2) # predict from (3)
(2) M → T • (2) # complete from (1) and S(2)(3)
(3) M → M • * T (2) # complete from (2) and S(2)(2)
(4) S → S + M • (0) # complete from (2) and S(2)(1)
(5) S → S • + M (0) # complete from (4) and S(0)(2)
(6) P → S • (0) # complete from (4) and S(0)(1)
(2) T → • number (4) # predict from (1)
(2) M → M * T • (2) # complete from (1) and S(4)(1)
(3) M → M • * T (2) # complete from (2) and S(2)(2)
(4) S → S + M • (0) # complete from (2) and S(2)(1)
(5) S → S • + M (0) # complete from (4) and S(0)(2)
(6) P → S • (0) # complete from (4) and S(0)(1)
The state (P → S •, 0) represents a completed parse. This state also appears in S(3) and S(1), which are complete sentences.
Computer science
Computer science or computing science is the study of the theoretical foundations of information and computation and of practical techniques for their implementation and application in computer systems...
, the Earley parser is an algorithm
Algorithm
In mathematics and computer science, an algorithm is an effective method expressed as a finite list of well-defined instructions for calculating a function. Algorithms are used for calculation, data processing, and automated reasoning...
for parsing
Parsing
In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a text, made of a sequence of tokens , to determine its grammatical structure with respect to a given formal grammar...
strings
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....
that belong to a given context-free language
Context-free language
In formal language theory, a context-free language is a language generated by some context-free grammar. The set of all context-free languages is identical to the set of languages accepted by pushdown automata.-Examples:...
, named after its inventor, Jay Earley
Jay Earley
Jay Earley is an American computer scientist and psychologist. He invented the Earley parser in his early career in computer science. Later he became a clinical psychologist specializing in group therapy and Internal Family Systems Therapy ....
. The algorithm is a chart parser
Chart parser
A chart parser is a type of parser suitable for ambiguous grammars . It uses the dynamic programming approach—partial hypothesized results are stored in a structure called a chart and can be re-used...
that uses dynamic programming
Dynamic programming
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...
, and is mainly used for parsing in computational linguistics
Computational linguistics
Computational linguistics is an interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective....
.
Earley parsers are appealing because they can parse all context-free languages, unlike LR parser
LR parser
In computer science, an LR parser is a parser that reads input from Left to right and produces a Rightmost derivation. The term LR parser is also used; where the k refers to the number of unconsumed "look ahead" input symbols that are used in making parsing decisions...
s and LL parser
LL parser
An LL parser is a top-down parser for a subset of the context-free grammars. It parses the input from Left to right, and constructs a Leftmost derivation of the sentence...
s which are more typically used in compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...
s but which can only handle restricted classes of languages. The Earley parser executes in cubic time in the general case , where n is the length of the parsed string, quadratic time for unambiguous grammars , and linear time for almost all LR(k) grammars
LR parser
In computer science, an LR parser is a parser that reads input from Left to right and produces a Rightmost derivation. The term LR parser is also used; where the k refers to the number of unconsumed "look ahead" input symbols that are used in making parsing decisions...
. It performs particularly well when the rules are written left-recursively
Left recursion
In computer science, left recursion is a special case of recursion.In terms of context-free grammar, a non-terminal r is left-recursive if the left-most symbol in any of r’s ‘alternatives’ either immediately or through some other non-terminal definitions rewrites to r again.- Definition :"A...
.
The algorithm
In the following descriptions, α, β, and γ represent any stringString (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....
of terminals/nonterminals
Terminal and nonterminal symbols
In computer science, terminal and nonterminal symbols are the lexical elements used in specifying the production rules that constitute a formal grammar...
(including the empty string
Empty string
In computer science and formal language theory, the empty string is the unique string of length zero. It is denoted with λ or sometimes Λ or ε....
), X and Y represent single nonterminals, and a represents a terminal symbol.
Earley's algorithm is a top-down dynamic programming
Dynamic programming
In mathematics and computer science, dynamic programming is a method for solving complex problems by breaking them down into simpler subproblems. It is applicable to problems exhibiting the properties of overlapping subproblems which are only slightly smaller and optimal substructure...
algorithm. In the following, we use Earley's dot notation: given a production X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected.
For every input position (which represents a position between tokens
Lexical analysis
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. A program or function which performs lexical analysis is called a lexical analyzer, lexer or scanner...
), the parser generates an ordered state set. Each state is a tuple
Tuple
In mathematics and computer science, a tuple is an ordered list of elements. In set theory, an n-tuple is a sequence of n elements, where n is a positive integer. There is also one 0-tuple, an empty sequence. An n-tuple is defined inductively using the construction of an ordered pair...
(X → α • β, i), consisting of
- the production currently being matched (X → α β)
- our current position in that production (represented by the dot)
- the position i in the input at which the matching of this production began: the origin position
(Earley's original algorithm included a look-ahead in the state; later research showed this to have little practical effect on the parsing efficiency, and it has subsequently been dropped from most implementations.)
The state set at input position k is called S(k). The parser is seeded with S(0) consisting of only the top-level rule. The parser then iteratively operates in three stages: prediction, scanning, and completion.
- Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, k) to S(k) for every production in the grammar with Y on the left-hand side (Y → γ).
- Scanning: If a is the next symbol in the input stream, for every state in S(k) of the form (X → α • a β, j), add (X → α a • β, j) to S(k+1).
- Completion: For every state in S(k) of the form (X → γ •, j), find states in S(j) of the form (Y → α • X β, i) and add (Y → α X • β, i) to S(k).
These steps are repeated until no more states can be added to the set. The set is generally implemented as a queue of states to process (though a given state must appear in one place only), and performing the corresponding operation depending on what kind of state it is.
Pseudocode
(Adapted from http://books.google.co.uk/books?id=fZmj5UNK8AQC&q=earley#search_anchor)function EARLEY-PARSE(words, grammar)
ENQUEUE((γ → •S, 0), chart[0])
for i ← from 0 to LENGTH(words) do
for each state in chart[i] do
if INCOMPLETE?(state) then
if NEXT-CAT(state) is a nonterminal then
PREDICTOR(state, i, grammar) // non-terminal
else do
SCANNER(state, i) // terminal
else do
COMPLETER(state, i)
end
end
return chart
procedure PREDICTOR((A → α•B, i), j, grammar),
for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
ADD-TO-SET((B → •γ, j), chart[ j])
end
procedure SCANNER((A → α•B, i), j),
if B ⊂ PARTS-OF-SPEECH(word[j]) then
ADD-TO-SET((B → word[j], i), chart[j + 1])
end
procedure COMPLETER((B → γ•, j), k),
for each (A → α•Bβ, i) in chart[j] do
ADD-TO-SET((A → αB•β, i), chart[k])
end
Example
Consider the following simple grammar for arithmetic expressions: ::= S # the start rule
::= "+"
With the input:
2 + 3 * 4
This is the sequence of state sets:
(state no.) Production (Origin) # Comment
-----------------------------------------
S(0): • 2 + 3 * 4
(1) P → • S (0) # start rule(2) S → • S + M (0) # predict from (1)
(3) S → • M (0) # predict from (1)
(4) M → • M * T (0) # predict from (3)
(5) M → • T (0) # predict from (3)
(6) T → • number (0) # predict from (5)
S(1): 2 • + 3 * 4
(1) T → number • (0) # scan from S(0)(6)(2) M → T • (0) # complete from (1) and S(0)(5)
(3) M → M • * T (0) # complete from (2) and S(0)(4)
(4) S → M • (0) # complete from (2) and S(0)(3)
(5) S → S • + M (0) # complete from (4) and S(0)(2)
(6) P → S • (0) # complete from (4) and S(0)(1)
S(2): 2 + • 3 * 4
(1) S → S + • M (0) # scan from S(1)(5)(2) M → • M * T (2) # predict from (1)
(3) M → • T (2) # predict from (1)
(4) T → • number (2) # predict from (3)
S(3): 2 + 3 • * 4
(1) T → number • (2) # scan from S(2)(4)(2) M → T • (2) # complete from (1) and S(2)(3)
(3) M → M • * T (2) # complete from (2) and S(2)(2)
(4) S → S + M • (0) # complete from (2) and S(2)(1)
(5) S → S • + M (0) # complete from (4) and S(0)(2)
(6) P → S • (0) # complete from (4) and S(0)(1)
S(4): 2 + 3 * • 4
(1) M → M * • T (2) # scan from S(3)(3)(2) T → • number (4) # predict from (1)
S(5): 2 + 3 * 4 •
(1) T → number • (4) # scan from S(4)(2)(2) M → M * T • (2) # complete from (1) and S(4)(1)
(3) M → M • * T (2) # complete from (2) and S(2)(2)
(4) S → S + M • (0) # complete from (2) and S(2)(1)
(5) S → S • + M (0) # complete from (4) and S(0)(2)
(6) P → S • (0) # complete from (4) and S(0)(1)
The state (P → S •, 0) represents a completed parse. This state also appears in S(3) and S(1), which are complete sentences.
See also
- CYK algorithmCYK algorithmThe Cocke–Younger–Kasami algorithm is a parsing algorithm for context-free grammars. It employs bottom-up parsing and dynamic programming....
- Context-free grammarContext-free grammarIn formal language theory, a context-free grammar is a formal grammar in which every production rule is of the formwhere V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals ....
- Parsing Algorithms
C Implementations
- 'early' An Earley parser CC (programming language)C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
-library. - 'C Earley Parser' An Earley parser CC (programming language)C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....
.
Java Implementations
- PEN A Java library that implements the Earley algorithm.
- Pep A Java library that implements the Earley algorithm and provides charts and parse trees as parsing artifacts.
- http://www.cs.umanitoba.ca/~comp4190/Earley/Earley.java A Java implementation of Earley parser.
Perl Implementations
- Marpa::XS and Marpa::PP, PerlPerlPerl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...
modules, incorporating improvements made to the Earley algorithm by Joop Leo, and by Aycock and Horspool. - Parse::Earley An PerlPerlPerl is a high-level, general-purpose, interpreted, dynamic programming language. Perl was originally developed by Larry Wall in 1987 as a general-purpose Unix scripting language to make report processing easier. Since then, it has undergone many changes and revisions and become widely popular...
module that implements Jay Earley's original algorithm.
Python Implementations
- Charty a PythonPython (programming language)Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
implementation of an Earley parser. - NLTK a PythonPython (programming language)Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
toolkit that has an Earley parser. - Spark an Object Oriented "little language framework" for PythonPython (programming language)Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
that implements an Earley parser.
Common Lisp Implementations
- CL-EARLEY-PARSER A Common Lisp library that implements an Earley parser.
Scheme/Racket Implementations
- Charty-Racket A Scheme / Racket implementation of an Earley parser.