LL parser
An LL parser is a top-down parser for a subset of the context-free grammars. It parses the input from Left to right, and constructs a Leftmost derivation of the sentence (hence LL, compared with LR parser). The class of grammars which are parsable in this way is known as the LL grammars.
The remainder of this article describes the table-based kind of parser, the alternative being a recursive descent parser, which is usually coded by hand (although not always; see e.g. ANTLR for an LL(*) recursive-descent parser generator).
An LL parser is called an LL(k) parser if it uses k tokens of lookahead when parsing a sentence. If such a parser exists for a certain grammar and can parse sentences of this grammar without backtracking, then the grammar is called an LL(k) grammar. A language that has an LL(k) grammar is known as an LL(k) language. There are LL(k+n) languages that are not LL(k) languages. A corollary of this is that not all context-free languages are LL(k) languages.
LL(1) grammars are very popular because the corresponding LL parsers only need to look at the next token to make their parsing decisions. Languages based on grammars with a high value of k have traditionally been considered to be difficult to parse, although this is less true now given the availability and widespread use of parser generators supporting LL(k) grammars for arbitrary k.
An LL parser is called an LL(*) parser if it is not restricted to a finite k tokens of lookahead, but can make parsing decisions by recognizing whether the following tokens belong to a regular language (for example by use of a Deterministic Finite Automaton).
General case
The parser works on strings from a particular context-free grammar.
The parser consists of
- an input buffer, holding the input string (built from the grammar)
- a stack on which to store the terminals and non-terminals from the grammar yet to be parsed
- a parsing table which tells it what (if any) grammar rule to apply given the symbols on top of its stack and the next input token
The parser applies the rule found in the table by matching the top-most symbol on the stack (row) with the current symbol in the input stream (column).
When the parser starts, the stack already contains two symbols:
[ S, $ ]
where '$' is a special terminal to indicate the bottom of the stack and the end of the input stream, and 'S' is the start symbol of the grammar. The parser will attempt to rewrite the contents of this stack to what it sees on the input stream. However, it only keeps on the stack what still needs to be rewritten.
Set up
To explain its workings we will consider the following small grammar:
- (1) S → F
- (2) S → ( S + F )
- (3) F → a
and parse the following input:
- ( a + a )
The parsing table for this grammar looks as follows:
  | ( | ) | a | + | $ |
S | 2 | - | 1 | - | - |
F | - | - | 3 | - | - |
(Note that there is also a column for the special terminal, represented here as $, that is used to indicate the end of the input stream.)
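One way to hold such a table in a program is a map keyed by (nonterminal, input symbol) pairs. The following is only a minimal sketch: the names `Table`, `makeExampleTable` and `lookup` are hypothetical and not part of the article, and the value 0 is an assumed convention for an empty cell ('-').

```cpp
#include <map>
#include <utility>

// Parsing table for the example grammar:
//   (1) S -> F   (2) S -> ( S + F )   (3) F -> a
// Key: (nonterminal on top of the stack, next input symbol); value: rule number.
using Table = std::map<std::pair<char, char>, int>;

Table makeExampleTable() {
    return {
        {{'S', '('}, 2},
        {{'S', 'a'}, 1},
        {{'F', 'a'}, 3},
    };
}

// Returns the rule number stored in the cell, or 0 when the cell is empty ('-').
int lookup(const Table& table, char nonterminal, char input) {
    auto it = table.find({nonterminal, input});
    return it == table.end() ? 0 : it->second;
}
```

Storing only the non-empty cells keeps the sketch short; a dense two-dimensional array indexed by symbol codes would serve equally well.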
Parsing procedure
In each step, the parser reads the next-available symbol from the input stream, and the top-most symbol from the stack. If the input symbol and the stack-top symbol match, the parser discards them both, leaving only the unmatched symbols in the input stream and on the stack.
Thus, in its first step, the parser reads the input symbol '(' and the stack-top symbol 'S'. The parsing table instruction comes from the column headed by the input symbol '(' and the row headed by the stack-top symbol 'S'; this cell contains '2', which instructs the parser to apply rule (2). The parser has to rewrite 'S' to '( S + F )' on the stack and write the rule number 2 to the output. The stack then becomes:
[ (, S, +, F, ), $ ]
Since the '(' from the input stream did not match the top-most symbol, 'S', from the stack, it was not removed, and remains the next-available input symbol for the following step.
In the second step, the parser removes the '(' from its input stream and from its stack, since they match. The stack now becomes:
[ S, +, F, ), $ ]
Now the parser has an 'a' on its input stream and an 'S' as its stack top. The parsing table instructs it to apply rule (1) from the grammar and write the rule number 1 to the output stream. The stack becomes:
[ F, +, F, ), $ ]
The parser now has an 'a' on its input stream and an 'F' as its stack top. The parsing table instructs it to apply rule (3) from the grammar and write the rule number 3 to the output stream. The stack becomes:
[ a, +, F, ), $ ]
In the next two steps the parser reads the 'a' and '+' from the input stream and, since they match the next two items on the stack, also removes them from the stack. This results in:
[ F, ), $ ]
In the next three steps the parser will replace 'F' on the stack by 'a', write the rule number 3 to the output stream and remove the 'a' and ')' from both the stack and the input stream. The parser thus ends with '$' on both its stack and its input stream.
In this case the parser will report that it has accepted the input string and write the following list of rule numbers to the output stream:
- [ 2, 1, 3, 3 ]
This is indeed a list of rules for a leftmost derivation of the input string, which is:
- S → ( S + F ) → ( F + F ) → ( a + F ) → ( a + a )
Parser implementation in C++
Below follows a C++ implementation of a table-based LL parser for the example language:
Remarks
As the example shows, the parser performs one of three kinds of step, depending on whether the top of the stack is a nonterminal, a terminal or the special symbol $:
- If the top is a nonterminal, the parser looks up the parsing-table entry for this nonterminal and the current input symbol to find the grammar rule whose right-hand side should replace the nonterminal on the stack. The number of the rule is written to the output stream. If the table has no entry for this combination, the parser reports an error and stops.
- If the top is a terminal, the parser compares it to the symbol on the input stream; if they are equal, both are removed. If they are not equal, the parser reports an error and stops.
- If the top is $ and the input stream also holds $, the parser reports that it has successfully parsed the input; otherwise it reports an error. In both cases the parser stops.
These steps are repeated until the parser stops, and then it will have either completely parsed the input and written a leftmost derivation to the output stream or it will have reported an error.
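The three kinds of step can be sketched in C++ for the example grammar as follows. This is an illustrative sketch written for this section, not the article's original listing; the names `kRules`, `kTable`, `isNonterminal` and `parse` are hypothetical, and it assumes that every token is a single character and that the input contains no spaces.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Right-hand sides of the example grammar, indexed by rule number (index 0 unused):
//   (1) S -> F   (2) S -> ( S + F )   (3) F -> a
static const std::vector<std::string> kRules = {"", "F", "(S+F)", "a"};

// Parsing table: (stack-top nonterminal, input symbol) -> rule number.
static const std::map<std::pair<char, char>, int> kTable = {
    {{'S', '('}, 2}, {{'S', 'a'}, 1}, {{'F', 'a'}, 3}};

bool isNonterminal(char c) { return c == 'S' || c == 'F'; }

// Parses `input` and returns the rule numbers of the leftmost derivation.
// Throws std::runtime_error when the input is rejected.
std::vector<int> parse(std::string input) {
    input.push_back('$');                  // end-of-input marker
    std::vector<char> stack = {'$', 'S'};  // back() is the top of the stack
    std::vector<int> output;
    std::size_t pos = 0;

    while (true) {
        char top = stack.back();
        char next = input[pos];
        if (isNonterminal(top)) {
            // Nonterminal: consult the table and replace it by the rule's right-hand side.
            auto cell = kTable.find({top, next});
            if (cell == kTable.end())
                throw std::runtime_error("no rule for this stack top and input symbol");
            output.push_back(cell->second);
            stack.pop_back();
            const std::string& rhs = kRules[cell->second];
            // Push in reverse so that the first symbol of the rule ends up on top.
            for (auto it = rhs.rbegin(); it != rhs.rend(); ++it) stack.push_back(*it);
        } else if (top == '$') {
            // Bottom of the stack: accept only if the whole input has been consumed.
            if (next == '$' && pos + 1 == input.size()) return output;
            throw std::runtime_error("unexpected input after the end of the sentence");
        } else {
            // Terminal: it must match the input symbol; both are then discarded.
            if (top != next) throw std::runtime_error("terminal mismatch");
            stack.pop_back();
            ++pos;
        }
    }
}
```

Calling `parse("(a+a)")` yields the rule list 2, 1, 3, 3 from the worked example, while an input such as `"(a+)"` is rejected with an exception.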
Constructing an LL(1) parsing table
In order to fill the parsing table, we have to establish what grammar rule the parser should choose if it sees a nonterminal A on the top of its stack and a symbol a on its input stream. It is easy to see that such a rule should be of the form A → w and that the language corresponding to w should have at least one string starting with a. For this purpose we define the First-set of w, written here as Fi(w), as the set of terminals that can be found at the start of some string in w, plus ε if the empty string also belongs to w. Given a grammar with the rules A1 → w1, ..., An → wn, we can compute the Fi(wi) and Fi(Ai) for every rule as follows:
1. initialize every Fi(wi) and Fi(Ai) with the empty set
2. recompute Fi(wi) for every rule Ai → wi from the current sets, where Fi of a string is defined as follows:
   - Fi(a w' ) = { a } for every terminal a
   - Fi(A w' ) = Fi(A) for every nonterminal A with ε not in Fi(A)
   - Fi(A w' ) = (Fi(A) \ { ε }) ∪ Fi(w' ) for every nonterminal A with ε in Fi(A)
   - Fi(ε) = { ε }
3. add Fi(wi) to Fi(Ai) for every rule Ai → wi
4. repeat steps 2 and 3 until all Fi sets stay the same.
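The fixed-point iteration above can be sketched in C++ as follows. The encoding is an assumption made for illustration only: symbols are single characters, uppercase letters are nonterminals, an empty right-hand side is an ε-rule, and the character '#' (assumed not to be a terminal of the grammar) stands for ε inside the sets. The names `Rule`, `firstOfString` and `firstSets` are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Rule = std::pair<char, std::string>;  // left-hand side, right-hand side
const char kEps = '#';                      // stands for ε in the sets

bool isNonterminal(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }

// Fi(w) for a string w, using the current Fi sets of the nonterminals.
std::set<char> firstOfString(const std::string& w,
                             const std::map<char, std::set<char>>& fi) {
    std::set<char> result;
    for (char x : w) {
        if (!isNonterminal(x)) { result.insert(x); return result; }  // Fi(a w') = { a }
        bool nullable = false;
        auto it = fi.find(x);
        if (it != fi.end()) {
            for (char t : it->second) {
                if (t == kEps) nullable = true; else result.insert(t);
            }
        }
        if (!nullable) return result;  // Fi(A w') = Fi(A)
        // Otherwise continue with w': (Fi(A) \ { ε }) ∪ Fi(w').
    }
    result.insert(kEps);  // w is empty, or every symbol of w can vanish
    return result;
}

// Repeats the "recompute and add" steps until no Fi(A) changes.
std::map<char, std::set<char>> firstSets(const std::vector<Rule>& rules) {
    std::map<char, std::set<char>> fi;
    for (const Rule& r : rules) fi[r.first];  // start from empty sets
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Rule& r : rules) {
            std::set<char> add = firstOfString(r.second, fi);
            std::set<char>& target = fi[r.first];
            std::size_t before = target.size();
            target.insert(add.begin(), add.end());
            if (target.size() != before) changed = true;
        }
    }
    return fi;
}
```

For the example grammar, encoded as `{{'S', "F"}, {'S', "(S+F)"}, {'F', "a"}}`, this gives Fi(S) = { (, a } and Fi(F) = { a }.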
Unfortunately, the First-sets are not sufficient to compute the parsing table. This is because a right-hand side w of a rule might ultimately be rewritten to the empty string. So the parser should also use the rule A → w if ε is in Fi(w) and it sees on the input stream a symbol that could follow A. Therefore we also need the Follow-set of A, written as Fo(A) here, which is defined as the set of terminals a such that there is a string of symbols αAaβ that can be derived from the start symbol. Computing the Follow-sets for the nonterminals in a grammar can be done as follows:
1. initialize Fo(S) with { $ } and every other Fo(Ai) with the empty set
2. if there is a rule of the form Aj → wAiw' , then
   - if the terminal a is in Fi(w' ), then add a to Fo(Ai)
   - if ε is in Fi(w' ), then add Fo(Aj) to Fo(Ai)
3. repeat step 2 until all Fo sets stay the same.
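A possible C++ sketch of this iteration is shown below. It assumes the same illustrative encoding as before (single-character symbols, uppercase nonterminals, '#' for ε), takes the already computed First-sets of the nonterminals as input, and assumes that every nonterminal has an entry in them. It also places the end marker $ in the Follow-set of the start symbol, which is the usual convention and is needed for the $ column of the table. All names (`Sets`, `firstOfString`, `followSets`) are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Rule = std::pair<char, std::string>;  // left-hand side, right-hand side
using Sets = std::map<char, std::set<char>>;
const char kEps = '#';  // stands for ε
const char kEnd = '$';  // end-of-input marker

bool isNonterminal(char c) { return std::isupper(static_cast<unsigned char>(c)) != 0; }

// Fi(w) for a string w, from the finished Fi sets of the nonterminals.
std::set<char> firstOfString(const std::string& w, const Sets& fi) {
    std::set<char> result;
    for (char x : w) {
        if (!isNonterminal(x)) { result.insert(x); return result; }
        const std::set<char>& fx = fi.at(x);  // assumes x has an entry in fi
        for (char t : fx) if (t != kEps) result.insert(t);
        if (fx.count(kEps) == 0) return result;
    }
    result.insert(kEps);
    return result;
}

Sets followSets(const std::vector<Rule>& rules, char start, const Sets& fi) {
    Sets fo;
    for (const Rule& r : rules) fo[r.first];
    fo[start].insert(kEnd);  // assumption: $ may follow the start symbol
    bool changed = true;
    while (changed) {
        changed = false;
        for (const Rule& r : rules) {
            const std::string& rhs = r.second;
            for (std::size_t i = 0; i < rhs.size(); ++i) {
                if (!isNonterminal(rhs[i])) continue;
                // Rule Aj -> w Ai w' with Ai = rhs[i] and w' = rest of the rule.
                std::set<char>& target = fo[rhs[i]];
                std::size_t before = target.size();
                std::set<char> rest = firstOfString(rhs.substr(i + 1), fi);
                for (char t : rest) if (t != kEps) target.insert(t);
                if (rest.count(kEps) != 0) {
                    std::set<char> lhsFollow = fo[r.first];  // copy: may alias target
                    target.insert(lhsFollow.begin(), lhsFollow.end());
                }
                if (target.size() != before) changed = true;
            }
        }
    }
    return fo;
}
```

For the grammar E → T Z, Z → + T Z | ε, T → a, this yields Fo(E) = { $ }, Fo(Z) = { $ } and Fo(T) = { +, $ }.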
Now we can define exactly which rules will be contained where in the parsing table. If T[A, a] denotes the entry in the table for nonterminal A and terminal a, then
- T[A,a] contains the rule A → w if and only if
  - a is in Fi(w) or
  - ε is in Fi(w) and a is in Fo(A).
If the table contains at most one rule in every one of its cells, then the parser will always know which rule it has to use and can therefore parse strings without backtracking. It is in precisely this case that the grammar is called an LL(1) grammar.
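The table definition and the LL(1) test can be sketched together in C++. In this illustrative sketch each cell holds a list of rule numbers, so that a cell with more than one entry reveals a conflict. The First-set of each right-hand side and the Follow-sets are assumed to have been computed beforehand, '#' stands for ε, and the names `Rule`, `Table`, `buildTable` and `isLL1` are hypothetical.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

struct Rule {
    char lhs;              // nonterminal A
    std::string rhs;       // right-hand side w
    std::set<char> first;  // Fi(w), computed beforehand; '#' stands for ε
};
using Table = std::map<std::pair<char, char>, std::vector<int>>;
const char kEps = '#';

// T[A, a] receives rule number i (1-based) if a is in Fi(w),
// or if ε is in Fi(w) and a is in Fo(A).
Table buildTable(const std::vector<Rule>& rules,
                 const std::map<char, std::set<char>>& follow) {
    Table table;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        const Rule& r = rules[i];
        int number = static_cast<int>(i) + 1;
        for (char a : r.first)
            if (a != kEps) table[{r.lhs, a}].push_back(number);
        if (r.first.count(kEps) != 0) {
            auto it = follow.find(r.lhs);
            if (it == follow.end()) continue;
            for (char a : it->second)
                // Skip symbols already entered via Fi(w) to avoid listing a rule twice.
                if (r.first.count(a) == 0) table[{r.lhs, a}].push_back(number);
        }
    }
    return table;
}

// The grammar is LL(1) exactly when no cell holds more than one rule.
bool isLL1(const Table& table) {
    for (const auto& cell : table)
        if (cell.second.size() > 1) return false;
    return true;
}
```

For the example grammar the three non-empty cells each hold a single rule, so `isLL1` returns true. For a grammar such as S → A a b, A → a | ε, the cell T[A, a] receives both rules for A, and the check fails.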
Constructing an LL(k) parsing table
Until the mid 1990s, it was widely believed that LL(k) parsing (for k > 1) was impractical, since the parse table would have exponential size in k in the worst case. This perception changed gradually after the release of PCCTS around 1992, when it was demonstrated that many programming languages can be parsed efficiently by an LL(k) parser without triggering the worst-case behavior of the parser. Moreover, in certain cases LL parsing is feasible even with unlimited lookahead. By contrast, traditional parser generators, like yacc, use LALR(1) parse tables to construct a restricted LR parser with a fixed one-token lookahead.
Conflicts
As described in the introduction, LL(1) parsers recognize languages that have LL(1) grammars, which are a special case of context-free grammars (CFGs); LL(1) parsers cannot recognize all context-free languages. The LL(1) languages are exactly those recognized by deterministic pushdown automata restricted to a single state. In order for a CFG to be an LL(1) grammar, certain conflicts must not arise, which we describe in this section.
Terminology
Let A be a non-terminal. FIRST(A) is (defined to be) the set of terminals that can appear in the first position of any string derived from A. FOLLOW(A) is the set of terminals that can appear immediately after A in some string derived from the start symbol; among others, it contains the terminals of FIRST(B) for every symbol B that immediately follows A in the right-hand side of a production rule.
LL(1) Conflicts
There are 3 types of LL(1) conflicts:
- FIRST/FIRST conflict
  The FIRST sets of two different grammar rules for the same non-terminal intersect.
- FIRST/FOLLOW conflict
  The FIRST and FOLLOW sets of a grammar rule overlap. With an ε in the FIRST set it is unknown which alternative to select.
  An example of an LL(1) conflict:
  S -> A 'a' 'b'
  A -> 'a' | ε
  The FIRST set of A now is { 'a', ε } and the FOLLOW set { 'a' }.
- left-recursion
  Left recursion will cause a FIRST/FIRST conflict with all alternatives.
  E -> E '+' term | alt1 | alt2
Solutions to LL(1) Conflicts
- Left-factoring
A common left-factor is "factored out".
A -> X | X Y Z
becomes
A -> X B
B -> Y Z | ε
This can be applied when two alternatives start with the same symbol, as in a FIRST/FIRST conflict.
- Substitution
Substituting a rule into another rule to remove indirect or FIRST/FOLLOW conflicts.
Note that this may cause a FIRST/FIRST conflict.
- Left recursion removal
A simple example for left recursion removal:
The following production rule has left recursion on E
E -> E '+' T
-> T
This rule describes nothing but a list of T's separated by '+'; as a regular expression, T ('+' T)*.
So the rule could be rewritten as
E -> T Z
Z -> '+' T Z
-> ε
Now there is no left recursion and no conflicts on either of the rules.
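The rewriting step can be sketched in C++ for the general case of immediate left recursion, A → A α | β becoming A → β Z, Z → α Z | ε. This is a minimal sketch under stated assumptions: symbols are single characters, an empty right-hand side stands for ε, the fresh nonterminal is supplied by the caller, and indirect left recursion is not handled. The names `Rule` and `removeLeftRecursion` are hypothetical.

```cpp
#include <string>
#include <utility>
#include <vector>

using Rule = std::pair<char, std::string>;  // left-hand side, right-hand side ("" is ε)

// Removes immediate left recursion from the alternatives of nonterminal `a`:
//   a -> a α | β   becomes   a -> β z,   z -> α z | ε
// `z` must be a nonterminal that does not yet occur in the grammar.
std::vector<Rule> removeLeftRecursion(char a,
                                      const std::vector<std::string>& alternatives,
                                      char z) {
    std::vector<Rule> heads;  // rewritten non-recursive alternatives: a -> β z
    std::vector<Rule> tails;  // rules for the new nonterminal:        z -> α z
    for (const std::string& alt : alternatives) {
        if (!alt.empty() && alt[0] == a)
            tails.push_back({z, alt.substr(1) + z});
        else
            heads.push_back({a, alt + z});
    }
    if (tails.empty()) {
        // No left recursion present: return the alternatives unchanged.
        std::vector<Rule> unchanged;
        for (const std::string& alt : alternatives) unchanged.push_back({a, alt});
        return unchanged;
    }
    tails.push_back({z, ""});  // z -> ε ends the repetition
    heads.insert(heads.end(), tails.begin(), tails.end());
    return heads;
}
```

Applied to the alternatives `"E+T"` and `"T"` of E with the fresh nonterminal Z, the function returns the rules E → T Z, Z → + T Z and Z → ε shown above.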
However, not all CFGs have an equivalent LL(k)-grammar, e.g.:
S -> A | B
A -> 'a' A 'b' | ε
B -> 'a' B 'b' 'b' | ε
It can be shown that there does not exist any LL(k)-grammar accepting the language generated by this grammar.
See also
- Comparison of parser generators
- Parse tree
- Top-down parsing
- Bottom-up parsing
External links
- A tutorial on implementing LL(1) parsers in C#
- Parsing Simulator: this simulator is used to generate LL(1) parsing tables and to resolve the exercises of the book.
- LL(1) DSL PEG parser (toolkit framework)