Bottom-up parsing - AbsoluteAstronomy.com

Bottom-up parsing is a strategy for analyzing unknown information that attempts to identify the most fundamental units first, and then to infer higher-order structures from them. The name comes from the concept of a parse tree

Parse tree

A concrete syntax tree or parse tree or parsing treeis an ordered, rooted tree that represents the syntactic structure of a string according to some formal grammar. In a parse tree, the interior nodes are labeled by non-terminals of the grammar, while the leaf nodes are labeled by terminals of the...

, in which the most fundamental units are at the bottom, and structures composed of them are in successively higher layers, until at the top of the tree a single unit, or start symbol, comprises all of the information being analyzed.

Bottom-up parsing occurs in the analysis of both natural language

Natural language

In the philosophy of language, a natural language is any language which arises in an unpremeditated fashion as the result of the innate facility for language possessed by the human intellect. A natural language is typically used for communication, and may be spoken, signed, or written...

s and computer languages. It is contrasted with top-down parsing

Top-down parsing

Top-down parsing is a type of parsing strategy where in one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar...

, in which complex units are divided into simpler parts until all of the information is parsed.

Linguistics

In linguistics
Linguistics
Linguistics is the scientific study of human language. Linguistics can be broadly broken into three categories or subfields of study: language form, language meaning, and language in context....

, an example of bottom-up parsing would be analyzing a sentence

Sentence (linguistics)

In the field of linguistics, a sentence is an expression in natural language, and often defined to indicate a grammatical unit consisting of one or more words that generally bear minimal syntactic relation to the words that precede or follow it...

by identifying words first, and then using properties of the words to infer grammatical

Grammar

In linguistics, grammar is the set of structural rules that govern the composition of clauses, phrases, and words in any given natural language. The term refers also to the study of such rules, and this field includes morphology, syntax, and phonology, often complemented by phonetics, semantics,...

relations and phrase structures

Phrase structure rules

Phrase-structure rules are a way to describe a given language's syntax. They are used to break down a natural language sentence into its constituent parts namely phrasal categories and lexical categories...

to build a parse tree

Parse tree

of the complete sentence.

Example

The English

English language

English is a West Germanic language that arose in the Anglo-Saxon kingdoms of England and spread into what was to become south-east Scotland under the influence of the Anglian medieval kingdom of Northumbria...

sentence "John hit the ball" may be analysed by bottom-up parsing. In this case, the most fundamental units (ignoring conjugation

Grammatical conjugation

In linguistics, conjugation is the creation of derived forms of a verb from its principal parts by inflection . Conjugation may be affected by person, number, gender, tense, aspect, mood, voice, or other grammatical categories...

and declension

Declension

In linguistics, declension is the inflection of nouns, pronouns, adjectives, and articles to indicate number , case , and gender...

) are the words: "John", "hit", "the" and "ball".

A simplified grammar for English, sufficient for this sentence, is:

A sentence
Sentence (linguistics)
In the field of linguistics, a sentence is an expression in natural language, and often defined to indicate a grammatical unit consisting of one or more words that generally bear minimal syntactic relation to the words that precede or follow it...

can be a noun phrase
Noun phrase
In grammar, a noun phrase, nominal phrase, or nominal group is a phrase based on a noun, pronoun, or other noun-like word optionally accompanied by modifiers such as adjectives....

followed by a verb phrase
Verb phrase
In linguistics, a verb phrase or VP is a syntactic unit composed of at least one verb and the dependents of that verb. One can distinguish between two types of VPs, finite VPs and non-finite VPs . While phrase structure grammars acknowledge both, dependency grammars reject the existence of a...

(S → NP VP)
A noun phrase can be a noun
Noun
In linguistics, a noun is a member of a large, open lexical category whose members can occur as the main word in the subject of a clause, the object of a verb, or the object of a preposition .Lexical categories are defined in terms of how their members combine with other kinds of...

by itself (NP → N)
A noun phrase can also be a determiner
Determiner (class)
A determiner is a noun-modifier that expresses the reference of a noun or noun-phrase in the context, rather than attributes expressed by adjectives...

followed by a noun (NP → Det N)
A verb phrase can be a verb
Verb
A verb, from the Latin verbum meaning word, is a word that in syntax conveys an action , or a state of being . In the usual description of English, the basic form, with or without the particle to, is the infinitive...

followed by a noun phrase (VP → V NP)
Nouns, determiners and verbs are all words (N → word, Det → word, V → word)

Proceeding bottom-up from these words, we find:

Rule used	Words parsed	Sentence now
5	In this case, "John" and "ball" are nouns, "the" (the definite article Definite Article Definite Article is the title of British comedian Eddie Izzard's 1996 performance released on VHS. It was recorded on different nights at the Shaftesbury Theatre... ) is a determiner, and "hit" is a verb.	N V Det N
3	"the ball" is a noun phrase.	N V NP
4	"hit the ball" is a verb phrase.	N VP
2	"John" is a noun phrase by itself.	NP VP
1	"John hit the ball" is a complete sentence.	S

Other parsings are possible; for example, "ball" could have been parsed as a noun phrase on its own. However, no other parsing can lead to a complete sentence using these simplified rules, as once "ball" is chosen to be a noun phrase, there is no rule that lets us parse "the". A brute-force search

Brute-force search

In computer science, brute-force search or exhaustive search, also known as generate and test, is a trivial but very general problem-solving technique that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem's...

can be used to find all possible parsings of the sentence; if more than one parsing is possible, the sentence is ambiguous

Ambiguity

Ambiguity of words or phrases is the ability to express more than one interpretation. It is distinct from vagueness, which is a statement about the lack of precision contained or available in the information.Context may play a role in resolving ambiguity...

-- not uncommon in natural languages.

Computer science

In programming language
Programming language
A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms precisely....

compilers, bottom-up parsing is a parsing method that works by identifying terminal symbols first, and combines them successively to produce nonterminals. The productions of the parser can be used to build a parse tree

Parse tree

of a program written in human-readable

Human-readable

A human-readable medium or human-readable format is a representation of data or information that can be naturally read by humans.In computing, human-readable data is often encoded as ASCII or Unicode text, rather than presented in a binary representation...

source code

Source code

In computer science, source code is text written using the format and syntax of the programming language that it is being written in. Such a language is specially designed to facilitate the work of computer programmers, who specify the actions to be performed by a computer mostly by writing source...

that can be compiled to assembly language

Assembly language

An assembly language is a low-level programming language for computers, microprocessors, microcontrollers, and other programmable devices. It implements a symbolic representation of the machine codes and other constants needed to program a given CPU architecture...

or pseudocode

Pseudocode

In computer science and numerical computation, pseudocode is a compact and informal high-level description of the operating principle of a computer program or other algorithm. It uses the structural conventions of a programming language, but is intended for human reading rather than machine reading...

.

Different computer languages require different parsing techniques, although it is not uncommon to use a parsing technique that is more powerful than that actually required.

It is common for bottom-up parsers to take the form of general parsing engines, which can either parse, or generate a parser for, a specific programming language

Programming language

A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms precisely....

given a specification of its grammar. Perhaps the most well known generalized parser generators are YACC

Yacc

The computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...

and GNU bison

GNU bison

GNU bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification of a context-free language, warns about any parsing ambiguities, and generates a parser which reads sequences of tokens and decides whether the sequence conforms to the syntax...

Example

This trivial grammar defines simple addition expressions:

A complete expression (a sum) can be an addend, the plus sign, and another addend (S → A "+" A)
An addend can be a letter (a variable) (A → letter)
An addend can also be a number (A → N)
An addend can also be a sum itself (A → S)
A number is one or more digits (N → digit+)

(The plus sign in rule 5 signifies repetition: "one or more digits". To distinguish it from this special use, the plus sign in rule 1 is placed inside quotation marks.)

The fundamental units of this grammar are single letters, the plus sign, and groups of one or more digits. So, in the input "a + x + 12", the fundamental units are "a", "+", "x", "+", and "12". A parsing strategy might look like this:

Rule used	Symbols parsed	Expression now
5	"12" is a number.	a + x + N
2, 3	"a", "x" and "12" are addends.	A + A + A
1	"a + x" is a complete sum.	S + A
4	The sum "a + x" is also an addend.	A + A
1	"a + x + 12" is a complete sum.	S

Note that even though a complete expression was found in the third step, there was still some input remaining, so the parsing had to continue.

As with the linguistics example above, other parsings are possible; specifically, at the third step, "x + 12" could also have been legitimately parsed as a sum. This would have also resulted in a successfully completed parsing; thus, the grammar given is ambiguous. Unlike natural languages, computer languages cannot be ambiguous, so bottom-up parsing for computer languages must have additional rules, such as "the leftmost possible replacement is the one chosen". Additionally, a computer language parser will rarely examine the entire input at once, since the input may be a very large computer program; almost all parsers read the input from start to end (usually called "left to right"). However, the ability to check one or more upcoming symbols ("lookahead") before applying a rule increases the expressive power of a grammar.

Type of bottom-up parsers

The common classes of bottom-up parsers are:

LR parser
LR parser
In computer science, an LR parser is a parser that reads input from Left to right and produces a Rightmost derivation. The term LR parser is also used; where the k refers to the number of unconsumed "look ahead" input symbols that are used in making parsing decisions...
- LR(0) - No lookahead symbol
- SLR(1) - Simple with one lookahead symbol
- LALR(1)
  LALR parser
  In computer science, an LALR parser is a type of LR parser based on a finite-state-automata concept. The data structure used by an LALR parser is a pushdown automaton...
  
  - Lookahead bottom up, not as powerful as full LR(1) but simpler to implement. YACC
  Yacc
  The computer program yacc is a parser generator developed by Stephen C. Johnson at AT&T for the Unix operating system. The name is an acronym for "Yet Another Compiler Compiler." It generates a parser based on an analytic grammar written in a notation similar to BNF.Yacc used to be available as...
  
  deals with this kind of grammar.
- LR(1)
  Canonical LR parser
  A canonical LR parser or LR parser is an LR parser whose parsing tables are constructed in a similar way as with LR parsers except that the items in the item sets also contain a lookahead, i.e., a terminal that is expected by the parser after the right-hand side of the rule...
  
  - Most general grammar, but most complex to implement.
- LR() - (where is a positive integer) indicates an LR parser with lookahead symbols; while grammars can be designed that require more than 1 lookahead, practical grammars try to avoid this because increasing can theoretically require exponentially more code and data space (in practice, this may not be as bad). Also, the class of LR() languages is the same as that of LR(1) languages.
Precedence parsers
- Simple precedence parser
  Simple precedence parser
  In computer science, a Simple precedence parser is a type of bottom-up parser for context-free grammars that can be used only by Simple precedence grammars....
- Operator-precedence parser
  Operator-precedence parser
  An operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar. For example, most calculators use operator precedence parsers to convert from the human-readable infix notation with order of operations format into an internally optimized computer-readable format...
- Extended precedence parser

Bottom-up parsing can also be done by backtracking

Backtracking

Backtracking is a general algorithm for finding all solutions to some computational problem, that incrementally builds candidates to the solutions, and abandons each partial candidate c as soon as it determines that c cannot possibly be completed to a valid solution.The classic textbook example...

Shift-reduce parsers

The most common bottom-up parsers are the shift-reduce parsers. These parsers examine the input tokens and either shift (push) them onto a stack

Stack

-Mathematics:* Stack , general category-theoretical concept to formalise "pull-back" operations in geometry and algebra* Algebraic stack, a generalisation of scheme and algebraic space in algebraic geometry; a specific type of the above-Computers:...

or reduce elements at the top of the stack, replacing a right-hand side by a left-hand side.

Action table

Often an action (or parse) table is constructed which helps the parser determine what to do next. The following is a description of what can be held in an action table.

Actions

Shift - push token onto stack
Reduce - remove handle from stack and push on corresponding nonterminal
Accept - recognize sentence when stack contains only the distinguished symbol and input is empty
Error - happens when none of the above is possible; means original input was not a sentence

Shift and reduce

A shift-reduce parser uses a stack to hold the grammar symbols while awaiting reduction. During the operation of the parser, symbols from the input are shifted onto the stack. If a prefix of the symbols on top of the stack matches the RHS of a grammar rule which is the correct rule to use within the current context, then the parser reduces the RHS of the rule to its LHS, replacing the RHS symbols on top of the stack with the nonterminal occurring on the LHS of the rule. This shift-reduce process continues until the parser terminates, reporting either success or failure. It terminates with success when the input is legal and is accepted by the parser. It terminates with failure if an error is detected in the input.

The parser is a stack automaton which is in one of several discrete states. In reality, in the case of LR parsing, the parse stack contains states, rather than grammar symbols. However, since each state corresponds to a unique grammar symbol, the state stack can be mapped onto the grammar symbol stack mentioned earlier.

Algorithm: Shift-reduce parsing

Start with the sentence to be parsed as the initial sentential form
Until the sentential form is the start symbol do:
1. Scan through the input until we recognise something that corresponds to the RHS of one of the production rules (this is called a handle)
2. Apply a production rule in reverse; i.e., replace the RHS of the rule which appears in the sentential form with the LHS of the rule (an action known as a reduction)

In step 2.1 above we are "shifting" the input symbols to one side as we move through them; hence a parser which operates by repeatedly applying steps 2.1 and 2.2 above is known as a shift-reduce parser.

A shift-reduce parser is most commonly implemented using a stack, where we proceed as follows:

start with an empty stack
a "shift" action corresponds to pushing the current input symbol onto the stack
a "reduce" action occurs when we have a handle on top of the stack. To perform the reduction, we pop the handle off the stack and replace it with the nonterminal on the LHS of the corresponding rule.

Example

Take the grammar:



 Sentence   --> NounPhrase VerbPhrase

 NounPhrase --> Art Noun

 VerbPhrase --> Verb | Adverb Verb

 Art        --> the | a | ...

 Verb       --> jumps | sings | ...

 Noun       --> dog | cat | ...

And the input:

the dog jumps

Then the bottom up parsing is:



Stack                  Input Sequence

                     (the dog jumps)

(the)                  (dog jumps)      SHIFT word onto stack

(Art)                  (dog jumps)      REDUCE using grammar rule

(Art dog)              (jumps)          SHIFT..

(Art Noun)             (jumps)          REDUCE..

(NounPhrase)           (jumps)          REDUCE

(NounPhrase jumps)                    SHIFT

(NounPhrase Verb)                     REDUCE

(NounPhrase VerbPhrase)               REDUCE

(Sentence)                            SUCCESS

External links

Parsing Simulator This simulator is used to generate examples of shift-reduce parsing
An example of shift-reduce parsing (which is a type of bottom up parsing), with a small grammar, state diagram, and C language code to implement the parser
Course notes on shift reduce parsing
A good non-technical tutorial in the context of natural (human) languages (Instead use an archived version of the page)
A discussion of shift-reduce conflicts in bottom up parsers. A knowledgeable but technical article.
Yet another bottom-up parsing illustration

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.