Stochastic context-free grammar
A stochastic context-free grammar (SCFG; also probabilistic context-free grammar, PCFG) is a context-free grammar in which each production is augmented with a probability. The probability of a derivation (parse) is then the product of the probabilities of the productions used in that derivation; thus some derivations are more consistent with the stochastic grammar than others. SCFGs extend context-free grammars in the same way that hidden Markov models extend regular grammars. SCFGs have applications in areas as diverse as natural language processing and the study of RNA molecules, and are a specialized form of weighted context-free grammars.
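To make the definition concrete, the following minimal Python sketch (invented for illustration, not taken from any particular implementation) stores a PCFG as a table mapping productions to probabilities, and computes a derivation's probability as the product of the probabilities of its rules:

    from functools import reduce

    # Hypothetical toy grammar: each production maps to its probability,
    # and the probabilities for each left-hand side sum to 1.
    rules = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("John",)): 0.5,
        ("NP", ("Mary",)): 0.5,
        ("VP", ("sleeps",)): 1.0,
    }

    def derivation_probability(derivation):
        # The probability of a derivation is the product of the
        # probabilities of the productions used in it.
        return reduce(lambda p, rule: p * rules[rule], derivation, 1.0)

    print(derivation_probability([
        ("S", ("NP", "VP")), ("NP", ("John",)), ("VP", ("sleeps",)),
    ]))  # 1.0 * 0.5 * 1.0 = 0.5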
Techniques
A variant of the CYK algorithm finds the Viterbi parse of a sequence for a given SCFG. The Viterbi parse is the most likely derivation (parse) of the sequence under the given SCFG.
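As a rough sketch of how such a variant works (the grammar, names, and sentence below are invented for illustration, and the grammar is assumed to be in Chomsky normal form), the chart stores the best log-probability for each span and non-terminal, maximizing over split points and rules:

    import math
    from collections import defaultdict

    # Hypothetical toy grammar in Chomsky normal form: binary rules
    # A -> B C and lexical rules A -> w, each with a probability.
    binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
    lexical = {("NP", "she"): 0.6, ("NP", "fish"): 0.4, ("V", "eats"): 1.0}

    def viterbi_parse(words):
        # best[(i, j, A)] = log-probability of the most likely derivation
        # of words[i:j] from non-terminal A; back stores the winning
        # split point and rule for recovering the parse tree.
        n = len(words)
        best = defaultdict(lambda: -math.inf)
        back = {}
        for i, w in enumerate(words):
            for (A, word), p in lexical.items():
                if word == w:
                    best[(i, i + 1, A)] = math.log(p)
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for (A, B, C), p in binary.items():
                        score = math.log(p) + best[(i, k, B)] + best[(k, j, C)]
                        if score > best[(i, j, A)]:
                            best[(i, j, A)] = score
                            back[(i, j, A)] = (k, B, C)
        return best[(0, n, "S")], back

    logp, back = viterbi_parse(["she", "eats", "fish"])
    print(math.exp(logp))  # ~0.24: the parse S(NP she)(VP (V eats)(NP fish))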
The Inside-Outside algorithm is an analogue of the Forward-Backward algorithm, and can be used to compute the total probability of all derivations that are consistent with a given sequence, based on some SCFG. This is equivalent to the probability of the SCFG generating the sequence, and is intuitively a measure of how consistent the sequence is with the given grammar.
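A sketch of the corresponding inside computation, reusing the toy `binary` and `lexical` tables from the previous sketch: the recurrence is identical to the Viterbi chart except that it sums over analyses instead of maximizing, so the entry for the whole string under S is the probability of the grammar generating the sentence:

    from collections import defaultdict

    def inside_chart(words):
        # inside[(i, j, A)] = total probability that A derives words[i:j],
        # summed over all derivations rather than maximized.
        n = len(words)
        inside = defaultdict(float)
        for i, w in enumerate(words):
            for (A, word), p in lexical.items():
                if word == w:
                    inside[(i, i + 1, A)] += p
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for (A, B, C), p in binary.items():
                        inside[(i, j, A)] += p * inside[(i, k, B)] * inside[(k, j, C)]
        return inside

    words = ["she", "eats", "fish"]
    print(inside_chart(words)[(0, len(words), "S")])  # ~0.24: this sentence
    # has only one parse, so the sum equals the Viterbi value.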
The Inside-Outside algorithm can also be used to compute the probabilities that a given production will be used in a random derivation of a sequence. This is used as part of an expectation-maximization algorithm to learn the maximum-likelihood probabilities for an SCFG based on a set of training sequences that the SCFG should model. The algorithm is analogous to that used by hidden Markov models.
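Continuing the same hypothetical setup, the sketch below shows the expectation step for the binary rules: an outside chart is filled from the widest spans down, and each rule's expected usage count is accumulated from the product of its probability with the relevant inside and outside values. Dividing each count by the total count for its left-hand side (the maximization step) gives the re-estimated rule probabilities.

    from collections import defaultdict

    # Reuses `binary`, `lexical`, and inside_chart from the sketches above.
    def expected_binary_counts(words):
        n = len(words)
        inside = inside_chart(words)
        Z = inside[(0, n, "S")]          # total probability of the sentence
        outside = defaultdict(float)     # outside[(i, j, A)]
        outside[(0, n, "S")] = 1.0
        for span in range(n, 1, -1):     # widest spans first
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for (A, B, C), p in binary.items():
                        o = outside[(i, j, A)]
                        outside[(i, k, B)] += p * o * inside[(k, j, C)]
                        outside[(k, j, C)] += p * o * inside[(i, k, B)]
        counts = defaultdict(float)      # expected uses of each binary rule
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for (A, B, C), p in binary.items():
                        counts[(A, B, C)] += (p * outside[(i, j, A)]
                                              * inside[(i, k, B)]
                                              * inside[(k, j, C)]) / Z
        return counts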
Natural language processing
Context-free grammars were originally conceived in an attempt to model natural languages, i.e. those normally spoken by humans. Such grammars were conceived as sets of absolute rules (productions), typically specified in a syntax such as Backus-Naur Form; grammars of this kind are still appropriate for numerous situations, notably the design of programming languages. Probabilistic CFGs are an attempt to deal with the difficulties encountered when extending CFGs to the complexities of natural languages. Specifically, one often finds that more than one production rule may apply to a sequence of words, resulting in a conflict: an ambiguous parse. This happens for several reasons, such as homographs (identically spelled words with multiple meanings), so that the same word sequence can have more than one interpretation. A pun is an example of an ambiguous parse used deliberately to humorous effect, as in the newspaper headline "Iraqi Head Seeks Arms".
One strategy for dealing with ambiguous parses (originating with grammarians as early as Pāṇini) is to add yet more rules, or to prioritize them so that one rule takes precedence over others. This, however, has the drawback of proliferating the rules, often to the point where they become hard to manage. Another difficulty is overgeneration, where unlicensed structures are also generated. Probabilistic grammars circumvent these problems by using the frequency of the various productions to order them, resulting in a "most likely" (winner-take-all) interpretation. As usage patterns are altered in diachronic shifts, these probabilistic rules can be re-learned, thus updating the grammar.
One may construct a probabilistic grammar from a traditional formal syntax by assigning each non-terminal a probability taken from some distribution, to be eventually estimated from usage data. On most samples of broad language, probabilistic grammars that tune these probabilities from data typically outperform hand-crafted grammars (although some rule-based grammars are now approaching the accuracy of PCFGs).
Here is a tiny example of a two-rule PCFG. Each rule is preceded by a probability that reflects the relative frequency with which it occurs.
- 0.7 VP → V NP
- 0.3 VP → V NP NP
Given this grammar, we can now say that the expected number of NPs produced while deriving a VP is 0.7 × 1 + 0.3 × 2 = 1.3.
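The same expectation, spelled out in Python for concreteness (rule names as in the example above):

    vp_rules = {("V", "NP"): 0.7, ("V", "NP", "NP"): 0.3}
    expected_nps = sum(p * rhs.count("NP") for rhs, p in vp_rules.items())
    print(expected_nps)  # 1.3 (up to floating-point rounding)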
The probabilities are typically computed by machine learning programs that operate (learn or "train") on large volumes of annotated text ("corpora"), such as the Penn Treebank, which contains a large collection of articles from the Wall Street Journal. As such, a probabilistic grammar's validity is constrained by the context of the text used to train it; the definition of "most likely" is likely to vary with the context of discourse.
An example of a probabilistic CFG parser that has been trained using the Penn Treebank is the Stanford Statistical Parser of Klein and Manning, downloadable at http://nlp.stanford.edu/software/lex-parser.shtml
In particular, some speech recognition systems use SCFGs to improve their probability estimates and thereby their performance.
RNA
Context-free grammars are adept at modeling the secondary structure of RNA.
Secondary structure involves nucleotides within a single-stranded RNA molecule that are complementary to each other, and therefore base-pair. This base pairing is biologically important to the proper function of the RNA molecule. Much of this base pairing can be represented in a context-free grammar (the major exception being pseudoknots).
For example, consider the following grammar, where a, c, g, u represent nucleotides and S is the start symbol (and the only non-terminal):
- S → aSu | cSg | gSc | uSa
This simple CFG represents an RNA molecule consisting entirely of two wholly complementary regions, in which only canonical complementary pairs are allowed (i.e. A-U and C-G).
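A small recognizer for this grammar can be sketched as follows. Note that the rule shown above has no terminating alternative, so the sketch assumes an implicit S → ε (empty string) production to let derivations bottom out:

    # Canonical pairs allowed by the four productions.
    pairs = {("a", "u"), ("c", "g"), ("g", "c"), ("u", "a")}

    def derivable(seq):
        # True if seq is generated by S -> aSu | cSg | gSc | uSa,
        # assuming an implicit S -> "" so the recursion can terminate.
        if seq == "":
            return True
        return (len(seq) >= 2
                and (seq[0], seq[-1]) in pairs
                and derivable(seq[1:-1]))

    print(derivable("ggcaugcc"))  # True: the pairs g-c, g-c, c-g, a-u nest
    print(derivable("gcau"))      # False: g...u is not a canonical pair here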
By attaching probabilities to more sophisticated CFGs, it is possible to model bases or base pairings that are more or less consistent with an expected RNA molecule pattern. SCFGs are used to model the patterns in RNA gene families in the Rfam database, and to search genome sequences for likely additional members of these families. SCFGs have also been used to find RNA genes using comparative genomics: in this work, homologs of a potential RNA gene in two related organisms were inspected using SCFG techniques to see whether their secondary structure is conserved. If it is, the sequence is likely to be an RNA gene, and the secondary structure is presumed to be conserved because of the functional needs of that RNA gene.
It has been shown that SCFGs can predict the secondary structure of an RNA molecule with accuracy similar to that of existing techniques; such SCFGs are used, for example, by the Stemloc program.
Linguistics and Psycholinguistics
With the publication of Gold's theorem in 1967, it was claimed that grammars for natural languages governed by deterministic rules could not be learned from positive instances alone. This was part of the argument from the poverty of the stimulus, presented in 1980 and implicit since the early works of Chomsky in the 1950s. Among other arguments, this led to the nativist (but still unproven) view that a form of grammar, including in some versions a complete conceptual lexicon, is hardwired from birth.
However, Gold's learnability result can easily be circumvented if we assume either that the learner learns a near-perfect approximation to the correct language, or that the learner learns from typical input rather than arbitrarily devious input. In fact, it has been proved that simply receiving input from a speaker who produces positive instances randomly, rather than according to a pre-specified devious plan, leads to identifiability in the limit with probability 1.
Recently, probabilistic grammars appear to have gained some cognitive plausibility. It is well known that there are degrees of difficulty in accessing different syntactic structures (e.g. the Accessibility Hierarchy for relative clauses). Probabilistic versions of Minimalist grammars have been used to compute information-theoretic entropy values, which appear to correlate well with psycholinguistic data on understandability and production difficulty.
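Purely as an illustration of the kind of quantity involved (the values computed in this line of work are entropies over full derivations, which are more involved), the Shannon entropy of a single non-terminal's rule distribution follows directly from its probabilities:

    import math

    def rule_entropy(probs):
        # Shannon entropy (in bits) of one non-terminal's rule distribution.
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(rule_entropy([0.7, 0.3]))  # ~0.881 bits, e.g. for the VP rules above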