Stochastic context-free grammar

A stochastic context-free grammar (SCFG) is a context-free grammar in which each production is augmented with a probability. The probability of a derivation (parse) is then the product of the probabilities of the productions used in that derivation; thus some derivations are more consistent with the stochastic grammar than others. SCFGs extend context-free grammars in the same way that hidden Markov models extend regular grammars. SCFGs have application in two areas: Natural language processing and the study of RNA molecules within the field of Bioinformatics.

Table of contents

1 Techniques
2 Applications

2.1 Natural language processing
2.2 RNA

3 References
4 External Links

Techniques

A variant of the CYK algorithm finds the Viterbi parse of a sequence for a given SCFG. The Viterbi parse is the most likely derivation (parse) of the sequence by the given SCFG.

The Inside and Outside algorithms are analogues of the Forward algorithm and Backward algorithm, and can be used to compute the total probability of all derivations that are consistent with a given sequence, based on some SCFG. This is equivalent to the probability of the SCFG generating the sequence, and is intuitively a measure of how consistent the sequence is with the given grammar.

The Inside/Outside algorithms can also be used to compute the probabilities that a given production will be used in a random derivation of a sequence. This is used as part of an Expectation maximization algorithm to learn the maximum likelihood probabilities for an SCFG based on a set of training sequences that the SCFG should model. The algorithm is analogous to that used by hidden Markov models.

Applications

Natural language processing

Context-free grammars were originally conceived in an attempt to model natural languages, i.e. those normally spoken by humans. Some research has extended this idea with SCFGs.

RNA

Context-free grammars are adept at modeling the secondary structure of RNA. Secondary structure involves nucleotides within a single-stranded RNA molecule that are complementary to each other, and therefore base pair. This base pairing is biologically important to the proper function of the RNA molecule. Much of this base pairing can be represented in a context-free grammar (the major exception being pseudoknots).

For example, consider the following grammar, where a,c,g,u represents nucleotides and S is the start symbol (and only non-terminal):

S --> aSu | cSg | gSc | uSa

This simple CFG represents an RNA molecule consisting entirely of two wholely complementary regions, in which only canonical complementary pairs are allowed (i.e. A-U and C-G).

By attaching probabilities to more sophisticated CFGs, it is possible to model bases or base pairings that are more or less consistent with an expected RNA molecule pattern. SCFGs are used to model the patterns in RNA gene families in the Rfam Database, and search genome sequences for likely additional members of these families. SCFGs have also been used to find RNA genes using comparative genomics. In this work, homologs of a potential RNA gene in two related organisms were inspected using SCFG techniques to see if their secondary structure is conserved. If it is, the sequence is likely to be an RNA gene, and the secondary structure is presumed to be conserved because of the functional needs of that RNA gene. It has been shown that SCFGs could predict the secondary structure of an RNA molecule similarly to existing techniques, although this application has not been widely adopted.

References

Durbin, Eddy, Krogh, Mitchison, Biological sequence analysis, Cambridge University Press, 1998. This bioinformatics textbook includes an accessible introduction to the use of SCFGs for modelling RNA, as well as the history of this application to 1998.
Elena Rivas and Sean R. Eddy (2001) "Noncoding RNA gene detection using comparative sequence analysis", BMC Bioinformatics, 2 (1): 8. This article in PubMed

External Links

Rfam Database