Chomski
Chomski, also called the chomski virtual machine (named after the noted linguist Noam Chomsky) or pp (the pattern parser), is both a command-line language and the utility that interprets it, used to parse and transform text patterns. The utility reads its input sequentially, one character at a time, applies the operations specified on the command line or in a pp script, and then outputs the result. It has been developed since 2006 as a Unix and Windows utility and is available for Windows and Linux systems. Pp derives a number of ideas and syntax elements from sed, a command-line text stream editor.
Features
The chomski language uses many ideas taken from sed, the Unix stream editor. For example, sed includes two virtual variables or data buffers, known as the "pattern space" and the "hold space". These two variables constitute an extremely simple virtual machine. In the chomski language this virtual machine has been augmented with several new buffers or registers, along with a number of commands to manipulate them.
The chomski virtual machine includes a tape data structure as well as a stack, along with a "workspace" (the equivalent of the sed "pattern space") and a number of other buffers of lesser importance. This virtual machine is designed specifically for parsing formal languages. Parsing traditionally involves two phases: a lexical analysis phase and a formal grammar phase. During the lexical analysis phase a series of tokens is generated. These tokens are then used as the input to a set of formal grammar rules. The chomski virtual machine uses the stack to hold these tokens and uses the tape to hold their attributes. In a pp script the two phases, lexing and parsing, are combined in one script file. A series of command words is used to manipulate the different data structures of the virtual machine.
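The interplay of the stack and the tape can be illustrated with a small Python model. This is only a sketch under stated assumptions, not the chomski implementation: the class name Machine, the methods push and reduce, and the toy grammar (a number is one or more digits) are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """Toy model of the buffers described above; all names are hypothetical."""
    workspace: str = ""                         # plays the role of sed's pattern space
    stack: list = field(default_factory=list)   # holds parse tokens
    tape: list = field(default_factory=list)    # holds the attribute of each token

    def push(self, token: str, attribute: str) -> None:
        # Store the token on the stack and its attribute in the matching tape cell.
        self.stack.append(token)
        self.tape.append(attribute)

    def reduce(self, pattern: list, new_token: str) -> bool:
        # Apply one grammar rule: if the tokens in `pattern` are on top of the
        # stack, replace them with `new_token` and join their attributes.
        n = len(pattern)
        if n == 0 or self.stack[-n:] != pattern:
            return False
        joined = "".join(self.tape[-n:])
        del self.stack[-n:]
        del self.tape[-n:]
        self.push(new_token, joined)
        return True


def parse(text: str) -> Machine:
    m = Machine()
    for ch in text:                      # read the input one character at a time
        m.workspace = ch
        # Lexical phase: classify the character as a token.
        token = "digit" if ch.isdigit() else "other"
        m.push(token, m.workspace)
        m.workspace = ""
        # Grammar phase (toy rules): number <- number digit | digit
        m.reduce(["number", "digit"], "number") or m.reduce(["digit"], "number")
    return m
```

Both phases run inside the same loop, which mirrors the way a single script file combines lexing and parsing. Calling parse("42a") leaves the tokens number and other on the stack, with the attributes "42" and "a" in the corresponding tape cells.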
Purpose and Motivation
The purpose of the pp tool is to parse and transform text patterns. The text patterns conform to the rules of a formal language and include many context-free languages. Whereas traditional Unix tools (such as awk, sed, grep, etc.) process text one line at a time and use regular expressions to search or transform text, the pp tool processes text one character at a time and can use context-free grammars to transform (or compile) the text. However, in keeping with the Unix philosophy, the pp tool works on plain text streams, encoded according to the locale of the local computer, and produces another plain text stream as output, allowing it to be used as part of a standard pipeline.
The motivation for the creation of the pp tool and the chomski virtual machine was to allow the writing of parsing scripts, rather than having to resort to traditional parsing tools such as Lex and Yacc.
Usage
The following example shows a typical use of chomski, where the -s option indicates that the chomski expression follows:
cat inputFileName | chomski -s '/(/ { until ")"; print; } clear;' > outputFileName
In the above script, only text within round brackets (parentheses) would be saved in the output file.
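The behaviour of the bracket script can be approximated in Python as follows. This is a hedged sketch, not a translation verified against the chomski interpreter: it assumes that until keeps reading characters into the workspace until the workspace ends with the given text, and it keeps the enclosing brackets in the output, which the description above does not settle. The function name extract_parenthesised is hypothetical.

```python
import io


def extract_parenthesised(stream) -> str:
    """Emulate the assumed cycle: test /(/, until ")", print, clear."""
    out = []
    while True:
        workspace = stream.read(1)       # read one character into the workspace
        if workspace == "":
            break                        # end of input
        if workspace == "(":             # test: /(/
            # until ")": keep reading until the workspace ends with ")"
            while not workspace.endswith(")"):
                ch = stream.read(1)
                if ch == "":
                    break                # unterminated bracket: stop at end of input
                workspace += ch
            out.append(workspace)        # print;
        workspace = ""                   # clear;
    return "".join(out)


if __name__ == "__main__":
    print(extract_parenthesised(io.StringIO("a (b) c (de) f")))
```

Under these assumptions the sample input "a (b) c (de) f" yields "(b)(de)"; characters outside the brackets are discarded by the clear step.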
Under Unix (and Windows), chomski can be used as a filter in a pipeline:
generate_data | chomski -s '/x/{clear;add "y";}print;clear;'
That is, generate the data, and then make the small change of replacing x with y.
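The same per-character substitution can be sketched in Python. The sketch assumes that each cycle places exactly one input character in the workspace; the function name replace_x_with_y is hypothetical, and the comments map each step to the command it is meant to mirror.

```python
def replace_x_with_y(text: str) -> str:
    """Emulate the assumed cycle of '/x/{clear;add "y";}print;clear;'."""
    out = []
    for workspace in text:          # one character per cycle
        if workspace == "x":        # test: /x/
            workspace = "y"         # clear; add "y";
        out.append(workspace)       # print;
        # clear; the workspace is discarded before the next character is read
    return "".join(out)


if __name__ == "__main__":
    print(replace_x_with_y("box x"))
```

Because the test is applied to a single character at a time, this form of substitution only works for one-character patterns; longer patterns would need more of the machine's buffers.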
Several commands can be put together in a file called, for example, substitute.chom and then be applied using the -f option to read the commands from the file:
cat inputFileName | chomski -f substitute.chom > outputFileName
Besides substitution, other forms of simple processing are possible. For example, the following uses the plus and count commands to count the number of lines in a file:
cat inputFileName | chomski -s '[-n]{plus;} <>{count;print;}'
This example used some of the following metacharacters and language features:
- The square brackets ([]) indicate the matching of a character class.
- The -n string matches a newline character.
- The <> string matches the end of the input stream (text file).
- The curly braces ({}) follow tests and group multiple statements.
- The semi-colon (;) terminates all statements.
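The counting logic of the line-count script can be sketched in Python. This models only the counter, assuming that plus increments it for every newline character and that count and print emit its value once the end of the input is reached; the function name count_newlines is hypothetical.

```python
def count_newlines(text: str) -> int:
    """Emulate the assumed logic of '[-n]{plus;} <>{count;print;}'."""
    counter = 0
    for ch in text:           # one character per cycle
        if ch == "\n":        # test: character class containing a newline
            counter += 1      # plus;
    return counter            # at end of stream: count; print;


if __name__ == "__main__":
    print(count_newlines("first\nsecond\n"))
```

As with most newline counters, a final line that lacks a trailing newline is not counted.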
Complex chomski constructs are possible, allowing it to serve as a simple but highly specialised programming language. Chomski has only one flow control statement (apart from the test structures <>, [], // etc.), namely the check command, which jumps back to the @@ label (no other labels are permitted).
History
The idea for chomski arose from the limitations of regular expression engines which use a line-by-line paradigm, and from the difficulty of parsing nested text patterns with regular expressions. Chomski evolved as a natural progression from the grep and sed commands. Development began in approximately 2006 and has continued sporadically.
Limitations
Chomski is not a general-purpose programming language. Like sed, it is designed for a limited type of usage. Chomski does not currently support Unicode strings, since the current implementation uses standard C character arrays. Chomski does not currently have a debugger for debugging complex scripts.