Delimiter
A delimiter is a sequence of one or more characters used to specify the boundary between separate, independent regions in plain text or other data streams. An example of a delimiter is the comma character, which acts as a field delimiter in a sequence of comma-separated values.

Delimiters represent one of various means to specify boundaries in a data stream. Declarative notation, for example, is an alternate method that uses a length field at the start of a data stream to specify the number of characters that the data stream contains.
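
As a minimal sketch of declarative notation, the following Python fragment prefixes each payload with its character count. The function names and the "length:payload" layout are assumptions chosen for illustration, not part of any particular standard. Note that the colon only terminates the length field; the payload itself may contain any character, including commas, colons and line breaks.

```python
def encode_length_prefixed(payload: str) -> str:
    """Prefix the payload with its character count and a colon (hypothetical layout)."""
    return f"{len(payload)}:{payload}"


def decode_length_prefixed(stream: str) -> tuple[str, str]:
    """Read the length field first, then return (payload, rest_of_stream)."""
    length_text, _, remainder = stream.partition(":")
    length = int(length_text)
    return remainder[:length], remainder[length:]


# Two payloads back to back; the first contains a comma and a line break.
message = encode_length_prefixed("a,b\nc") + encode_length_prefixed("tail")
first, rest = decode_length_prefixed(message)
second, rest = decode_length_prefixed(rest)
print(repr(first), repr(second))
```

Because the reader knows in advance how many characters to consume, no character inside the payload can be mistaken for a boundary.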

Overview

Delimiters can be broken down into:
  • Field and record delimiters; and
  • Bracket delimiters.

Field and record delimiters

Field delimiters separate data fields. Record delimiters separate groups of fields.

For example, the CSV file format uses a comma as the delimiter between fields, and an end-of-line indicator as the delimiter between records. For instance:

fname,lname,age,salary
nancy,davolio,33,$30000
erin,borakova,28,$25250
tony,raphael,35,$28700

specifies a simple flat file database table using the CSV file format.
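
The two levels of delimiting in the table above can be sketched in Python as follows. This is a deliberately naive reader (the variable names are illustrative) that splits on line ends for records and on commas for fields; it assumes no field contains either delimiter.

```python
data = (
    "fname,lname,age,salary\n"
    "nancy,davolio,33,$30000\n"
    "erin,borakova,28,$25250\n"
    "tony,raphael,35,$28700\n"
)

# The end-of-line indicator delimits records; the comma delimits fields.
records = [line.split(",") for line in data.splitlines()]
header, rows = records[0], records[1:]

# Pair each field with its column name from the header record.
table = [dict(zip(header, row)) for row in rows]
print(table[0]["lname"])
```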

Bracket delimiters

Bracket delimiters (also block delimiters, region delimiters or balanced delimiters) mark both the start and end of a region of text.

Common examples of bracket delimiters include:

Delimiters   Description
( and )      Parentheses. The Lisp programming language syntax is cited as recognizable primarily by its use of parentheses.
{ and }      Curly brackets.
< and >      Angle brackets.
" and "      Commonly used to denote string literals.
' and '      Commonly used to denote string literals.
<? and ?>    Used to indicate XML processing instructions.
/* and */    Used to denote comments in some programming languages.
<% and %>    Used in some web templates to specify language boundaries. These are also called template delimiters.

Conventions

Computing platforms historically use certain delimiters by convention. The following tables depict just a few examples for comparison.

Programming languages
(See also: Comparison of programming languages (syntax).)

        String literal            End of statement
Pascal  singlequote               semicolon
C       doublequote, singlequote  semicolon


Field and record delimiters
(See also: ASCII, Control character.)

Platform                                 End of field                          End of record                           End of file
Unix, Mac OS X, Amiga OS                 Tab                                   LF
Windows, MS-DOS, OS/2, CP/M              Tab                                   CRLF                                    Control-Z
Classic Mac OS, AppleDOS, ProDOS, GS/OS  Tab                                   CR
ASCII/Unicode                            UNIT SEPARATOR, position 31 (U+001F)  RECORD SEPARATOR, position 30 (U+001E)  FILE SEPARATOR, position 28 (U+001C)

Delimiter collision

Delimiter collision is a problem that occurs when an author or programmer introduces delimiters into text without actually intending them to be interpreted as boundaries between separate regions. In the case of XML, for example, this can occur whenever an author attempts to specify an angle bracket character. In most file types there is both a field delimiter and a record delimiter, both of which are subject to collision. In the case of comma-separated values files, for example, field delimiter collision can occur whenever an author attempts to include a comma as part of a field value (e.g., salary = "$30,000"), and record delimiter collision would occur whenever a field contained multiple lines. Both record and field delimiter collision occur frequently in text files.
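
The field collision described above can be reproduced with a short Python sketch. The sample lines are illustrative. A naive split on commas breaks the salary into two fields, whereas the standard library's csv module treats a quoted field as a single value.

```python
import csv
import io

# A salary written with a thousands separator collides with the field delimiter.
line = "nancy,davolio,33,$30,000"
naive_fields = line.split(",")
print(len(naive_fields))  # five fields instead of the intended four

# Quoting the field lets a CSV-aware reader keep the comma as data.
quoted_line = 'nancy,davolio,33,"$30,000"'
parsed_fields = next(csv.reader(io.StringIO(quoted_line)))
print(parsed_fields[3])
```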

In some contexts, a malicious user or attacker may seek to exploit this problem intentionally. Consequently, delimiter collision can be the source of security vulnerabilities and exploits. Malicious users can take advantage of delimiter collision in languages such as SQL and HTML to deploy such well-known attacks as SQL injection and cross-site scripting, respectively.
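
As a hedged illustration of how a quote character in user input collides with SQL string delimiters, the following Python sketch uses the standard sqlite3 module with an in-memory database. The table, column names and input string are hypothetical. The first query splices the input into the statement text; the second passes it as a bound parameter so that it is never parsed as SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (fname TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)",
    [("nancy", 30000), ("erin", 25250)],
)

# Hypothetical attacker input: the single quote ends the string literal early.
user_input = "x' OR '1'='1"

# Unsafe: the input's quotes are interpreted as SQL string delimiters.
unsafe_sql = f"SELECT fname FROM staff WHERE fname = '{user_input}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safer: a placeholder keeps the input as data, so its quotes delimit nothing.
safe_rows = conn.execute(
    "SELECT fname FROM staff WHERE fname = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```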

Solutions

Because delimiter collision is a very common problem, various methods for avoiding it have been invented. Some authors may attempt to avoid the problem by choosing a delimiter character (or sequence of characters) that is not likely to appear in the data stream itself. This ad hoc approach may be suitable, but it necessarily depends on a correct guess of what will appear in the data stream, and offers no security against malicious collisions. Other, more formal conventions are therefore applied as well.

ASCII delimited text

The ASCII and Unicode character sets were designed to solve this problem by providing non-printing characters that can be used as delimiters. These are the range from ASCII 28 (File Separator) to ASCII 31 (Unit Separator). Using ASCII 31 (Unit Separator) as the field separator and ASCII 30 (Record Separator) as the record separator avoids collisions with the commas, tabs and line breaks that ordinarily appear in a text data stream.
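
A minimal Python sketch of ASCII delimited text follows. The helper names dump and load are assumptions made for illustration. Fields may contain commas and line breaks freely, on the assumption that the data never contains the separator characters themselves.

```python
UNIT_SEPARATOR = "\x1f"    # ASCII 31, placed between fields
RECORD_SEPARATOR = "\x1e"  # ASCII 30, placed between records


def dump(records):
    """Join fields with the unit separator and records with the record separator."""
    return RECORD_SEPARATOR.join(
        UNIT_SEPARATOR.join(fields) for fields in records
    )


def load(text):
    """Split on the record separator first, then on the unit separator."""
    return [record.split(UNIT_SEPARATOR) for record in text.split(RECORD_SEPARATOR)]


records = [["nancy", "$30,000"], ["erin", "line one\nline two"]]
encoded = dump(records)
print(load(encoded) == records)
```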

Escape character

One method for avoiding delimiter collision is to use escape characters. From a language design standpoint, these are adequate, but they have drawbacks:
  • text can be rendered unreadable when littered with numerous escape characters, a problem referred to as leaning toothpick syndrome (due to use of \ to escape / in Perl regular expressions, leading to sequences such as "\/\/");
  • text becomes difficult to parse through regular expressions;
  • they require a mechanism to "escape the escapes" when not intended as escape characters;
  • although easy to type, they can be cryptic to someone unfamiliar with the language; and
  • they do not protect against injection attacks.
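
The need to "escape the escapes" can be seen in a small Python sketch. The choice of backslash as escape character, comma as delimiter, and the function names are all assumptions for illustration. The writer escapes the escape character before the delimiter, and the reader takes whatever follows an escape character literally.

```python
ESCAPE = "\\"
DELIMITER = ","


def escape_field(field: str) -> str:
    """Escape the escape character first, then the delimiter."""
    return field.replace(ESCAPE, ESCAPE + ESCAPE).replace(DELIMITER, ESCAPE + DELIMITER)


def split_escaped(line: str) -> list[str]:
    """Split on unescaped delimiters, removing one level of escaping."""
    fields, current = [], []
    chars = iter(line)
    for ch in chars:
        if ch == ESCAPE:
            # The character after an escape is data, whatever it is.
            current.append(next(chars, ""))
        elif ch == DELIMITER:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


original = ["$30,000", "C:\\temp", "plain"]
line = DELIMITER.join(escape_field(field) for field in original)
print(line)
print(split_escaped(line) == original)
```

Even in this small example the encoded line already contains runs of backslashes, which hints at how quickly readability degrades.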

Escape sequence

Escape sequences are similar to escape characters, except they usually consist of some kind of mnemonic instead of just a single character. One use is in string literals that include a doublequote (") character. For example in Perl, the code:


print "Nancy said \x22Hello World!\x22 to the crowd."; ### use \x22


produces the same output as:


print "Nancy said \"Hello World!\" to the crowd."; ### use escape char


One drawback of escape sequences, when used by people, is the need to memorize the codes that represent individual characters (see also: character entity reference, numeric character reference).

Dual quoting delimiters

In contrast to escape sequences and escape characters, dual delimiters provide yet another way to avoid delimiter collision. Some languages, for example, allow the use of either a single quote (') or a double quote (") to specify a string literal. For example in Perl:


print 'Nancy said "Hello World!" to the crowd.';


produces the desired output without requiring escapes. This approach, however, only works when the string does not contain both types of quotation marks.

Padding quoting delimiters

In contrast to escape sequences and escape characters, padding delimiters provide yet another way to avoid delimiter collision. Visual Basic, for example, uses double quotes as string delimiters, and a double quote inside a string is written by doubling it. This is similar to escaping the delimiter.


print "Nancy said ""Hello World!"" to the crowd."


produces the desired output without requiring escapes. Like regular escaping it can, however, become confusing when many quotes are used.
The code to print the above source code would look more confusing:


print "print ""Nancy said """"Hello World!"""" to the crowd."""

Multiple quoting delimiters

In contrast to dual delimiters, multiple delimiters are even more flexible for avoiding delimiter collision.

For example in Perl:

print qq^Nancy doesn't want to say "Hello World!" anymore.^;
print qq@Nancy doesn't want to say "Hello World!" anymore.@;
print qq(Nancy doesn't want to say "Hello World!" anymore.);

all produce the desired output through use of the quotelike operator, which allows any convenient character to act as a delimiter. Although this method is more flexible, few languages support it. Perl and Ruby are two that do.

Content boundary

A content boundary is a special type of delimiter that is specifically designed to resist delimiter collision. It works by allowing the author to specify a sequence of characters that is guaranteed to always indicate a boundary between parts in a multi-part message, with no other possible interpretation.

The delimiter is frequently generated from a random sequence of characters that is statistically improbable to occur in the content. This may be followed by an identifying mark such as a UUID, a timestamp, or some other distinguishing mark. Alternatively, the content may be scanned to guarantee that a delimiter does not appear in the text. This may allow the delimiter to be shorter or simpler, and increase the human readability of the document. (See, e.g., MIME and here documents.)
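
Both ideas, a random boundary and a scan of the content, can be combined in a short Python sketch. The function names and the "--boundary" line layout are assumptions loosely modelled on multipart messages, not an implementation of any specific format.

```python
import secrets


def make_boundary(parts):
    """Draw a random boundary and re-draw in the unlikely case it occurs in a part."""
    while True:
        boundary = "boundary-" + secrets.token_hex(16)
        if not any(boundary in part for part in parts):
            return boundary


def join_parts(parts):
    """Return the chosen boundary and the parts joined by a boundary line."""
    boundary = make_boundary(parts)
    marker = f"\n--{boundary}\n"
    return boundary, marker.join(parts)


def split_parts(boundary, message):
    """Recover the parts by splitting on the boundary line."""
    return message.split(f"\n--{boundary}\n")


parts = ["It's, tricky", "line one\nline two", "--looks-like-a-boundary"]
boundary, message = join_parts(parts)
print(split_parts(boundary, message) == parts)
```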

Whitespace or indentation

Some programming and computer languages allow the use of whitespace delimiters or indentation as a means of specifying boundaries between independent regions in text.

Regular expression syntax

In specifying a regular expression, alternate delimiters may also be used to simplify the syntax for match and substitution operations in Perl.

For example, a simple match operation may be specified in Perl with the following syntax:


$string1 = 'Nancy said "Hello World!" to the crowd.'; # specify a target string
print $string1 =~ m/[aeiou]+/; # match one or more vowels


The syntax is flexible enough to specify match operations with alternate delimiters, making it easy to avoid delimiter collision:


$string1 = 'Nancy said "http://Hello/World.htm" is not a valid address.'; # target string

print $string1 =~ m@http://@; # match using alternate regular expression delimiter
print $string1 =~ m{http://}; # same as previous, but different delimiter
print $string1 =~ m!http://!; # same as previous, but different delimiter.


Here document

A here document allows the inclusion of arbitrary content by describing a special end sequence. Many languages support this, including PHP, Bash scripts and Perl. A here document starts by describing what the end sequence will be and continues until that sequence is seen at the start of a new line.

Here is an example in Perl:


print <<ENDOFHEREDOC;
It's very hard to encode a string with "certain characters".

Newlines, commas, and other characters can cause delimiter collisions.
ENDOFHEREDOC


This code would print:
It's very hard to encode a string with "certain characters".

Newlines, commas, and other characters can cause delimiter collisions.

By using a special end sequence all manner of characters are allowed in the string.

ASCII armor

Although principally used as a mechanism for text encoding of binary data, ASCII armoring is a programming and systems administration technique that also helps to avoid delimiter collision in some circumstances. This technique differs from the other approaches described above in that it is more complicated, and therefore not suitable for small applications and simple data storage formats. The technique employs a special encoding scheme, such as base64, to ensure that delimiter characters do not appear in transmitted data.

This technique is used, for example, in Microsoft's ASP.NET web development technology, and is closely associated with the "VIEWSTATE" component of that system.

Example

The following simplified example demonstrates how this technique works in practice.

The first code fragment shows a simple HTML tag in which the VIEWSTATE value contains characters that are incompatible with the delimiters of the HTML tag itself:





This first code fragment is not well-formed, and would therefore not work properly in a "real world" deployed system.

In contrast, the second code fragment shows the same HTML tag, except this time incompatible characters in the VIEWSTATE value are removed through the application of base64 encoding:





This prevents delimiter collision and ensures that incompatible characters will not appear inside the HTML code, regardless of what characters appear in the original (decoded) text.
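
The effect of the encoding can be sketched in Python with the standard base64 module. The field value and the hidden-input markup below are hypothetical and serve only to show that the armored value contains no quote characters that could collide with the attribute delimiters.

```python
import base64

# Hypothetical value containing the same quote character that delimits the attribute.
raw_value = 'Nancy said "Hello World!" to the crowd.'

# Base64 output uses only letters, digits, '+', '/' and '=' padding.
armored = base64.b64encode(raw_value.encode("utf-8")).decode("ascii")

# Illustrative markup: the only double quotes left are the attribute delimiters.
tag = f'<input type="hidden" name="__VIEWSTATE" value="{armored}" />'

# Decoding restores the original text, quotes included.
restored = base64.b64decode(armored).decode("utf-8")
print(tag)
print(restored == raw_value)
```

The price of this approach is that the armored value is no longer human-readable and is roughly one third longer than the original.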

See also

  • Delimiter-separated values
  • String literal
  • CamelCase (used in WikiWikiWeb as an alternate method of link creation that does not require delimiters to indicate links)
  • Federal Standard 1037C (contains a simple definition for "delimiter")
  • Naming collision
  • Sigil

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 