Delimited
Formats that use delimiter-separated values (also DSV) store two-dimensional arrays of data by separating the values in each row with specific delimiter characters. Most database and spreadsheet programs are able to read or save data in a delimited format.
Delimited formats
Any character or sequence of characters may be used to separate the values, but the most common delimiters are the comma, tab, and colon. The vertical bar (also referred to as pipe) and space are also sometimes used. In a comma-separated values (CSV) file, the data items are separated by commas, while in a tab-separated values (TSV) file they are separated by tabs. Column headers are sometimes included as the first line, and each subsequent line is a row of data. The lines are separated by newlines.
For example, in the following data the fields in each record are delimited by commas, and the records by newlines:
"Date","Pupil","Grade"
"25 May","Bloggs, Fred","C"
"25 May","Doe, Jane","B"
"15 July","Bloggs, Fred","A"
"15 April","Muniz, Alvin ""Hank""","A"
Note the use of the double quote to enclose each field. This prevents the comma in the actual field value (Bloggs, Fred; Doe, Jane; etc.) from being interpreted as a field separator. This in turn necessitates a way to "escape" the field wrapper itself, in this case the double quote; it is customary to double the double quotes actually contained in a field, as with those surrounding "Hank". In this way, any ASCII text, including newlines, can be contained in a field.
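The quoting convention described above can be checked with a short sketch. It assumes Python's standard `csv` module, whose default dialect treats a doubled double quote inside a quoted field as one literal quote; the helper name `parse_rows` is made up for illustration.

```python
import csv
import io

# The sample from the article, including the doubled quotes around "Hank".
SAMPLE = '''"Date","Pupil","Grade"
"25 May","Bloggs, Fred","C"
"25 May","Doe, Jane","B"
"15 July","Bloggs, Fred","A"
"15 April","Muniz, Alvin ""Hank""","A"
'''


def parse_rows(text):
    """Return the records in `text` as a list of lists of field strings."""
    # newline="" leaves line endings untouched so the csv module can
    # handle them itself, including newlines inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline="")))


rows = parse_rows(SAMPLE)
for row in rows:
    print(row)
```

The comma inside "Bloggs, Fred" stays within a single field, and the doubled quotes around Hank are read back as single quote characters.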
ASCII includes several control characters that are intended to be used as delimiters. They are, by decimal code: 28 (file separator), 29 (group separator), 30 (record separator), and 31 (unit separator). Use of these characters has not achieved widespread adoption; some systems have replaced their control properties with more accepted controls such as CR/LF and TAB.
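As a rough illustration of how these control characters could serve as delimiters, the sketch below joins fields with the unit separator (31) and records with the record separator (30). The function names `encode` and `decode` are hypothetical, and the sketch assumes that field values never contain either separator.

```python
UNIT_SEP = "\x1f"    # ASCII 31, unit separator: placed between fields
RECORD_SEP = "\x1e"  # ASCII 30, record separator: placed between records


def encode(records):
    """Join a list of records (lists of strings) into one delimited string."""
    for record in records:
        for field in record:
            # Assumption of this sketch: separators never occur in the data.
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(record) for record in records)


def decode(text):
    """Split a string produced by encode() back into records."""
    if text == "":
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


records = [["25 May", "Bloggs, Fred", "C"],
           ["15 April", 'Muniz, Alvin "Hank"', "A"]]
assert decode(encode(records)) == records
```

Because commas, quotes, tabs and newlines are ordinary data here, no quoting or escaping is needed. The cost is that the separators are invisible in most editors.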
Uses and applications
Due to their widespread use, comma- and tab-delimited text files can be opened by several kinds of applications, including most spreadsheet programs and statistical analysis tools such as PSPP, without the user designating which delimiter has been used.
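How a given application detects the delimiter is not stated here. As one possible approach, the sketch below uses the `csv.Sniffer` class from Python's standard library, which guesses a dialect from a sample of the text. The function name `guess_delimiter` and the candidate list are assumptions made for illustration.

```python
import csv


def guess_delimiter(sample, candidates=",\t|;:"):
    """Guess which of `candidates` delimits the fields in `sample`.

    Raises csv.Error if the sniffer cannot decide.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


print(repr(guess_delimiter("a,b,c\n1,2,3\n4,5,6\n")))
print(repr(guess_delimiter("a\tb\tc\n1\t2\t3\n4\t5\t6\n")))
```

Detection of this kind is heuristic, so a caller would normally still allow the delimiter to be specified explicitly.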
Typically a delimited file format is indicated by a specification. Some specifications provide conventions for avoiding delimiter collision; others do not. Delimiter collision is a problem that occurs when a character that is intended as part of the data gets interpreted as a delimiter instead. Comma- and space-separated formats often suffer from this problem, since in many contexts those characters are legitimate parts of a data field.
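A small sketch can make delimiter collision concrete. It compares a plain string split, which knows nothing about quoting, with Python's standard `csv` reader on one record from the example above. The helper names `naive_split` and `quote_aware_split` are made up for illustration.

```python
import csv
import io

LINE = '"25 May","Bloggs, Fred","C"'


def naive_split(line):
    """Split on every comma, ignoring quotes: subject to delimiter collision."""
    return line.split(",")


def quote_aware_split(line):
    """Split one record using the csv module, which honours double quotes."""
    return next(csv.reader(io.StringIO(line)))


print(naive_split(LINE))        # four pieces: the pupil's name is cut in two
print(quote_aware_split(LINE))  # three fields, as intended
```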
Most such files avoid delimiter collision either by surrounding all data fields in double quotes, or by quoting only those data fields that contain the delimiter character. One problem with tab-delimited text files is that tabs are non-printing characters and difficult to distinguish from spaces; as a result, files are sometimes corrupted when people edit them by hand. Another set of problems occurs due to errors in the file structure, usually during import of a file into a database (in the example above, such an error might be a pupil's first name missing).
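The two quoting strategies mentioned above correspond to the `csv.QUOTE_ALL` and `csv.QUOTE_MINIMAL` constants in Python's standard `csv` module. The following is a minimal sketch using them; the helper name `write_rows` is hypothetical.

```python
import csv
import io


def write_rows(rows, quoting):
    """Serialize `rows` as comma-separated text using the given quoting mode."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=quoting, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


ROWS = [["25 May", "Bloggs, Fred", "C"]]

# Every field is wrapped in double quotes.
quote_all = write_rows(ROWS, csv.QUOTE_ALL)
# Only fields that contain the delimiter, a quote, or a newline are quoted.
quote_minimal = write_rows(ROWS, csv.QUOTE_MINIMAL)

print(quote_all, end="")
print(quote_minimal, end="")
```

Quoting every field is simpler to reason about; quoting only where needed produces smaller files. A quote-aware reader would be expected to handle both.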
Depending on the data itself, it may be beneficial to use non-standard characters such as the tilde (~) as delimiters. With the rising prevalence of web sites and other applications that store snippets of code in databases, simply using a double quote ("), which occurs in every hyperlink and image source tag, isn't sufficient to avoid this type of collision. Since colons (:), semi-colons (;), pipes (|), and many other characters are also used, it can be quite challenging to find a character that isn't being used elsewhere.
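One way to look for a delimiter that does not collide with the data is to scan the data first. The sketch below does this; the function name `pick_unused_delimiter` and the order of the candidate characters are assumptions, not an established convention.

```python
def pick_unused_delimiter(records, candidates="~|\t;:,"):
    """Return the first candidate character absent from every field, or None."""
    used = set()
    for record in records:
        for field in record:
            used.update(field)  # add every character of the field
    for candidate in candidates:
        if candidate not in used:
            return candidate
    return None  # every candidate collides; fall back to quoting or escaping


records = [["home", '<a href="index.html">~user</a>'],
           ["note", "a|b; c: d"]]
print(repr(pick_unused_delimiter(records)))
```

This only guarantees that the chosen character is absent from the data inspected; records added later could still contain it, which is why quoting or escaping conventions remain useful.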
See also
- Comma-separated values
- Delimiter
- Fielded text
- Tab-separated values