Comma-separated values
A comma-separated values (CSV) file stores tabular data (numbers and text) in plain-text form. As a result, such a file is easily human-readable (e.g., in a text editor).

CSV is a simple file format that is widely supported by consumer, business, and scientific applications.
Among its most common uses is to move tabular data between programs that naturally operate on a more efficient or complete proprietary format.
For example, a CSV file might be used to transfer information from a database program to a spreadsheet.

Lack of a Standard

The name indicates the use of the comma to separate data fields, but "CSV" is often applied to files using other delimiters.
Variant CSV-like formats frequently arise as the format is modified to handle richer table content, for example by allowing a different field separator character (necessary if numeric fields are written with a comma instead of a decimal point) or by adding extensions that allow numbers, the separator character, or newline characters in text fields.

This has led to the assertion that there is no “CSV standard”: only the understanding that plain text is delimited by a symbol. In common usage almost any delimiter-separated text data may be referred to as a CSV file. Traditionally, however, lines in the text file represent rows in a table, and commas separate the columns. This traditional understanding is embodied in RFC 4180, which is the best known effort to formalize a CSV standard.

Examples

Example of a USA/UK CSV file (where the decimal separator is a period/full stop and the value separator is a comma):

Year,Make,Model,Length
1997,Ford,E350,2.34
2000,Mercury,Cougar,2.38

Example of an analogous German and Dutch CSV/DSV file (where the decimal separator is a comma and the value separator is a semicolon):

Year;Make;Model;Length
1997;Ford;E350;2,34
2000;Mercury;Cougar;2,38

The latter format is not RFC 4180 compliant.
Compliance could be achieved by the use of a comma instead of a semicolon as a separator and either the international notation for the representation of the decimal mark or the practice of quoting all numbers that have a decimal mark.
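
As a rough illustration of the first route to compliance, the following Python sketch rewrites the semicolon-delimited sample with a comma separator and a decimal point. It uses the standard csv module; the function name to_comma_separated and its numeric_columns parameter are hypothetical names chosen for this example, not part of any standard.

```python
import csv
import io

# The German/Dutch-style sample: semicolon delimiter, decimal comma.
SOURCE = "Year;Make;Model;Length\n1997;Ford;E350;2,34\n2000;Mercury;Cougar;2,38\n"


def to_comma_separated(text, numeric_columns=("Length",)):
    """Rewrite semicolon-delimited text as comma-delimited text.

    numeric_columns (a hypothetical parameter) names the columns whose
    decimal comma should be replaced by a decimal point.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        for name in numeric_columns:
            row[name] = row[name].replace(",", ".")
        writer.writerow(row)
    return out.getvalue()


print(to_comma_separated(SOURCE), end="")
```

Calling the function with numeric_columns=() leaves the decimal comma in place; the writer then quotes those values because they contain the separator, which corresponds to the second route described above.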

Technical background

The format dates back to the early days of business computing and is widely used to pass data between computers with different internal word sizes, data formatting needs, and so forth. For this reason, CSV files are common on all computer platforms.

CSV is a delimited text file that uses a comma to separate values (many implementations of CSV import/export tools allow other separators to be used). Simple CSV implementations will not allow field values that contain a comma or other special characters such as newlines. More sophisticated CSV implementations permit commas and other special characters in a field value. Many implementations use " (double quote) characters around values that contain reserved characters (such as commas, double quotes, or newlines); embedded double quote characters may be represented by a pair of consecutive double quotes. Some CSV implementations, such as Sybase Central, may use an escape character such as a backslash to encode reserved characters as an escape sequence.
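
The two conventions can be compared with a short Python sketch using the standard csv module. The helper write_row is a hypothetical name introduced here; the backslash style shown is one possible configuration of that module and is not claimed to match the exact output of any particular product.

```python
import csv
import io


def write_row(row, **fmt):
    """Serialize one row with csv.writer and return it as a string."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n", **fmt).writerow(row)
    return out.getvalue()


row = ["1997", "Ford", 'Super, "luxurious" truck']

# Style 1: enclose the field in double quotes and double any embedded quote.
doubled = write_row(row)

# Style 2: no enclosing quotes; reserved characters get a backslash prefix.
escaped = write_row(row, quoting=csv.QUOTE_NONE, escapechar="\\")

print(doubled, end="")   # 1997,Ford,"Super, ""luxurious"" truck"
print(escaped, end="")
```

A reader must be configured with the same convention as the writer; a file written in one style will generally be misparsed by a reader expecting the other.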

In computer science terms, a CSV file is a "flat file".

History

Comma-separated values are old technology and pre-date personal computers by more than a decade: the IBM Fortran (level G) compiler under OS/360 supported them in 1967. Comma-separated value lists were often easier to type into punched cards than fixed-column-aligned data, and were less prone to producing incorrect results if a value was punched one column off from its intended location.

The comma separated list (CSL) is a data format originally known as comma-separated values (CSV) in the oldest days of simple computers. In the industry of personal computers (then more commonly known as "Home Computers"), the most common use was small businesses generating solicitations using boilerplate form letters and mailing lists.

Some early software applications, such as word processors, allowed a stream of "variable data" to be merged between two files: a form letter, and a CSL of names, addresses, and other data fields. Many applications still do, simply because tasks requiring human input (such as constructing a list) are natural and easy using comma delimiters. CSL/CSVs were also used for simple databases.

Background

Comma separated lists date from before the earliest personal computers, but were widely used on the earliest pre-IBM PC era personal computers for tape storage backup and for interchange of database information between machines of different architectures. At that time, affordable hard drives did not exist, and many small businesses tried to achieve the benefits of computing using floppy disk based software.

No general standard specification for CSV exists. Variations between CSV implementations in different programs are quite common and can lead to interoperation difficulties. For Internet communication of CSV files, an Informational IETF document (RFC 4180 from October 2005) describes the format for the "text/csv" MIME type registered with the IANA. Another relevant specification is provided by Fielded Text, which also covers the CSV format.

Many informal documents exist that describe the CSV format. One of them provides an overview of the CSV format in the most widely used applications and explains how it can best be used and supported.

Basic rules

The basic rules common to many of these specifications are as follows:

CSV is a delimited data format that has fields/columns separated by the comma character and records/rows terminated by newlines. Fields that contain a special character (comma, newline, or double quote) must be enclosed in double quotes. If a line contains a single entry which is the empty string, it may be enclosed in double quotes. If a field's value contains a double quote character, it is escaped by placing another double quote character next to it. The CSV file format does not require a specific character encoding, byte order, or line terminator format.
  • Each record is one line terminated by a line feed (ASCII/LF=0x0A) or a carriage return and line feed pair (ASCII/CRLF=0x0D 0x0A); however, line breaks can be embedded in quoted fields.
  • Fields are separated by commas. (In locales where the comma is used as a decimal separator, the semicolon is used instead as a delimiter. The different delimiters cause problems when CSV files are exchanged, for example, between France and the USA.)

1997,Ford,E350
  • In some CSV implementations, leading and trailing spaces or tabs, adjacent to commas, are trimmed. This practice is contentious and in fact is specifically prohibited by RFC 4180, which states, "Spaces are considered part of a field and should not be ignored."

1997, Ford , E350
is not the same as
1997,Ford,E350
  • Fields with embedded commas must be enclosed within double-quote characters.

1997,Ford,E350,"Super, luxurious truck"
  • Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.

1997,Ford,E350,"Super, ""luxurious"" truck"
  • Fields with embedded line breaks must be enclosed within double-quote characters.

1997,Ford,E350,"Go get one now
they are going fast"
  • In CSV implementations that trim leading or trailing spaces, fields with such spaces must be enclosed within double-quote characters. (See comment about leading and trailing spaces above.)

1997,Ford,E350," Super luxurious truck "
  • Fields may always be enclosed within double-quote characters, whether necessary or not.

"1997","Ford","E350"
  • The first record in a CSV file may contain column names in each of the fields.

Year,Make,Model
1997,Ford,E350
2000,Mercury,Cougar
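
The quoting rules listed above can be sketched in a few lines of Python. This is a minimal hand-written illustration, not a replacement for a CSV library; the function names quote_field and format_record are hypothetical, and quoting fields with leading or trailing spaces is a defensive choice made here for readers that trim them.

```python
def quote_field(value):
    """Return the field as it should appear in a CSV record."""
    needs_quotes = (
        any(ch in value for ch in ',"\r\n')   # comma, double quote, line break
        or value != value.strip(" \t")        # leading or trailing space/tab
    )
    if not needs_quotes:
        return value
    # Each embedded double quote is represented by a pair of double quotes.
    return '"' + value.replace('"', '""') + '"'


def format_record(fields):
    """Join already-stringified fields into one CSV record (no terminator)."""
    return ",".join(quote_field(field) for field in fields)


print(format_record(["1997", "Ford", "E350", 'Super, "luxurious" truck']))
# 1997,Ford,E350,"Super, ""luxurious"" truck"
print(format_record(["1997", "Ford", "E350", " Super luxurious truck "]))
# 1997,Ford,E350," Super luxurious truck "
```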

Example

Year | Make  | Model                                  | Description            | Price
1997 | Ford  | E350                                   | ac, abs, moon          | 3000.00
1999 | Chevy | Venture "Extended Edition"             |                        | 4900.00
1999 | Chevy | Venture "Extended Edition, Very Large" |                        | 5000.00
1996 | Jeep  | Grand Cherokee                         | MUST SELL!             | 4799.00
     |       |                                        | air, moon roof, loaded |


The above table of data may be represented in CSV format as follows:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00

This CSV example illustrates that:
  • fields that contain commas, double-quotes, or line-breaks must be quoted.
  • a quote within a field must be escaped with an additional quote immediately preceding the literal quote.
  • space before and after delimiter commas may not be trimmed. This is required by RFC 4180.
  • a line break within an element must be preserved.
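
One way to check these points is to feed the CSV text above to a parser. The sketch below uses Python's standard csv module, which is only one of many possible readers; the variable names are chosen for this example.

```python
import csv
import io

CSV_TEXT = '''Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
'''

# newline="" leaves line endings untouched so the csv module can handle
# the line break embedded in the last record itself.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT, newline="")))

for row in rows:
    print(row["Model"], "|", repr(row["Description"]), "|", row["Price"])
```

The six physical lines after the header yield four records: the doubled quotes come back as single quotes, and the line break inside the last Description field is preserved.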

Application support

The CSV file format is very simple and supported by almost all spreadsheets and database management systems. Many programming languages have libraries available that support CSV files. Even modern software applications support CSV imports and/or exports because the format is so widely recognized. In fact, many applications allow .csv-named files to use any delimiter character.
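
Because a .csv file may use any delimiter, a program that reads such files sometimes has to guess it. The following sketch uses the heuristic csv.Sniffer class from Python's standard library; the helper name read_rows and the list of candidate delimiters are assumptions made for this example, and the guess can fail on small or irregular samples.

```python
import csv
import io


def read_rows(text, candidates=",;\t"):
    """Guess the delimiter from the text, then parse all of it."""
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    return dialect.delimiter, list(csv.reader(io.StringIO(text), dialect))


sample = "Year;Make;Model;Length\n1997;Ford;E350;2,34\n2000;Mercury;Cougar;2,38\n"
delimiter, rows = read_rows(sample)
print(repr(delimiter), rows[1])
```

Here the semicolon appears the same number of times on every line while the comma does not, which is the kind of regularity the heuristic relies on.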

Microsoft Excel will open .csv files, but depending on the system's regional settings, it may expect a semicolon as a separator instead of a comma, since in some languages the comma is used as the decimal separator. Also, many regional versions of Excel will not be able to deal with Unicode in CSV. One simple solution when encountering such difficulties is to change the filename extension from .csv to .txt and then open the file from an already running Excel with the "Open" command.

When pasting text data into Excel, the tab character is used as a separator: if you copy "hello<TAB>goodbye" (where <TAB> stands for a single tab character) into the clipboard and paste it into Excel, it goes into two cells. "hello,goodbye" pasted into Excel goes into one cell, including the comma.

OpenOffice.org Calc and LibreOffice Calc handle CSV files and pasted text with a Text Import dialog asking the user to manually specify the delimiters, encoding, format of columns, etc.

See also

  • Comparison of data serialization formats
  • CSV application support
  • Delimiter-separated values
  • Fielded text
  • Tab-separated values

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 