CDATA
The term CDATA, meaning character data, is used for distinct but related purposes in the markup languages SGML and XML. The term indicates that a certain portion of the document is general character data, rather than non-character data or character data with a more specific, limited structure.

CDATA sections in XML

In an XML document or external parsed entity, a CDATA section is a section of element content that is marked for the parser to interpret as only character data, not markup. A CDATA section is merely an alternative syntax for expressing character data; there is no semantic difference between character data that appears in a CDATA section and character data written in the usual syntax, in which "<" and "&" are represented by "&lt;" and "&amp;", respectively.

Syntax and interpretation

A CDATA section starts with the following sequence:

<![CDATA[

and ends with the first occurrence of the sequence:

]]>

All characters enclosed between these two sequences are interpreted as characters, not markup or entity references. For example, in a line like this:

<sender>John Smith</sender>

the opening and closing "sender" tags are interpreted as markup. However, if written like this:

<![CDATA[<sender>John Smith</sender>]]>

then the code is interpreted the same as if it had been written like this:

&lt;sender&gt;John Smith&lt;/sender&gt;

That is, the "sender" tags will have exactly the same status as the text "John Smith": they will be treated as text.
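The equivalence can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The wrapper element name "message" is an assumption made for illustration, since a CDATA section has to appear inside element content.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element "message"; the CDATA section is its content.
as_cdata = ET.fromstring("<message><![CDATA[<sender>John Smith</sender>]]></message>")
as_escaped = ET.fromstring("<message>&lt;sender&gt;John Smith&lt;/sender&gt;</message>")
as_markup = ET.fromstring("<message><sender>John Smith</sender></message>")

# Both spellings yield the same character data and no child elements.
cdata_text = as_cdata.text
escaped_text = as_escaped.text
cdata_children = len(as_cdata)

# Without CDATA or escaping, "sender" is parsed as a real child element.
markup_child_tag = as_markup[0].tag
markup_child_text = as_markup[0].text
```

In the first two cases the parser reports only text; in the third it reports a child element whose own text is "John Smith".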

Similarly, if the numeric character reference &#240; appears in element content, it will be interpreted as the single Unicode character U+00F0 (small letter eth). But if the same appears in a CDATA section, it will be parsed as six characters: ampersand, hash mark, digit 2, digit 4, digit 0, semicolon.
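As a rough illustration of the difference, the sketch below parses the same reference in both positions with Python's xml.etree.ElementTree. The element name "e" is hypothetical and only serves to hold the content.

```python
import xml.etree.ElementTree as ET

# Hypothetical element name "e", used only as a container.
in_content = ET.fromstring("<e>&#240;</e>").text
in_cdata = ET.fromstring("<e><![CDATA[&#240;]]></e>").text

# In element content the reference is resolved to one character;
# inside the CDATA section the six source characters are kept as-is.
print(len(in_content), hex(ord(in_content)))
print(len(in_cdata), list(in_cdata))
```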

Uses of CDATA sections

New authors of XML documents often misunderstand the purpose of a CDATA section, mistakenly believing that it "protects" data from being treated as ordinary character data during processing. Some APIs for working with XML documents do offer independent access to CDATA sections, but such options go beyond the normal requirements of XML processing systems and do not change the meaning of the data. Character data is character data, regardless of whether it is expressed via a CDATA section or ordinary markup.

CDATA sections are useful for writing XML code as text data within an XML document. For example, if one wishes to typeset a book with XSL explaining the use of an XML application, the XML markup to appear in the book itself will be written in the source file in a CDATA section.
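Python's xml.dom.minidom is one example of an API that exposes CDATA sections as a separate node type. The sketch below, which uses a hypothetical "note" element, suggests that the node type differs while the character data does not.

```python
from xml.dom import minidom, Node

# Hypothetical "note" element, once with a CDATA section, once with escaping.
cdata_doc = minidom.parseString("<note><![CDATA[a < b]]></note>")
escaped_doc = minidom.parseString("<note>a &lt; b</note>")

cdata_node = cdata_doc.documentElement.firstChild
text_node = escaped_doc.documentElement.firstChild

# minidom records which syntax was used in the source...
is_cdata_section = cdata_node.nodeType == Node.CDATA_SECTION_NODE
is_plain_text = text_node.nodeType == Node.TEXT_NODE

# ...but the character data itself is identical.
same_data = cdata_node.data == text_node.data
```

Code that only cares about the text can therefore treat both node types alike.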

Nesting

A CDATA section cannot contain the string "]]>", and therefore a CDATA section cannot contain nested CDATA sections. The preferred approach to encoding text that contains the sequence "]]>" is to use multiple CDATA sections, splitting each occurrence of the sequence just before the ">". For example, to encode "]]>" one would write:

<![CDATA[]]]]><![CDATA[>]]>

This means that to encode "]]>" in the middle of a CDATA section, replace all occurrences of "]]>" with the following:

]]]]><![CDATA[>

This effectively stops and restarts the CDATA section.

When generating XML "by hand", CDATA sections do not remove the need for escaping. The string ]]> (the CDATA end marker) must be escaped with a string such as ]]]]><![CDATA[>, which breaks the string across separate CDATA sections. An alternative to CDATA sections, which may be simpler in some circumstances, is to escape the single characters & and < (normally as &amp; or &#38; and &lt; or &#60;). The different approaches produce equally valid XML, and most XML parsers do not preserve the distinction between them in their output.
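Both approaches can be sketched in a few lines of Python. The helper name to_cdata, the "code" element and the sample payload are assumptions made for illustration; escape is the standard helper from xml.sax.saxutils, which also escapes ">".

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_cdata(text: str) -> str:
    """Wrap text in CDATA, splitting each "]]>" just before its ">"."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Sample payload containing the end marker as well as "&" and "<".
payload = "if (a[b[0]]> 1 && c < d) {}"

# Approach 1: CDATA sections, stopped and restarted around "]]>".
via_cdata = ET.fromstring("<code>" + to_cdata(payload) + "</code>").text

# Approach 2: ordinary character-by-character escaping.
via_escape = ET.fromstring("<code>" + escape(payload) + "</code>").text
```

After parsing, both variables hold the original payload, so a consumer cannot tell which approach the producer used.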

Use of CDATA in program output

CDATA sections in XHTML documents are liable to be parsed differently by web browsers if they render the document as HTML, since HTML parsers do not recognise the CDATA start and end markers, nor do they recognise HTML entity references such as &lt; within <script> tags. This can cause rendering problems in web browsers and can lead to cross-site scripting vulnerabilities if used to display data from untrusted sources, since the two kinds of parser will disagree on where the CDATA section ends.

Since it is useful to be able to use less-than signs (<) and ampersands (&) in web page scripts, and to a lesser extent styles, without having to remember to escape them, it is common to use CDATA markers around the text of inline <script> and <style> elements in XHTML documents. But so that the document can also be parsed by HTML parsers, which do not recognise the CDATA markers, the CDATA markers are usually commented-out, as in this JavaScript example:

<script type="text/javascript">
//<![CDATA[
document.write("<");
//]]>
</script>

or this CSS example:

<style type="text/css">
/*<![CDATA[*/
body { background-image: url("marble.png?width=300&height=300") }
/*]]>*/
</style>

This technique is only necessary when using inline scripts and stylesheets, and is language-specific. CSS stylesheets, for example, only support the second style of commenting-out (/* ... */), but CSS also has less need for the < and & characters than JavaScript, and so less need for explicit CDATA markers.
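How the two kinds of parser treat commented-out CDATA markers can be observed with Python's standard library, using html.parser as a stand-in for an HTML parser and xml.etree.ElementTree as the XML parser. This is only a sketch: the script body and the function run() are hypothetical, and real browsers implement considerably more elaborate parsing rules.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Inline script with commented-out CDATA markers; run() is hypothetical.
SCRIPT = "<script>//<![CDATA[\nif (a < b && c) { run(); }\n//]]>\n</script>"


class ScriptText(HTMLParser):
    """Collects the character data reported by the HTML parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


html_parser = ScriptText()
html_parser.feed(SCRIPT)
html_parser.close()
html_text = "".join(html_parser.chunks)  # markers survive as comment text

xml_text = ET.fromstring(SCRIPT).text  # markers are consumed by the parser
```

The HTML parser passes the markers through as part of the script, where the leading // turns them into JavaScript comments. The XML parser consumes the markers and leaves only the bare // comment lines around the code.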

CDATA-type attribute value

In Document Type Definition (DTD) files for SGML and XML, an attribute value may be designated as being of type CDATA: arbitrary character data. Within a CDATA-type attribute, character and entity reference markup is allowed and will be processed when the document is read.

For example, if an XML DTD contains

<!ATTLIST foo a CDATA #IMPLIED>

it means that elements named "foo" may optionally have an attribute named "a" which is of type CDATA. In an XML document that is valid according to this DTD, an element like this might appear:

<foo a="1 &amp; 2 are &lt; 3" />

and an XML parser would interpret the "a" attribute's value as being the character data "1 & 2 are < 3".
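A minimal sketch of this behaviour with Python's xml.etree.ElementTree follows. The internal DTD subset, including the EMPTY content model for "foo", is an assumption added to make the document self-contained. ElementTree is a non-validating parser, so the declaration is read but not enforced.

```python
import xml.etree.ElementTree as ET

# Self-contained document; the internal DTD subset is illustrative only.
DOCUMENT = """<!DOCTYPE foo [
  <!ELEMENT foo EMPTY>
  <!ATTLIST foo a CDATA #IMPLIED>
]>
<foo a="1 &amp; 2 are &lt; 3"/>"""

# Entity references in the attribute value are resolved during parsing.
value = ET.fromstring(DOCUMENT).get("a")
```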

CDATA-type entity

An SGML or XML DTD may also include entity declarations in which the token CDATA is used to indicate that the entity consists of character data. The character data may appear within the declaration itself or may be available externally, referenced by a URI. In either case, character reference and parameter entity reference markup is allowed in the entity, and will be processed as such when it is read.

CDATA-type element content

An SGML DTD may declare an element's content as being of type CDATA. Within a CDATA-type element, no markup will be processed. It is similar to a CDATA section in XML, but has no special boundary markup, as it applies to the entire element.


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 