Canonical XML
Canonical XML is a profile or subset of XML. Any XML document can be converted to Canonical XML, thus normalizing away specific kinds of minor differences while remaining an XML document. Because those specific differences are generally not considered to be "meaningful", converting to Canonical XML is a good way to determine whether two XML documents are logically "the same document" despite differences of detail.

For example, XML permits whitespace to occur at various points within start-tags, and attributes to be specified in any order. Such differences are seldom if ever used to convey meaning, and so these forms are generally considered equivalent:

<p class="a" secure="1">

<p secure = "1"
class='a' >

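In converting an arbitrary XML document to Canonical XML, attributes are written in a normative order (lexicographic by name; namespace-qualified attributes are ordered by namespace URI and then by local name), separated by a single space, with values enclosed in double quotes. Thus, the second form above would be converted to the first.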
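
As a rough illustration, the two start-tags above can be run through a canonicalizer and compared. The sketch below uses `xml.etree.ElementTree.canonicalize` from the Python standard library (Python 3.8 or later), which implements the later Canonical XML 2.0 version. That choice of tool, the added `</p>` end-tags (needed to make each fragment a complete document), and the variable names are assumptions made for this example.

```python
from xml.etree.ElementTree import canonicalize

# The two forms from the text, each closed with </p> so that it parses
# as a complete document (the end-tag is an addition for this example).
first = '<p class="a" secure="1"></p>'
second = '<p secure = "1"\nclass=\'a\' ></p>'

canonical_first = canonicalize(first)
canonical_second = canonicalize(second)

# Both inputs should serialize to the same canonical string.
print(canonical_first)
print(canonical_first == canonical_second)
```

Comparing the canonical strings, rather than the raw input, is what makes the check insensitive to attribute order, spacing, and quoting.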

Canonical XML specifies a number of other details, some of which are:
  • the UTF-8 encoding is used
  • line endings are represented using the character 0x0A (line feed)
  • whitespace in attribute values is normalized
  • entity references are expanded
  • CDATA marked sections are not used
  • empty elements are encoded as start/end pairs, not using the special empty-element syntax
  • default attributes are made explicit
  • superfluous namespace declarations are deleted
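
A few of the rules listed above can be observed in a small experiment. The sketch below again assumes Python's standard-library `xml.etree.ElementTree.canonicalize` (an implementation of Canonical XML 2.0, available from Python 3.8). The sample document and its element and attribute names are hypothetical, and only a subset of the rules is exercised: attribute whitespace normalization, CDATA removal, character reference expansion, and start/end pairs for empty elements.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical sample input containing:
#   - a line break inside an attribute value,
#   - a CDATA marked section,
#   - a hexadecimal character reference,
#   - an empty-element tag.
source = '<doc note="two\nwords"><![CDATA[a < b]]><c>&#x41;</c><br/></doc>'

canonical = canonicalize(source)
print(canonical)
```

In the output, the CDATA section is replaced by ordinary escaped text, the character reference by the character itself, and `<br/>` by a start/end pair.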


Converting a document to Canonical XML is idempotent. That is, the first conversion usually will result in a different string of characters than the original, but repeated conversions will make no further changes.
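
The idempotence property can be checked directly. This is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree.canonicalize` (Canonical XML 2.0, Python 3.8 or later) as the converter; the input document is a hypothetical example.

```python
from xml.etree.ElementTree import canonicalize

# Hypothetical non-canonical input: unsorted attributes, single quotes,
# and an empty-element tag.
original = "<item b='2' a='1'/>"

once = canonicalize(original)
twice = canonicalize(once)

print(once != original)  # the first pass changes the string
print(once == twice)     # a second pass makes no further changes
```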

According to the W3C, if two XML documents have the same canonical form, then the two documents are logically equivalent within the given application context (except for limitations regarding a few unusual cases).

However, in a special context users might care about special semantics beyond the generic logical equivalence with which Canonical XML is associated. For example, a steganography system could conceal information in an XML document by varying whitespace, attribute quoting and order, the use of hexadecimal vs. decimal numeric character references, and so on. Obviously converting such a file to Canonical XML would lose those specialized semantics. On the other hand, XML files that differ in their use of upper- vs. lower-case, or that use archaic versus modern spelling, and so on, might be considered equivalent for certain purposes. Such contexts are beyond the scope of Canonical XML.
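
Both directions of this limitation can be sketched in a few lines. The example assumes Python's standard-library `xml.etree.ElementTree.canonicalize` (Canonical XML 2.0, Python 3.8 or later); the element name `w` and the sample text are hypothetical.

```python
from xml.etree.ElementTree import canonicalize

# Hexadecimal vs. decimal character references for the same character:
# a difference that canonicalization removes, so any information hidden
# in the choice of notation is lost.
hex_form = canonicalize('<w>&#x41;</w>')
dec_form = canonicalize('<w>&#65;</w>')
print(hex_form == dec_form)

# Upper- vs. lower-case text: a difference that canonicalization keeps,
# even if some application would treat the two documents as equivalent.
upper = canonicalize('<w>COLOUR</w>')
lower = canonicalize('<w>colour</w>')
print(upper == lower)
```

An application that wants case-insensitive or spelling-insensitive comparison would need its own normalization step in addition to canonicalization.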


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 