XML Schema Language Comparison
An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntax constraints imposed by XML itself. There are several different languages available for specifying an XML schema. Each language has its strengths and weaknesses.

Note: the W3C-defined schema language is called "XML Schema", a name that is easily confused with the general notion of an XML schema language. Throughout this article, "XML schema" therefore refers to a schema written in any schema language, while "W3C XML Schema" (abbreviated WXS) refers specifically to the W3C-defined language.

Overview

Though there are a number of schema languages available, the three primary languages are Document Type Definitions (DTDs), W3C XML Schema, and RELAX NG. Each language has its own advantages and disadvantages.

This article also briefly reviews other schema languages.

The primary purpose of a schema language is to specify what the structure of an XML document can be: which elements can reside in which other elements, which attributes are and are not legal on a particular element, and so forth. A schema is roughly equivalent to a grammar for a language; it defines what the vocabulary of the language may be and what a valid "sentence" is.

Document Type Definitions

Advantages

Of the three primary languages, only DTDs can be defined inline. That is, a DTD can be embedded directly in the document it describes.
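As an illustration of an inline DTD, the following minimal sketch places the declarations in the document's own DOCTYPE (the internal subset). The element and attribute names (note, to, body, priority) are invented for this example.

```xml
<?xml version="1.0"?>
<!-- The DTD sits inside the DOCTYPE declaration of the document it describes.
     All names here are illustrative. -->
<!DOCTYPE note [
  <!ELEMENT note (to, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
  <!ATTLIST note priority (low | high) "low">
]>
<note priority="high">
  <to>Editor</to>
  <body>Please review the draft.</body>
</note>
```

A validating parser can check this file with no external schema, because everything it needs travels with the document.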

DTDs can define more than the content model. They can also declare entities, named pieces of data that the document can reference, much as a C or C++ preprocessor may have #defines that are used internally.
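The data that a DTD makes available to a document are called entities. A minimal sketch, assuming an invented entity named company and an invented memo element:

```xml
<?xml version="1.0"?>
<!DOCTYPE memo [
  <!ELEMENT memo (#PCDATA)>
  <!-- An internal general entity: a named piece of replacement text (hypothetical name). -->
  <!ENTITY company "Example Widgets Ltd.">
]>
<!-- A DTD-aware processor expands the reference below to "Example Widgets Ltd." -->
<memo>This memo is confidential to &company; staff.</memo>
```

A processor that skips the DTD has no definition for the reference, which is why heavy use of entities ties a document to its DTD.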

The DTD language is compact and highly readable, though it does require some experience to understand.

Disadvantages

The primary disadvantage of DTDs is their lack of specificity. Their content models are very basic, particularly compared to those of the other two languages.

Overuse of DTD-defined entities may make a document illegible or incomprehensible without the associated DTD. Additionally, several XML processors do not understand DTDs, typically for ease of implementation. Such processors will not recognize any DTD-defined entities that a document uses.

DTDs are not written in XML syntax, so they cannot use the various frameworks that have been built around XML. For example, an XML editor that supports writing DTDs must parse an additional language to do so.

The DTD concept for XML was borrowed from SGML, and the construct could not be changed when XML was extended with namespaces. As a result, DTDs are not namespace-aware.
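A minimal sketch of what this means in practice: a DTD matches the literal qualified name, prefix included, rather than the namespace URI. The prefix a and the URI http://example.com/ns are placeholders chosen for this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE a:note [
  <!-- The DTD knows only the literal name "a:note"; the colon means nothing special to it. -->
  <!ELEMENT a:note (#PCDATA)>
  <!ATTLIST a:note xmlns:a CDATA #FIXED "http://example.com/ns">
]>
<a:note xmlns:a="http://example.com/ns">Valid: the name matches the declaration.</a:note>
<!-- Renaming the prefix (b:note, with xmlns:b bound to the same URI) names the same
     element as far as namespaces are concerned, yet fails validation against this DTD. -->
```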

There is limited support for defining the type of the contained data. DTDs are primarily structural in nature: they cannot specify that an element contains an integer, a real number, a date, or anything of that nature.
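To make the limitation concrete, here is a small sketch with invented element names. The document is valid against its DTD even though neither value is usable as a number or a date, because #PCDATA accepts any text.

```xml
<?xml version="1.0"?>
<!DOCTYPE order [
  <!ELEMENT order (quantity, shipped)>
  <!-- #PCDATA means "character data"; a DTD has no way to say "integer" or "date". -->
  <!ELEMENT quantity (#PCDATA)>
  <!ELEMENT shipped  (#PCDATA)>
]>
<order>
  <quantity>several</quantity>
  <shipped>some time next week</shipped>
</order>
```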

Tool Support

DTDs are perhaps the most widely supported schema language for XML, since they are one of the earliest, defined before XML even had namespace support. Internal DTDs are often supported in XML processors; external DTDs are supported only slightly less often. Most large XML parsers, those that support multiple XML technologies, provide support for DTDs as well.

W3C XML Schema

Advantages over DTDs

Compared to DTDs, W3C XML Schemas are considerably more expressive. They can state constraints with much greater specificity than DTDs, they are namespace-aware, and they provide support for data types.

W3C XML Schema is written in XML itself, and therefore has a schema of its own (appropriately, written in W3C XML Schema).

W3C XML Schema has a large number of built-in and derived data types. These are specified by the W3C XML Schema specification, so all W3C XML Schema validators and processors must support them.
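A minimal sketch of built-in and derived types working together. The built-in types xs:date and xs:integer come from the specification; the element names and the derived type percent are invented for this example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <!-- A derived simple type (hypothetical name): an integer restricted to 0..100. -->
  <xs:simpleType name="percent">
    <xs:restriction base="xs:integer">
      <xs:minInclusive value="0"/>
      <xs:maxInclusive value="100"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <!-- A built-in type: the content must be a date such as 2011-05-17. -->
        <xs:element name="placed" type="xs:date"/>
        <!-- The derived type defined above. -->
        <xs:element name="discount" type="percent"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under this schema, a discount of 150 or a placed value of "yesterday" is a validation error, which a DTD could not express.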

Due to the nature of the schema language, after an XML document is validated, the entire XML document, both content and structure, can be expressed in terms of the schema itself. This functionality, known as the Post-Schema-Validation Infoset (PSVI), can be used to transform the document into a hierarchy of typed objects that can be accessed in a programming language through a neutral interface.

Commonality with RELAX NG

RELAX NG and W3C XML Schema allow for similar degrees of specificity. Both allow for a degree of modularity, to the point that a schema can be split across multiple files. And both of them are, or can be, defined in an XML language.

Advantages over RELAX NG

RELAX NG lacks any analog to PSVI. Unlike W3C XML Schema, RELAX NG was not designed with type assignment and data binding in mind.

W3C XML Schema has a formal mechanism for attaching a schema to an XML document.
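The attachment mechanism consists of attributes from the XML Schema instance namespace placed on an element of the document. A minimal sketch, in which the file name order.xsd and the element names are assumptions made for this example:

```xml
<?xml version="1.0"?>
<!-- xsi:noNamespaceSchemaLocation names a schema for elements in no namespace;
     xsi:schemaLocation (pairs of namespace URI and location) is the namespaced counterpart. -->
<order xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="order.xsd">
  <placed>2011-05-17</placed>
  <discount>15</discount>
</order>
```

The specification treats these attributes as hints, so a processor may be configured to ignore them and use a schema supplied by other means.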

RELAX NG cannot supply default attribute values for an element (that is, it cannot change the XML infoset), while W3C XML Schema can.
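A minimal sketch of a default attribute in W3C XML Schema, using an invented price element with an invented currency attribute:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="price">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:decimal">
          <!-- If an instance omits currency, validation adds currency="USD" to the infoset. -->
          <xs:attribute name="currency" type="xs:string" default="USD"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

An instance written as a bare price element containing 9.99 is then seen by a schema-aware application as carrying currency="USD", while a processor that never reads the schema sees no such attribute.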

W3C XML Schema has a rich "simple type" system built in (xs:decimal, xs:date, etc., plus derivation of custom types), while RELAX NG has only a minimal one, because it is meant to use type libraries developed independently of RELAX NG rather than grow its own. This is seen by some as a disadvantage. In practice, it is common for a RELAX NG schema to use the predefined "simple types" and "restrictions" (pattern, maxLength, etc.) of W3C XML Schema.
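The borrowing works through the datatypeLibrary attribute. A minimal sketch, in which the zip element and its five-digit pattern are invented for illustration:

```xml
<!-- datatypeLibrary points at the W3C XML Schema datatypes; the element name is illustrative. -->
<element name="zip"
         xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <data type="string">
    <!-- param elements carry the W3C XML Schema restrictions (facets). -->
    <param name="pattern">[0-9]{5}</param>
    <param name="maxLength">5</param>
  </data>
</element>
```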

In W3C XML Schema, a specific number or range of repetitions of a pattern can be expressed more elegantly than in RELAX NG. For large numbers, it is impractical to specify them in RELAX NG at all.
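A minimal sketch of a bounded repetition in W3C XML Schema, with invented element names and arbitrary bounds:

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="playlist">
    <xs:complexType>
      <xs:sequence>
        <!-- Between 2 and 500 track elements, stated directly on the declaration. -->
        <xs:element name="track" type="xs:string" minOccurs="2" maxOccurs="500"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

RELAX NG offers only optional, zeroOrMore, and oneOrMore, so the same range has to be spelled out by repeating the pattern, which is what becomes impractical for large counts.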

Disadvantages

W3C XML Schema is complex and hard to learn, although that is partially because it tries to do more than mere validation (see PSVI).

Although being written in XML is an advantage, it is also a disadvantage in some ways. The W3C XML Schema language in particular can be quite verbose, while a DTD can be terse and relatively easily editable.

WXS's formal mechanism for associating a document with a schema can also pose a potential security problem. A WXS validator that follows a URI to an arbitrary online location may read something malicious from the other end of the connection.

W3C XML Schema does not implement most of the DTD ability to provide entities to a document. While technically a comparative deficiency, this also spares it the problems that the ability can create, which can be counted as a strength.

Although W3C XML Schema's ability to add default attributes to elements is an advantage, it is a disadvantage in some ways as well. It means that an XML file may not be usable in the absence of its schema, even if the document would validate against that schema. In effect, all users of such an XML document must also implement the W3C XML Schema specification, thus ruling out minimalist or older XML parsers. It can also dramatically slow down processing of the document, as the processor must potentially download and process a second XML file (the schema).

Tool Support

WXS support exists in a number of large XML parsing packages. Xerces and the .NET Framework's Base Class Library both provide support for WXS validation.

RELAX NG

RELAX NG provides most of the advantages over DTDs that W3C XML Schema does.

Advantages over W3C XML Schema

While the language of RELAX NG can be written in XML, it also has an equivalent form that is much more like a DTD, but with greater specifying power. This form is known as the compact syntax. Tools can easily convert between these forms with no loss of features or even commenting. Even arbitrary elements specified between RELAX NG XML elements can be converted into the compact form.
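A small sketch of the correspondence between the two forms, using an invented note element. The XML form is shown, with what should be its compact-syntax equivalent given in the leading comment.

```xml
<!-- Compact-syntax equivalent: element note { attribute id { text }, text } -->
<element name="note" xmlns="http://relaxng.org/ns/structure/1.0">
  <attribute name="id">
    <text/>
  </attribute>
  <text/>
</element>
```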

RELAX NG provides very strong support for unordered content. That is, it allows the schema to state that a set of patterns may appear in any order.
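Unordered content is expressed with the interleave pattern. A minimal sketch with invented element names:

```xml
<element name="person" xmlns="http://relaxng.org/ns/structure/1.0">
  <!-- The children of interleave may appear in any order in the instance. -->
  <interleave>
    <element name="name"><text/></element>
    <element name="email"><text/></element>
    <optional>
      <element name="phone"><text/></element>
    </optional>
  </interleave>
</element>
```

Both name followed by email and email followed by name are accepted, with the optional phone allowed anywhere among them.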

RELAX NG also allows for non-deterministic content models. That is, RELAX NG allows the specification of a sequence like the following:

    <zeroOrMore>
      <ref name="odd" />
      <ref name="even" />
    </zeroOrMore>
    <optional>
      <ref name="odd" />
    </optional>

When the validator encounters something that matches the "odd" pattern, it is unknown whether this is the optional last "odd" reference or simply one in the zeroOrMore sequence without looking ahead at the data. RELAX NG allows this kind of specification. W3C XML Schema requires all of its sequences to be fully deterministic, so mechanisms like the above must be either specified in a different way or omitted altogether.

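RELAX NG allows attributes to be treated like elements in content models. In particular, this means that one can provide the following:

    <element name="some_element">
      <choice>
        <attribute name="has_name">
          <value>false</value>
        </attribute>
        <group>
          <attribute name="has_name">
            <value>true</value>
          </attribute>
          <element name="name"><text /></element>
        </group>
      </choice>
    </element>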

This block states that the element "some_element" must have an attribute named "has_name". This attribute can only take true or false as values, and if it is true, the first child element of the element must be "name", which stores text. If "name" did not need to be the first element, then the choice could be wrapped in an "interleave" element along with other elements. The order of the specification of attributes in RELAX NG has no meaning, so this block need not be the first block in the element definition.

W3C XML Schema 1.0 cannot specify such a dependency between the value of an attribute and child elements.

RELAX NG's specification only lists two built-in types (string and token), but it allows for the definition of many more. In theory, the lack of a specific list allows a processor to support data types that are very problem-domain specific.

Most RELAX NG schemas can be algorithmically converted into W3C XML Schemas and even DTDs (except when using RELAX NG features not supported by those languages, as above). The reverse is not true. As such, RELAX NG can be used as a normative version of the schema, and the user can convert it to other forms for tools that do not support RELAX NG.

Disadvantages

Most of RELAX NG's disadvantages are covered under the section on W3C XML Schema's advantages over RELAX NG.

Though RELAX NG's ability to support user-defined data types is useful, it comes with the disadvantage that the user can rely upon only two data types. In theory, this means that using a RELAX NG schema across multiple validators requires either providing the user-defined data types to each validator or using only the two basic types. In practice, however, most RELAX NG processors support the W3C XML Schema set of data types.

Tool Support

RELAX NG's tool support is significant, but it is less widespread than that of W3C XML Schema. The Mono Project's implementation of the .NET Framework includes a RELAX NG validator. The C library libxml2 provides RELAX NG support as well. Sun Microsystems's Multiple Schema Validator for Java also provides RELAX NG support.

Schematron

Schematron is an unusual schema language. Unlike the main three, it defines an XML file's syntax as a list of XPath-based rules. If the document passes these rules, then it is valid.
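A minimal sketch of such rules, assuming the ISO Schematron namespace and an invented order vocabulary (order, item, total, price). Each assert holds an XPath test that must be true for every node matched by the rule's context.

```xml
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <pattern>
    <!-- The rule applies to every order element (hypothetical vocabulary). -->
    <rule context="order">
      <assert test="count(item) &gt;= 1">An order must contain at least one item.</assert>
      <!-- A relational constraint: one value is checked against values elsewhere in the element. -->
      <assert test="@total = sum(item/@price)">The total must equal the sum of the item prices.</assert>
    </rule>
  </pattern>
</schema>
```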

Advantages

Because of its rule-based nature, Schematron's specificity is very strong. It can require that the content of an element be controlled by one of its siblings. It can also request or require that the root element, regardless of what element that happens to be, have specific attributes. It can even specify required relationships between multiple XML files.

Disadvantages

While Schematron is good at relational constructs, using it to specify the basic structure of a document, that is, which elements can go where, results in a very verbose schema.

The typical way to solve this is to combine Schematron with RELAX NG or W3C XML Schema. There are several schema processors available for both languages that support this combined form. This allows Schematron rules to specify additional constraints to the structure defined by W3C XML Schema or RELAX NG.
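A sketch of the combined form, here with a Schematron rule embedded in a RELAX NG schema as a foreign-namespace element. The element names are invented, and which Schematron namespace a given processor recognizes for embedded rules is an assumption that should be checked against that processor's documentation.

```xml
<element name="range"
         xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:sch="http://purl.oclc.org/dsdl/schematron"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <!-- Schematron adds a relational constraint that the structural schema does not state. -->
  <sch:pattern>
    <sch:rule context="range">
      <sch:assert test="number(min) &lt;= number(max)">min must not exceed max.</sch:assert>
    </sch:rule>
  </sch:pattern>
  <!-- RELAX NG defines the basic structure and the data types. -->
  <element name="min"><data type="integer"/></element>
  <element name="max"><data type="integer"/></element>
</element>
```

A plain RELAX NG validator treats the sch elements as annotations and ignores them; only a processor that supports the combined form evaluates the rule.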

Tool Support

Schematron's reference implementation is actually an XSLT transformation that transforms the Schematron document into an XSLT that validates the XML file. As such, Schematron's potential toolset is any XSLT processor, though libxml2 provides an implementation that does not require XSLT. Sun Microsystems's Multiple Schema Validator for Java has an add-on that allows it to validate RELAX NG schemas that have embedded Schematron rules.

Namespace Routing Language (NRL)

This is not technically a schema language. Its sole purpose is to direct parts of documents to individual schemas based on the namespace of the encountered elements. An NRL file is merely a list of XML namespaces, each paired with the path to the schema it corresponds to. This allows each schema to be concerned with only its own language definition, and the NRL file routes the schema validator to the correct schema file based on the namespace of each element.

This XML format is schema-language agnostic and works for just about any schema language.
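A minimal sketch of an NRL file that routes two namespaces to two schemas. The schema file names are placeholders, and the two schemas are deliberately written in different schema languages.

```xml
<rules xmlns="http://www.thaiopensource.com/validate/nrl">
  <!-- Elements in the XHTML namespace are checked against a RELAX NG schema (hypothetical file). -->
  <namespace ns="http://www.w3.org/1999/xhtml">
    <validate schema="xhtml.rng"/>
  </namespace>
  <!-- Elements in the SVG namespace are checked against a W3C XML Schema (hypothetical file). -->
  <namespace ns="http://www.w3.org/2000/svg">
    <validate schema="svg.xsd"/>
  </namespace>
</rules>
```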

See also

  • Document Type Definition
  • Document Structure Description
  • W3C XML Schema
  • RELAX NG
  • OASIS CAM
  • Schematron
  • Namespace Routing Language
  • Namespace-based Validation Dispatching Language

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 