Standard Generalized Markup Language
The SGML standard does not define SGML with formal data structures, such as parse trees; however, an SGML document is constructed of a rooted directed acyclic graph (RDAG) of physical storage units known as “entities”, which is parsed into a RDAG of structural units known as “elements”. The physical graph is loosely characterized as an entity tree, but entities might appear multiple times. Moreover, the structure graph is also loosely characterized as an element tree, but the ID/IDREF markup allows arbitrary arcs.
The results of parsing can also be understood as a data tree in different notations; where the document is the root node, and entities in other notations (text, graphics) are child nodes. SGML provides apparatus for linking to and annotating external non-SGML entities.
The SGML standard describes parsing in terms of maps and recognition modes (s9.6.1). Each entity, and each element, can have an associated notation or declared content type, which determines the kinds of references and tags which will be recognized in that entity and element. Also, each element can have an associated delimiter map (and short reference map), which determines which characters are treated as delimiters in context. The SGML standard characterizes parsing as a state machine switching between recognition modes. During parsing, there is a stack of maps that configure the scanner, while the tokenizer relates to the recognition modes.
Parsing involves traversing the dynamically-retrieved entity graph, finding/implying tags and the element structure, and validating those tags against the grammar. An unusual aspect of SGML is that the grammar (DTD) is used both passively — to recognize lexical structures, and actively — to generate missing structures and tags that the DTD has declared optional. End- and start- tags can be omitted, because they can be inferred. Loosely, a series of tags can be omitted only if there is a single, possible path in the grammar to imply them. It was this active use of grammars that made concrete SGML parsing difficult to formally characterize.
SGML uses the term validation for both recognition and generation. XML does not use the grammar (DTD) to change delimiter maps or to inform the parse modes, and does not allow tag omission; consequently, XML validation of elements is not active in the sense that SGML validation is active. SGML without a DTD (e.g. simple XML) is a grammar or a language; SGML with a DTD is a metalanguage. SGML with an SGML declaration is, perhaps, a meta-metalanguage, since it is a metalanguage whose declaration mechanism is a metalanguage.
SGML has an abstract syntax implemented by many possible concrete syntaxes; however, this is not the same usage as in an abstract syntax tree and as in a concrete syntax tree. In the SGML usage, a concrete syntax is a set of specific delimiters, while the abstract syntax is the set of names for the delimiters. The XML Infoset corresponds more to the programming language notion of abstract syntax introduced by John McCarthy.
based on Unicode. Applications of XML include XHTML, XQuery, XSLT, XForms, XPointer, JSP, SVG, RSS, Atom, XML-RPC, RDF/XML, and SOAP.
, intended it to be an application of SGML. The design of HTML (Hyper Text Markup Language) was therefore inspired by SGML tagging, but, since no clear expansion and parsing guidelines were established, most actual HTML documents are not valid SGML documents. Later, HTML was reformulated (version 2.0) to be more of an SGML application, however, the HTML markup language has many legacy- and exception- handling features that differ from SGML's requirements. HTML 4 is an SGML application that fully conforms to ISO 8879 – SGML.
The charter for the recently revived World Wide Web Consortium HTML Working Group says, "the Group will not assume that an SGML parser is used for 'classic HTML'". Although HTML syntax closely resembles SGML syntax with the default reference concrete syntax, HTML5 (reportedly) abandons conformance with SGML and explicitly defines its own serialization. It also defines an alternative XML-based XHTML serialization, which does conform to SGML (WWW).
(OED) is entirely marked up with an SGML-esque document markup language.
markup language for typesetting and documentation, is an example.
Several modern programming languages support tags as primitive token types, or now support Unicode and regular expression pattern-matching. An example is the Scala programming language.
SP and Jade, the associated DSSSL processors, are maintained by the OpenJade project, and are common parts of Linux distributions. A general archive of SGML software and materials resides at SUNET. The original HTML parser class, in Sun Microsystems' implementation of Java, is a limited-features SGML parser, using SGML terminology and concepts.
The Standard Generalized Markup Language (ISO 8879:1986 SGML) is an ISO-standard technology for defining generalized markup languages for documents. ISO 8879 Annex A.1 defines generalized markup:
Generalized markup is based on two novel postulates:
- Markup should be declarative: it should describe a document's structure and other attributes, rather than specify the processing to be performed on it. Declarative markup better anticipates unforeseen future processing needs and techniques.
- Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and databases can be used for processing documents as well.
HTML, XHTML, and XML are all examples of SGML-based languages.
Standard versions
SGML is an ISO standard: "ISO 8879:1986 Information processing — Text and office systems — Standard Generalized Markup Language (SGML)", of which there are three versions:
- Original SGML, which was accepted in October 1986, followed by a minor Technical Corrigendum.
- SGML (ENR), in 1996, resulted from a Technical Corrigendum to add extended naming rules allowing arbitrary-language and -script markup.
- SGML (ENR+WWW or WebSGML), in 1998, resulted from a Technical Corrigendum to better support XML and WWW requirements.
SGML is part of a trio of enabling ISO standards for electronic documents developed by ISO/IEC JTC1/SC34 (ISO/IEC Joint Technical Committee 1, Subcommittee 34 – Document description and processing languages):
- SGML (ISO 8879): generalized markup language.
  - SGML was reworked in 1998 into XML, a successful profile of SGML. Full SGML is rarely found or used in new projects.
- DSSSL (ISO/IEC 10179): document processing and styling language based on Scheme.
  - DSSSL was reworked into W3C XSLT and XSL-FO, which use an XML syntax. Nowadays, DSSSL is rarely used in new projects apart from Linux documentation.
- HyTime: generalized hypertext and scheduling.
  - HyTime was partially reworked into W3C XLink. HyTime is rarely used in new projects.
SGML is supported by various technical reports, in particular:
- ISO/IEC TR 9573 – Information processing – SGML support facilities – Techniques for using SGML
  - Part 13: Public entity sets for mathematics and science
    - In 2007, the W3C MathML working group agreed to assume the maintenance of these entity sets.
History
SGML descended from IBM's Generalized Markup Language (GML), which Charles Goldfarb, Edward Mosher, and Raymond Lorie developed in the 1960s. Goldfarb, editor of the international standard, coined the “GML” term using their surname initials. As a document markup language, SGML was originally designed to enable the sharing of machine-readable large-project documents in government, law, and industry. Many such documents must remain readable for several decades, a long time in the information technology field. SGML also was extensively applied by the military, and the aerospace, technical reference, and industrial publishing industries. The advent of the XML profile has made SGML suitable for widespread application for small-scale, general-purpose use.
Document validity
SGML (ENR+WWW) defines two kinds of validity. According to the revised Terms and Definitions of IS 8879 (from the public draft):
A conforming SGML document must be either a type-valid SGML document, a tag-valid SGML document, or both. Note: A user may wish to enforce additional constraints on a document, such as whether a document instance is integrally-stored or free of entity references.
A type-valid SGML document is defined by the standard as
An SGML document in which, for each document instance, there is an associated document type declaration (DTD) to whose DTD that instance conforms.
A tag-valid SGML document is defined by the standard as
An SGML document, all of whose document instances are fully tagged. There need not be a document type declaration associated with any of the instances. Note: If there is a document type declaration, the instance can be parsed with or without reference to it.
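The idea that a tag-valid instance can be checked without consulting any DTD can be sketched roughly as follows. This is only a loose approximation of "fully tagged": the function name `looks_fully_tagged` and the regular expression are assumptions made for this illustration, names are folded to lower case as in the reference concrete syntax, and empty elements, NET forms, comments, and marked sections are deliberately ignored.

```python
import re

# Assumed, simplified tag shape: <name ...> or </name>; not a real SGML tokenizer.
_TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9.-]*)[^<>]*>")

def looks_fully_tagged(instance):
    """Return True if every start tag has an explicit, properly nested end tag."""
    open_elements = []  # stack of currently open element names
    for match in _TAG.finditer(instance):
        is_end_tag = match.group(1) == "/"
        name = match.group(2).lower()  # case is not significant in the reference syntax
        if not is_end_tag:
            open_elements.append(name)
        elif not open_elements or open_elements.pop() != name:
            # An end tag with no matching open element: something was omitted.
            return False
    # Anything still open at the end lacked an explicit end tag.
    return not open_elements

print(looks_fully_tagged("<doc><p>Hi</p></doc>"))
print(looks_fully_tagged("<doc><p>Hi</doc>"))
```

No grammar is consulted at any point, which is the property that distinguishes tag-validity from type-validity; checking type-validity would additionally require the content models declared in the DTD.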
Terminology
Tag-validity was introduced in SGML (ENR+WWW) to support XML, which allows documents with no DOCTYPE declaration but which can be parsed without a grammar, or documents which have a DOCTYPE declaration that makes no XML Infoset contributions to the document. The standard calls this fully tagged. Integrally stored reflects the XML requirement that elements end in the same entity in which they started. Reference-free reflects the HTML requirement that entity references are for special characters and do not contain markup. SGML validity commentary, especially commentary that was made before 1997 or that is unaware of SGML (ENR+WWW), covers type-validity only.
The SGML emphasis on validity supports the requirement for generalized markup that markup should be rigorous. (ISO 8879 A.1)
Syntax
An SGML document may have three parts:
- the SGML Declaration,
- the Prologue, containing a DOCTYPE declaration with the various markup declarations that together make a Document Type Definition (DTD), and
- the instance itself, containing one top-most element and its contents.
An SGML document may be composed from many entities (discrete pieces of text). In SGML, the entities and element types used in the document may be specified with a DTD; the different character sets, features, delimiter sets, and keywords are specified in the SGML Declaration to create the concrete syntax of the document.
Although full SGML allows implicit markup and some other kinds of tags, the XML specification (s4.3.1) states:
For introductory information on basic, modern SGML syntax, see XML. The following material concentrates on features not in XML and is not a comprehensive summary of SGML syntax.
Optional features
SGML generalizes and supports a wide range of markup languages as found in the mid-1980s. These ranged from terse Wiki-like syntaxes to RTF-like bracketed languages to HTML-like matching-tag languages. SGML did this by a relatively simple default reference concrete syntax augmented with a large number of optional features that could be enabled in the SGML Declaration. Not every SGML parser can necessarily process every SGML document. Because each processor's System Declaration can be compared to the document's SGML Declaration, it is always possible to know whether a document is supported by a particular processor.
Many SGML features relate to markup minimization. Other features relate to parallel asynchronous markup (CONCUR), to linking processing attributes (LINK), and to embedding SGML documents within SGML documents (SUBDOC).
The notion of customizable features was not appropriate for Web use, so one goal of XML was to minimize optional features. However, XML's well-formedness rules cannot support Wiki-like languages, leaving them unstandardized and difficult to integrate with non-text information systems.
Concrete and abstract syntaxes
The usual (default) SGML concrete syntax resembles this example, which is the default HTML concrete syntax:
SGML provides an abstract syntax that can be implemented in many different types of concrete syntax. Although the markup norm is using angle brackets as start- and end-tag delimiters in an SGML document (per the standard-defined reference concrete syntax), it is possible to use other characters, provided a suitable concrete syntax is defined in the document's SGML declaration. For example, an SGML interpreter might be programmed to parse GML, wherein the tags are delimited with a left colon and a right full stop; thus, an :e prefix denotes an end tag:
:xmp.Hello, world:exmp.
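As a rough illustration of one abstract syntax sitting under two concrete syntaxes, the sketch below rewrites GML-style tags (opened by a colon, closed by a full stop, with an :e prefix for end tags) as angle-bracket tags. The function name `gml_to_reference` and the regular expression are assumptions made for this example; a real SGML parser takes its delimiters from the SGML declaration and recognizes them in context, which a regular expression does not.

```python
import re

# Assumed, simplified tag shape: ":name." where name is purely alphabetic.
_GML_TAG = re.compile(r":([A-Za-z]+)\.")

def gml_to_reference(text, element_names):
    """Rewrite GML-style tags for the given element names as <name> and </name>."""
    known = {name.lower() for name in element_names}

    def replace(match):
        name = match.group(1).lower()
        if name in known:
            return f"<{name}>"
        # An "e" prefix on a known element name marks an end tag.
        if name.startswith("e") and name[1:] in known:
            return f"</{name[1:]}>"
        return match.group(0)  # not a recognised tag; leave the text untouched

    return _GML_TAG.sub(replace, text)

print(gml_to_reference(":xmp.Hello, world:exmp.", ["xmp"]))
# prints: <xmp>Hello, world</xmp>
```

Passing the element names explicitly stands in for the information a parser would normally obtain from the document's declarations; without it, ordinary text such as "a:b." could be mistaken for a tag.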
According to the reference syntax, letter-case (upper- or lower-) is not distinguished in tag names; thus the three tags (i) <quote>, (ii) <QUOTE>, and (iii) <quOtE> are equivalent. (NOTE: A concrete syntax might change this rule via the NAMECASE NAMING declarations.)
Markup Minimization
SGML has features for reducing the number of characters required to mark up a document, which must be enabled in the SGML Declaration. SGML processors need not support every available feature, thus allowing applications to tolerate many types of inadvertent markup omissions; however, SGML systems usually are intolerant of invalid structures. XML is intolerant of syntax omissions, and does not require a DTD for checking well-formedness.
OMITTAG
DTDs specify whether or not a markup element's start- or end-tags may be omitted; SGML has rules for implying omitted tags, the OMITTAG feature. Whether a tag must be paired (as in the previous <QUOTE></QUOTE> pair example) or can occur singly (as an HTML <HR>) is defined in the DTD for the document (provided the OMITTAG feature is enabled). In this case, the XML counterpart would be the specific empty tag <hr/>, equivalent to the SGML NET-enabling start-tag, introduced in the TC2 (International Standard ISO 8879:1986, Technical Corrigendum 2, November 1999).
SHORTREF
Tags can be replaced with delimiter strings, for a terser markup, via the SHORTREF feature. This markup style is now associated with Wiki markup, e.g. wherein two equals-signs (==), at the start of a line, are the “heading start-tag”, and two equals signs (==) after that are the “heading end-tag”.
SHORTTAG
SGML markup languages whose concrete syntax enables the SHORTTAG VALUE feature do not require attribute values containing only alphanumeric characters to be enclosed within quotation marks, either double “ ” (LIT) or single ’ ’ (LITA), so that the previous markup example could be written:
One feature of SGML markup languages is the "presumptuous empty tagging", such that the empty end tag </> in <ITALICS>this</> "inherits" its value from the nearest previous full start tag, which, in this example, is <ITALICS> (in other words, it closes the most recently opened item). The expression is thus equivalent to <ITALICS>this</ITALICS>.
NET
Another feature is the NET (Null End Tag) construction: <ITALICS/this/, which is structurally equivalent to <ITALICS>this</ITALICS>.
Other features
Additionally, the SHORTTAG NETENABL IMMEDNET feature allows shortening tags surrounding an empty text value, but forbids shortening full tags:
can be written as:
Here the first slash ( / ) stands for the NET-enabling “start-tag close” (NETSC), and the second slash stands for the NET. NOTE: XML defines NETSC with a / , and NET with a > (angle bracket); hence the corresponding construct in XML appears as
.
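The empty end tag and null end tag minimizations described above can be pictured as a small rewriting pass that tracks the open elements on a stack. The sketch below is an assumption-laden illustration, not how a conforming parser is structured: the function name `expand_short_tags` and the token pattern are invented for this example, attributes are not handled, and a slash is treated as a NET only while a NET-enabled element is the innermost open element.

```python
import re

# Tokens recognised by this sketch (assumed, simplified forms):
#   <NAME>  full start tag      </NAME>  full end tag
#   </>     empty end tag       <NAME/   NET-enabling start tag
#   /       null end tag (NET), only while a NET-enabled element is innermost
_TOKEN = re.compile(r"<([A-Za-z]+)>|</([A-Za-z]+)>|(</>)|<([A-Za-z]+)/|(/)")

def expand_short_tags(text):
    """Rewrite </> and <NAME/.../ minimizations as full start and end tags."""
    out = []
    stack = []  # entries are (element name, opened with a NET-enabling start tag?)
    pos = 0
    for match in _TOKEN.finditer(text):
        out.append(text[pos:match.start()])  # copy the data before this token
        pos = match.end()
        start, end, empty_end, net_start, _slash = match.groups()
        if start:
            stack.append((start, False))
            out.append(f"<{start}>")
        elif end:
            if stack:
                stack.pop()
            out.append(f"</{end}>")
        elif empty_end:
            # The empty end tag closes the most recently opened element.
            out.append(f"</{stack.pop()[0]}>" if stack else "</>")
        elif net_start:
            stack.append((net_start, True))
            out.append(f"<{net_start}>")
        elif stack and stack[-1][1]:
            # A slash inside a NET-enabled element is the null end tag.
            out.append(f"</{stack.pop()[0]}>")
        else:
            out.append("/")  # an ordinary slash in data
    out.append(text[pos:])
    return "".join(out)

print(expand_short_tags("<ITALICS>this</>"))
print(expand_short_tags("<ITALICS/this/"))
```

Both calls produce the same fully tagged form, <ITALICS>this</ITALICS>, which is the equivalence the text describes.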
The third feature is 'text on the same line', which allows a markup item to be ended with a line-end. It is especially useful for headings and the like, and requires either SHORTREF or DATATAG minimization. For example, if the DTD includes the following declarations:
(and "RE;RS;" is a short-reference delimiter in the concrete syntax), then:
is equivalent to:
Formal characterization
SGML has many features that defied convenient description with the popular formal automata theory and the contemporary parser technology of the 1980s and the 1990s. The standard warns in Annex H:
A report on an early implementation of a parser for basic SGML, the Amsterdam SGML Parser, notes and specifies various differences.
There appears to be no definitive classification of full SGML against a known class of formal grammar. Plausible classes may include tree-adjoining grammars and adaptive grammars.
XML is described as being generally parsable like a two-level grammar for non-validated XML and a Conway-style pipeline of coroutines (lexer, parser, validator) for valid XML. The SGML productions in the ISO standard are reported to be LL(3) or LL(4). XML-class subsets are reported to be expressible using a W-grammar. According to one paper, and probably considered at an information set or parse tree level rather than a character or delimiter level:
The SGML standard does not define SGML with formal data structures, such as parse trees; however, an SGML document is constructed of a rooted directed acyclic graph (RDAG) of physical storage units known as “entities”, which is parsed into an RDAG of structural units known as “elements”. The physical graph is loosely characterized as an entity tree, but entities might appear multiple times. Moreover, the structure graph is also loosely characterized as an element tree, but the ID/IDREF markup allows arbitrary arcs.
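Why the entity structure is a rooted acyclic graph rather than a tree can be shown with a minimal Python sketch. The entity names and the `expand` function are hypothetical, and the `&name;` reference form assumes the reference concrete syntax. An entity may be referenced from several places, but a reference back to an entity that is still being expanded is rejected.

```python
import re
from collections import Counter

# General entity reference in the reference concrete syntax: &name;
REFERENCE = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(name, entities, active=(), uses=None):
    """Return the replacement text of entity `name` with nested references resolved.

    `active` is the chain of entities currently being expanded; meeting one of
    them again would make the graph cyclic, so it is reported as an error.
    `uses`, if given, counts how often each entity is reached.
    """
    if name in active:
        chain = " -> ".join(active + (name,))
        raise ValueError("recursive entity reference: " + chain)
    if uses is not None:
        uses[name] += 1

    def resolve(match):
        return expand(match.group(1), entities, active + (name,), uses)

    return REFERENCE.sub(resolve, entities[name])
```

With entities `doc`, `chap`, and `legal`, where both `doc` and `chap` reference `legal`, a `Counter` passed as `uses` records two visits to `legal`: the same node is reached along two paths, which a strict tree would not allow.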
The results of parsing can also be understood as a data tree in different notations, where the document is the root node and entities in other notations (text, graphics) are child nodes. SGML provides apparatus for linking to and annotating external non-SGML entities.
The SGML standard describes parsing in terms of maps and recognition modes (s9.6.1). Each entity, and each element, can have an associated notation or declared content type, which determines the kinds of references and tags that will be recognized in that entity and element. Also, each element can have an associated delimiter map (and short reference map), which determines which characters are treated as delimiters in context. The SGML standard characterizes parsing as a state machine switching between recognition modes. During parsing, there is a stack of maps that configure the scanner, while the tokenizer relates to the recognition modes.
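The idea of a scanner that switches recognition modes can be sketched in Python. This is a heavily simplified model, not the standard's algorithm: it assumes the reference concrete syntax, tags without attributes, and only two modes. In the mode used for elements with CDATA declared content, start-tags are not recognized and only an end-tag ends the data. The function name `scan` and the `cdata_elements` parameter are assumptions of this sketch.

```python
import re

START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)>")
END_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9.-]*)>")


def scan(text, cdata_elements=frozenset()):
    """Return a list of (kind, value) tokens; kind is 'start', 'end', or 'data'."""
    tokens = []
    pos = 0
    data_start = 0      # where the pending run of character data begins
    cdata_mode = False  # True inside an element with CDATA declared content
    while pos < len(text):
        end = END_TAG.match(text, pos)
        # In CDATA mode the start-tag open delimiter is simply not recognized.
        start = None if cdata_mode else START_TAG.match(text, pos)
        hit = end or start
        if hit is None:
            pos += 1
            continue
        if data_start < pos:
            tokens.append(("data", text[data_start:pos]))
        kind = "end" if end else "start"
        tokens.append((kind, hit.group(1)))
        # Entering a CDATA element switches mode; any end-tag switches back.
        cdata_mode = kind == "start" and hit.group(1) in cdata_elements
        pos = data_start = hit.end()
    if data_start < len(text):
        tokens.append(("data", text[data_start:]))
    return tokens
```

Scanning the same text with and without an element listed in `cdata_elements` shows the mode switch: markup-like strings inside that element are returned as plain data in the first case and as tags in the second.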
Parsing involves traversing the dynamically retrieved entity graph, finding or implying tags and the element structure, and validating those tags against the grammar. An unusual aspect of SGML is that the grammar (DTD) is used both passively, to recognize lexical structures, and actively, to generate missing structures and tags that the DTD has declared optional. End-tags and start-tags can be omitted, because they can be inferred. Loosely, a series of tags can be omitted only if there is a single possible path in the grammar to imply them. It was this active use of grammars that made concrete SGML parsing difficult to formally characterize.
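The active use of a grammar can be illustrated with a Python sketch that makes omitted end-tags explicit. It is a deliberate simplification: the grammar is reduced to a hypothetical table `allowed` listing which child elements each element may contain, content-model ordering and start-tag inference are left out, and character data is not checked.

```python
def imply_end_tags(tokens, allowed):
    """Return the token list with omitted end-tags inserted.

    tokens  -- list of (kind, value) pairs, kind in {'start', 'end', 'data'}
    allowed -- dict mapping an element name to the set of permitted children
    """
    out = []
    stack = []          # currently open elements, innermost last
    seen_root = False
    for kind, value in tokens:
        if kind == "start":
            # Close open elements until one of them may contain the new element.
            while stack and value not in allowed[stack[-1]]:
                out.append(("end", stack.pop()))
            if not stack and seen_root:
                raise ValueError(f"<{value}> is not allowed here")
            seen_root = True
            stack.append(value)
            out.append((kind, value))
        elif kind == "end":
            if value not in stack:
                raise ValueError(f"</{value}> has no matching start-tag")
            # Imply end-tags for everything opened inside the closing element.
            while stack[-1] != value:
                out.append(("end", stack.pop()))
            stack.pop()
            out.append((kind, value))
        else:
            out.append((kind, value))
    # End of document: close whatever is still open.
    while stack:
        out.append(("end", stack.pop()))
    return out
```

Given a table in which `ul` may contain `li` but `li` may not contain another `li`, a second `li` start-tag forces the implied end of the first one, because that is the only way the grammar can accept it.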
SGML uses the term validation for both recognition and generation. XML does not use the grammar (DTD) to change delimiter maps or to inform the parse modes, and does not allow tag omission; consequently, XML validation of elements is not active in the sense that SGML validation is active. SGML without a DTD (e.g. simple XML) is a grammar or a language; SGML with a DTD is a metalanguage. SGML with an SGML declaration is, perhaps, a meta-metalanguage, since it is a metalanguage whose declaration mechanism is a metalanguage.
SGML has an abstract syntax implemented by many possible concrete syntaxes; however, this is not the same usage as in an abstract syntax tree and as in a concrete syntax tree. In the SGML usage, a concrete syntax is a set of specific delimiters, while the abstract syntax is the set of names for the delimiters. The XML Infoset corresponds more to the programming language notion of abstract syntax introduced by John McCarthy.
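The SGML sense of abstract and concrete syntax can be made concrete with a short Python sketch. The delimiter role names STAGO, ETAGO, and TAGC and their strings in `REFERENCE_SYNTAX` follow the reference concrete syntax; `BRACKET_SYNTAX` and the `element` function are hypothetical, introduced here only to show the same roles bound to different characters.

```python
# Delimiter roles (abstract syntax) bound to the strings of the reference concrete syntax.
REFERENCE_SYNTAX = {"STAGO": "<", "ETAGO": "</", "TAGC": ">"}

# A hypothetical variant concrete syntax: same roles, different strings.
BRACKET_SYNTAX = {"STAGO": "[", "ETAGO": "[/", "TAGC": "]"}


def element(name, content, syntax):
    """Serialize one element using the delimiter strings of the given concrete syntax."""
    s = syntax
    return f'{s["STAGO"]}{name}{s["TAGC"]}{content}{s["ETAGO"]}{name}{s["TAGC"]}'
```

The same call produces `<ITALICS>this</ITALICS>` with `REFERENCE_SYNTAX` and `[ITALICS]this[/ITALICS]` with `BRACKET_SYNTAX`; only the binding of delimiter roles to characters differs.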
XML
The W3C XML (Extensible Markup Language) is a profile (subset) of SGML designed to ease the implementation of the parser compared to a full SGML parser, primarily for use on the World Wide Web. In addition to disabling many SGML options present in the reference syntax (such as omitting tags and nested subdocuments), XML adds a number of additional restrictions on the kinds of SGML syntax. For example, despite enabling SGML shortened tag forms, XML does not allow unclosed start or end tags. It also relies on many of the additions made by the WebSGML Annex. XML currently is more widely used than full SGML. XML has lightweight internationalization based on Unicode. Applications of XML include XHTML, XQuery, XSLT, XForms, XPointer, JSP, SVG, RSS, Atom, XML-RPC, RDF/XML, and SOAP.
HTML
While HTML was developed partially independently and in parallel with SGML, its creator, Tim Berners-Lee, intended it to be an application of SGML. The design of HTML (HyperText Markup Language) was therefore inspired by SGML tagging, but, since no clear expansion and parsing guidelines were established, most actual HTML documents are not valid SGML documents. Later, HTML was reformulated (version 2.0) to be more of an SGML application; however, the HTML markup language has many legacy-handling and exception-handling features that differ from SGML's requirements. HTML 4 is an SGML application that fully conforms to ISO 8879 – SGML.
The charter for the recently revived World Wide Web Consortium HTML Working Group says, "the Group will not assume that an SGML parser is used for 'classic HTML'". Although HTML syntax closely resembles SGML syntax with the default reference concrete syntax, HTML5 reportedly abandons conformance with SGML and explicitly defines its own serialization; it also defines an alternative XML-based XHTML serialization, which does conform to SGML (WWW).
OED
The second edition of the Oxford English Dictionary (OED) is entirely marked up with an SGML-esque document markup language.
Others
Other document markup languages are partly related to SGML and XML, but, because they cannot be parsed, validated, or otherwise processed using standard SGML and XML tools, they are not considered either SGML or XML languages; the Z Format markup language for typesetting and documentation is an example.
Several modern programming languages support tags as primitive token types, or now support Unicode and regular expression pattern-matching. An example is the Scala programming language.
Applications
Document markup languages defined using SGML are called "applications" by the standard; many pre-XML SGML applications were proprietary property of the organizations that developed them, and thus unavailable on the World Wide Web. The following list is of pre-XML SGML applications.
- TEI (Text Encoding Initiative) is an academic consortium that designs, maintains, and develops technical standards for digital-format textual representation applications.
- DocBook is a markup language originally created as an SGML application, designed for authoring technical documentation; DocBook currently is an XML application.
- CALS (Continuous Acquisition and Life-cycle Support) is a US Department of Defense (DoD) initiative for electronically capturing military documents and for linking related data and information.
- EDGAR (Electronic Data-Gathering, Analysis, and Retrieval) is a system that performs automated collection, validation, indexing, acceptance, and forwarding of submissions by companies and others who are legally required to file data and information forms with the US Securities and Exchange Commission (SEC).
- LinuxDoc: documentation for Linux packages has used the LinuxDoc SGML DTD and the DocBook XML DTD.
Open source implementations
Significant open source implementations of SGML have included:
- ASP-SGML
- ARC-SGML, by Standard Generalized Markup Language Users' Group, 1991, C language
- SGMLS, by James Clark, 1993, C language
- Project YAO, by Yuan-ze Institute of Technology, Taiwan, with Charles Goldfarb, 1994, object
- SP by James Clark, C++ language
SP and Jade, the associated DSSSL processors, are maintained by the OpenJade project, and are common parts of Linux distributions. A general archive of SGML software and materials resides at SUNET. The original HTML parser class, in Sun Microsystems' implementation of Java, is a limited-features SGML parser, using SGML terminology and concepts.
See also
- S-expression
- DSSSL – a Scheme-based processing language similar to XSL
- LaTeX
- List of general purpose markup languages
- Markup language
- SGML entity
- HyTime
External links
- Overview of SGML Resources at W3C's website.
- Introduction and Examples of Software Documentation in SGML
- SC34 Committee Records, Charles Babbage Institute – Collection on the development of SGML and other standards influential in the development of current XML tools; documents include early drafts of SGML administrative materials, documentation, working group papers, and standards for computer languages.
- A gentle introduction to SGML
- SGML Syntax Summary by Charles Goldfarb
- SGML document introducing you to SGML; Some reasons why SGML is important
- The SGML Declaration, in SGML and HTML Explained, Martin Bryan (1997)
- SGML Declarations Wayne Wohler, IBM Corporation, 1994.
- ISO 9069:1988 – Information processing – SGML support facilities – SGML Document Interchange Format (SDIF)
- ISO/IEC 9070:1991 – Information technology – SGML support facilities – Registration procedures for public text owner identifiers