Space (punctuation)
In writing, a space ( ) is a blank area devoid of content, serving to separate words, letters, numbers, and punctuation. Conventions for interword and intersentence spaces vary among languages, and in some cases the spacing rules are quite complex.
In the classical period, Latin was written with interpuncts (centred dots) as word separators, but that practice was abandoned sometime around the year AD 200 in favour of scriptio continua, i.e., with the words running together without any word separators. In around AD 600–800, blank spaces started being inserted between words in Latin, and that practice carried over to all languages using the Latin alphabet (e.g. English). In typesetting, spaces have historically been of multiple lengths with particular space-lengths being used for specific typographic purposes, such as separating words or separating sentences or separating punctuation from words. Following the invention of the typewriter and the subsequent overlap of designer style-preferences and computer-technology limitations, much of this reader-centric variation has been lost in normal use.
In computer representation of text, spaces of various sizes, styles, or language characteristics (different space characters) are indicated with unique code points.
Spaces between words
Modern English uses a space to separate words, but not all languages follow this practice. Spaces were not used to separate words in Latin until roughly 600–800 CE. Ancient Hebrew and Arabic did use spaces, partly to compensate in clarity for the lack of vowels. Traditionally, all CJK languages have no spaces: modern Chinese and Japanese (except when written with little or no kanji) still do not, but modern Korean uses spaces.
Spaces between sentences
There have been various methods of sentence spacing in languages with a Latin-derived alphabet since the advent of movable type in the 15th century.
- One space (French spacing). This is the current convention in countries that use the modern Latin alphabet for published and final written work, as well as digital (World Wide Web) media.
- Double space (English spacing). This convention stems from the use of the monospaced font on typewriters. This historical convention was carried on by tradition until it was replaced by the single space convention in published print and digital media today.
- One widened space, typically one-and-a-third to slightly less than two times wider than a word space. This spacing has been seen in early typesetting practices (prior to the nineteenth century). It has also been used in other non-typewriter typesetting systems such as the Linotype machine and the TeX system. Modern computer-based digital fonts can adjust the spacing after terminal punctuation as well, creating a space slightly wider than a standard word space.
- No space. According to Lynne Truss, "young people" today using digital media "are now accustomed to following a full stop with a lower-case letter and no space".
There has been some controversy regarding the proper amount of sentence spacing in typeset material. The Elements of Typographic Style states that only a single word space is required for sentence spacing since "Larger spaces...are themselves punctuation."
Spaces and unit symbols
According to The Canadian Style: A Guide to Writing and Editing:
- When symbols are used, the prefix symbol and unit symbols are run together:
  - 5 cm
  - 7 hL
  - 4 dag
  - 13 kPa
- When a symbol consists entirely of letters, leave a full space between the quantity and the symbol:
  - 45 kg
- When the symbol includes a non-letter character as well as a letter, leave no space:
  - 32°C
- For the sake of clarity, a hyphen may be inserted between a numeral and a symbol used adjectivally:
  - 35-mm film
  - 60-W bulb
Variable-width general-purpose space
In computer character encodings, there is a normal general-purpose space (Unicode character U+0020; 32 decimal) whose width will vary according to the design of the typeface. Typical values range from 1/5 em to 1/3 em (in digital typography an em is equal to the nominal size of the font, so for a 10-point font the space will probably be between 2 and 3.3 points). Sophisticated fonts may have differently sized spaces for bold, italic, and small-caps faces, and compositors will often manually adjust the width of the space depending on the size and prominence of the text.
In addition to this general-purpose space, it is possible to encode a space of a specific width. See the table below for a complete list.
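The arithmetic behind the 10-point example can be checked with a short calculation. The following Python sketch is only an illustration: the function name `space_width_pt` is hypothetical, and it assumes, as the text states, that one em equals the nominal font size.

```python
from fractions import Fraction

def space_width_pt(font_size_pt, em_fraction):
    """Width of a space in points, assuming 1 em equals the nominal font size."""
    return float(font_size_pt * em_fraction)

if __name__ == "__main__":
    # Typical range for the general-purpose space: 1/5 em to 1/3 em.
    narrow = space_width_pt(10, Fraction(1, 5))
    wide = space_width_pt(10, Fraction(1, 3))
    print(f"10-point font: {narrow:.1f} pt to {wide:.1f} pt")
```

The same function applies to the fixed-width spaces, whose widths are also defined as fractions of an em.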
In monospaced proofreading copy, only em- and en-spaces are represented using this character (which is called an em-quad or an en-quad), while other types of spaces are represented with a number sign (see Number sign, section "Space", in particular).
Breaking and non-breaking spaces
By default, computer programs usually assume that, in flowing text, a line break may be inserted as necessary at the position of a space. The non-breaking space, U+00A0 (160 decimal), is intended to render the same as a normal space but prevents line-wrapping at that position.
However, some programs do not follow this intent exactly. For example, the Mozilla Firefox 3.5 series, released in 2009, correctly suppresses line-wrapping when rendering the non-breaking space but incorrectly ignores the word-spacing CSS property for non-breaking spaces. This was corrected in Firefox version 3.6, released in 2010. Other programs may also suffer from the same flaw.
The generic Unicode space is often considered insignificant when appearing at the end of a line of text, or when part of a sequence of whitespace characters, so it may be omitted or "collapsed" in such circumstances. The non-breaking space is expressly non-collapsible and may be used to indent text, though best World Wide Web practice prescribes using CSS for this purpose.
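The line-wrapping difference between the two characters can be observed outside a browser as well. The sketch below uses Python's standard `textwrap` module, which in current CPython releases breaks lines only at ASCII whitespace and therefore keeps text joined by U+00A0 together. The helper name `wrap_line` and the sample sentence are illustrative assumptions, not part of any standard.

```python
import textwrap

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE (160 decimal)

def wrap_line(text, width=18):
    """Wrap text to the given width; textwrap breaks only at ordinary whitespace."""
    return textwrap.wrap(text, width=width)

if __name__ == "__main__":
    # With an ordinary space, the number and its unit may end up on different lines.
    print(wrap_line("The distance is 10 km"))
    # With a non-breaking space, "10" and "km" stay on the same line.
    print(wrap_line("The distance is 10" + NBSP + "km"))
```

Other text-layout libraries may treat U+00A0 differently, so this behaviour should be checked for whichever tool is actually in use.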
Hair spaces around dashes
In American typography, both en dashes and em dashes are set continuous with the text (as illustrated by use in The Chicago Manual of Style, 6.80, 6.83–86). However, an em dash can optionally be surrounded with a so-called hair space, U+200A (8202 decimal), or thin space, U+2009 (8201 decimal). The thin space can be written in HTML by using the named entity &thinsp; and the hair space can be written using the numeric character reference &#x200A; or &#8202;. This space should be much thinner than a normal space, and is seldom used on its own.
Spacing | Example |
---|---|
Normal space | left right |
Normal space with em dash | left — right |
Thin space with em dash | left — right |
Hair space with em dash | left — right |
No space with em dash | left—right |
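The correspondence between the HTML references and the underlying code points can be verified with Python's standard `html.unescape` function. This is a minimal sketch; the helper name `decode` and the sample string are assumptions made for illustration.

```python
import html

THIN_SPACE = "\u2009"  # U+2009 (8201 decimal)
HAIR_SPACE = "\u200a"  # U+200A (8202 decimal)

def decode(markup):
    """Convert HTML character references in a string to the characters they denote."""
    return html.unescape(markup)

if __name__ == "__main__":
    sample = decode("left&thinsp;&mdash;&thinsp;right")
    # Show the code point of every character that is not a plain ASCII letter.
    print([f"U+{ord(ch):04X}" for ch in sample if not ch.isascii()])
    # Hexadecimal and decimal references name the same hair space.
    print(decode("&#x200A;") == decode("&#8202;") == HAIR_SPACE)
```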
Spaces in Unicode
Unicode defines several space characters with specific semantics and rendering characteristics, as shown in the table below:
Code | Dec | HTML | Name | Block | Description |
---|---|---|---|---|---|
U+0020 | 32 | | Space | Basic Latin | Normal space, same as ASCII character 0x20. |
U+00A0 | 160 | &nbsp; | No-Break Space | Latin-1 Supplement | Identical to U+0020, but not a point at which a line may be broken. |
U+1680 | 5760 | | Ogham Space Mark | Ogham | Used for interword separation in Ogham text. Normally a vertical line in vertical text or a horizontal line in horizontal text, but may also be a blank space in "stemless" fonts. Requires an Ogham font. |
U+180E | 6158 | | Mongolian Vowel Separator (MVS) | Mongolian | A narrow space character (not to be confused with the thin space, below) used in Mongolian to cause the final two characters of a word to take on different shapes. |
U+2000 | 8192 | | En Quad | General Punctuation | Canonically equivalent to U+2002 En Space; U+2002 is preferred. |
U+2001 | 8193 | | Em Quad (mutton quad) | General Punctuation | Canonically equivalent to U+2003 Em Space; U+2003 is preferred. |
U+2002 | 8194 | &ensp; | En Space (nut) | General Punctuation | Width of one en (half of one em). U+2000 En Quad is canonically equivalent to this character (En Space is preferred). |
U+2003 | 8195 | &emsp; | Em Space (mutton) | General Punctuation | Width of one em. U+2001 Em Quad is canonically equivalent to this character (Em Space is preferred). |
U+2004 | 8196 | | Three-Per-Em Space (thick space) | General Punctuation | One third of an em wide. |
U+2005 | 8197 | | Four-Per-Em Space (mid space) | General Punctuation | One fourth of an em wide. |
U+2006 | 8198 | | Six-Per-Em Space | General Punctuation | One sixth of an em wide. In computer typography sometimes equated to U+2009. |
U+2007 | 8199 | | Figure Space | General Punctuation | In fonts with monospaced digits, equal to the width of one digit. |
U+2008 | 8200 | | Punctuation Space | General Punctuation | As wide as the narrow punctuation in a font, i.e. the advance width of the period or comma. |
U+2009 | 8201 | &thinsp; | Thin Space | General Punctuation | One fifth (sometimes one sixth) of an em wide. Recommended for use as a thousands separator for measures made with SI units. Unlike U+2002 to U+2008, its width may get adjusted in typesetting. |
U+200A | 8202 | | Hair Space | General Punctuation | Thinner than a thin space. |
U+200B | 8203 | | Zero Width Space (ZWSP) | General Punctuation | Used to indicate word boundaries to text processing systems when using scripts that do not use explicit spacing. |
U+200C | 8204 | &zwnj; | Zero Width Non-Joiner (ZWNJ) | General Punctuation | When placed between two characters that would otherwise be connected, a ZWNJ causes them to be printed in their final and initial forms, respectively. |
U+200D | 8205 | &zwj; | Zero Width Joiner (ZWJ) | General Punctuation | When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms. |
U+202F | 8239 | | Narrow No-Break Space | General Punctuation | Similar in function to U+00A0 No-Break Space. Introduced in Unicode 3.0 for Mongolian, to separate a suffix from the word stem without indicating a word boundary. When used with Mongolian, its width is usually one third of the normal space; in other contexts, its width resembles that of the Thin Space (U+2009), at least with some fonts. This character is also used in French before the characters ; ? ! » and after «. |
U+205F | 8287 | | Medium Mathematical Space (MMSP) | General Punctuation | Used in mathematical formulae. Four-eighteenths of an em. In mathematical typography, the widths of spaces are usually given in integral multiples of an eighteenth of an em, and 4/18 em may be used in several situations, for example between the a and the + and between the + and the b in the expression a + b. |
U+2060 | 8288 | | Word Joiner (WJ) | General Punctuation | Identical to U+200B, but not a point at which a line may be broken. Introduced in Unicode 3.2 to replace the deprecated "zero width no-break space" function of the U+FEFF character. |
U+3000 | 12288 | | Ideographic Space | CJK Symbols and Punctuation | As wide as a CJK character cell (fullwidth). |
U+FEFF | 65279 | | Zero Width No-Break Space = Byte Order Mark (BOM) | Arabic Presentation Forms-B | Used primarily as a Byte Order Mark. Use as an indication of non-breaking is deprecated as of Unicode 3.2; see U+2060 instead. |
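The code points, decimal values, and names in this table can be cross-checked against the Unicode character database bundled with Python. The following is a small sketch using the standard `unicodedata` module; the selection of code points and the helper name `describe` are illustrative choices, and the reported general category can differ between Unicode versions for some characters.

```python
import unicodedata

# A few of the code points listed in the table.
SPACE_CODE_POINTS = [0x0020, 0x00A0, 0x2002, 0x2003, 0x2009, 0x200A, 0x202F, 0x3000]

def describe(code_point):
    """Return (code, decimal value, Unicode name, general category) for a code point."""
    ch = chr(code_point)
    return (f"U+{code_point:04X}", code_point,
            unicodedata.name(ch), unicodedata.category(ch))

if __name__ == "__main__":
    for cp in SPACE_CODE_POINTS:
        print(*describe(cp), sep=" | ")
```

The general category `Zs` ("space separator") is what the Unicode database assigns to the characters in this selection.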
Unicode also provides some visible characters to stand in for space when necessary:
Code | Dec | Name | Block | Display | Description |
---|---|---|---|---|---|
U+00B7 | 183 | Middle dot | Latin-1 Supplement | · | Interpunct, used in text processors. HTML: &middot; |
U+237D | 9085 | Shouldered open box | Miscellaneous Technical | ⍽ | Used for NBSP |
U+2420 | 9248 | Symbol for space | Control Pictures | ␠ | |
U+2422 | 9250 | Blank symbol | Control Pictures | ␢ | |
U+2423 | 9251 | Open box | Control Pictures | ␣ | |
Use of the space in computing
In programming language syntax, spaces are frequently used to explicitly separate tokens. Aside from this use, spaces and other whitespace characters are usually ignored by modern programming languages. Exceptions are Haskell, occam, ABC, and Python, which use the amount of whitespace in indentation to indicate the bounds of a block, and a whimsical language called Whitespace, where whitespace is the only meaningful syntactical element.
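The role of indentation in Python can be made visible by parsing two snippets that differ only in leading spaces. The sketch below uses the standard `ast` module; the snippet contents and the helper name `loop_body_size` are made up for illustration.

```python
import ast

# The two sources differ only in the indentation of the last line.
OUTSIDE_LOOP = "for i in range(3):\n    total = i\nprint('done')\n"
INSIDE_LOOP = "for i in range(3):\n    total = i\n    print('done')\n"

def loop_body_size(source):
    """Count the statements that belong to the first top-level statement (the loop)."""
    tree = ast.parse(source)
    loop = tree.body[0]
    return len(loop.body)

if __name__ == "__main__":
    print(loop_body_size(OUTSIDE_LOOP))  # print() is a separate top-level statement
    print(loop_body_size(INSIDE_LOOP))   # print() is part of the loop body
```

Four extra spaces move the `print` call into the loop, which is exactly the block-delimiting use of whitespace described above.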
In commands handled by a command processor, whether typed interactively or run from a script, the space character can cause problems because it has two possible functions: as part of a command or parameter, or as a separator between parameters or names. Ambiguity can be prevented either by prohibiting embedded spaces, or by enclosing a name with embedded spaces between quote characters.
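The ambiguity, and its resolution by quoting, can be illustrated with Python's standard `shlex` module, which splits a command line using POSIX-shell-like rules. The file name `my file.txt` and the helper name `tokens` are hypothetical examples.

```python
import shlex

def tokens(command_line):
    """Split a command line into arguments using shell-like quoting rules."""
    return shlex.split(command_line)

if __name__ == "__main__":
    # Unquoted: the space inside the file name is read as an argument separator.
    print(tokens("cp my file.txt backup"))
    # Quoted: the embedded space is kept as part of a single argument.
    print(tokens('cp "my file.txt" backup'))
    # shlex.quote produces a safely quoted form of a name containing a space.
    print(shlex.quote("my file.txt"))
```

Other command processors use different quoting rules, so this sketch describes POSIX-style shells only.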
Text editors, word processors, and desktop publishing software differ in how they represent whitespace on the screen, and how they represent spaces at the ends of lines longer than the screen or column width. In some cases, spaces are shown simply as blank space; in other cases they may be represented by an interpunct or other symbols. Many different characters (described above) could be used to produce spaces, and non-character functions (such as margins and tab settings) can also affect whitespace.
Space characters in markup languages
Generalised markup languages, such as SGML, do not treat space characters differently from other characters. However, special-purpose markup languages may. In particular, web markup languages such as XML and HTML treat whitespace characters specially, including space characters, for programmers' convenience. One or more space characters read by conforming display-time processors of those markup languages are collapsed to zero or one space, depending on their semantic context. For example, double (or more) spaces within text are collapsed to a single space, and spaces which appear on either side of the "=" that separates an attribute name from its value have no effect on the interpretation of the document. Element end tags can contain trailing spaces, and empty-element tags in XML can contain spaces before the "/>".
In XML attribute values, literal whitespace characters (tabs and line breaks) are normalized to spaces when the document is read by a parser, and for attributes declared with a type other than CDATA, sequences of spaces are also collapsed to a single space. Whitespace in XML element content is not changed in this way by the parser, but an application receiving information from the parser may choose to apply similar rules to element content. An XML document author can use the xml:space="preserve" attribute on an element to signal that downstream applications should not alter whitespace in that element's content.
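The XML behaviour described above can be observed with Python's standard xml.etree.ElementTree parser. The sketch below is illustrative only: the element and attribute names (doc, item, note, title) are invented, and other parsers or downstream applications may treat element content differently.

```python
import xml.etree.ElementTree as ET

# Spaces around "=", before "/>", and inside the end tag do not change the result.
spaced = ET.fromstring('<doc lang = "en"><item id = "1" /></doc >')
compact = ET.fromstring('<doc lang="en"><item id="1"/></doc>')
same_attributes = (spaced.attrib == compact.attrib
                   and spaced[0].attrib == compact[0].attrib)

# A literal line break inside an attribute value is normalized to a space,
# while whitespace in element content is passed through unchanged.
note = ET.fromstring(
    '<note title="first\nsecond" xml:space="preserve">  two   words  </note>'
)
title = note.get("title")
content = note.text

# xml:space is reported like any other attribute, under the predefined xml namespace.
xml_space = note.get("{http://www.w3.org/XML/1998/namespace}space")

print(same_attributes)  # True
print(repr(title))      # 'first second'
print(repr(content))    # '  two   words  '
print(xml_space)        # preserve
```

Note that the parser only reports xml:space here; honouring it is left to the application that consumes the parsed document.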
In most HTML elements, a sequence of whitespace characters is treated as a single inter-word separator, which may manifest as a single space character when rendering text in a language that normally inserts such space between words. Conforming HTML renderers are required to apply a more literal treatment of whitespace within a few prescribed elements, such as the pre element and any element for which CSS has been used to apply pre-like whitespace processing. In such elements, space characters will not be "collapsed" into inter-word separators.
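A rough sketch of this collapsing rule, written in Python rather than taken from any browser: the function name render_text and its preserve flag are hypothetical, and the sketch ignores further details of real rendering such as the removal of spaces at the start and end of lines.

```python
import re

# ASCII whitespace: space, tab, line feed, form feed, carriage return.
_COLLAPSIBLE = re.compile(r"[ \t\n\f\r]+")


def render_text(text: str, preserve: bool = False) -> str:
    """Collapse whitespace runs to one space unless pre-like processing applies."""
    if preserve:
        # pre-like processing: keep every whitespace character as written
        return text
    return _COLLAPSIBLE.sub(" ", text)


sample = "one  two\n\tthree"
print(repr(render_text(sample)))                 # 'one two three'
print(repr(render_text(sample, preserve=True)))  # 'one  two\n\tthree'
```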
In both XML and HTML, the non-breaking space character, along with other non-"standard" spaces, is not treated as collapsible "whitespace", so it is not subject to the rules above.
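The difference can be sketched in Python by collapsing only the ASCII whitespace characters, as an approximation of the markup rules, and comparing the result with Python's own str.split(), which uses a broader, Unicode-based definition of whitespace. The sample string is invented for illustration.

```python
import re

NBSP = "\u00a0"  # U+00A0 NO-BREAK SPACE
text = f"10{NBSP}{NBSP}km   north"

# Markup-style collapsing: only ASCII whitespace runs are merged,
# so the two non-breaking spaces survive untouched.
collapsed = re.sub(r"[ \t\n\f\r]+", " ", text)
print(collapsed == f"10{NBSP}{NBSP}km north")  # True

# By contrast, str.split() with no arguments treats U+00A0 as whitespace.
print(text.split())  # ['10', 'km', 'north']
```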
See also
- Hard space
- Hyphenation
- Internal field separator
- Non-breaking space
- Paren space
- Sentence spacing
- Zero-width joiner
- Zero-width non-joiner
- For the Chinese one-character space used as respect, see Tai tou
External links
- Unicode spaces, by Jukka "Yucca" Korpela.
- Commonly confused characters