Homoglyph
In typography, a homoglyph is one of two or more characters, or glyphs, with shapes that either appear identical or cannot be differentiated by quick visual inspection. This designation is also applied to sequences of characters sharing these properties.

The antonym is synoglyph, which refers to glyphs that look different but mean the same thing. Synoglyphs are also known as display variants. The term homograph is sometimes used synonymously with homoglyph, though in the usual linguistic sense homographs are words that are spelled the same but have different meanings, which is a property of words, not characters.

In 2008, the Unicode Consortium published its Technical Report #36 on a range of issues deriving from the visual similarity of characters, both within a single script and between different scripts.

Zero and O; one, l and I

Two common and important pairs of homoglyphs in use today are the digit zero and the capital letter O (i.e. 0 & O); and the digit one, the lowercase letter L and the uppercase i (i.e. 1, l & I). In the days of mechanical typewriters there was very little or no visual difference between these glyphs and typists treated them interchangeably as keyboarding shortcuts. In fact, most keyboards did not even have a key for the digit "1", requiring users to type the letter "l" instead, and some also omitted 0. As these same typists transitioned in the 1970s and 1980s to being computer keyboard operators, their old keyboarding habits continued with them in their new profession, and became a source of great confusion.

Most current type designs carefully distinguish between these homoglyphs, usually by drawing the digit zero narrower and by drawing the digit one with prominent serifs. Early computer print-outs went even further and marked the zero with a slash or dot, which led to a new conflict involving the Scandinavian letter "Ø". The redesign of typefaces to differentiate these homoglyphs, combined with the passing of keyboard operators trained on mechanical typewriters, has made these particular homoglyph errors less common.

I and l

In addition to resembling the digit 1 in serif fonts (l & 1), lowercase L often resembles capital I in sans-serif fonts (l & I).

Multi-letter homoglyphs

Some other combinations of letters look similar, for instance rn looks similar to m, cl looks similar to d, and vv looks similar to w.
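
The three pairs above can be turned into a rough look-alike check. The following Python sketch is illustrative only: the names `MULTI_LETTER`, `skeleton` and `look_alike` are hypothetical, the mapping contains just the pairs listed in this section, and whether two strings are actually confusable depends on the font.

```python
# Letter sequences from this section that can resemble a single letter.
# Illustrative and incomplete; real confusability depends on the typeface.
MULTI_LETTER = {"rn": "m", "cl": "d", "vv": "w"}


def skeleton(text):
    """Collapse each listed sequence to the single letter it resembles."""
    for sequence, single in MULTI_LETTER.items():
        text = text.replace(sequence, single)
    return text


def look_alike(a, b):
    """True if two different strings collapse to the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)


print(look_alike("modern", "modem"))  # "rn" collapses to "m"
print(look_alike("clog", "dog"))      # "cl" collapses to "d"
print(look_alike("cat", "cot"))       # no listed sequence involved
```

A check of this kind flags candidates for human review rather than proving confusion, since it also collapses ordinary words (for example, "corn" becomes "com").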

In certain narrow-spaced fonts (such as Tahoma), placing the letter c next to a letter such as j, l or i will create a homoglyph, such as cj cl ci (g d a).

When some characters are placed next to each other, seen together at a glance they give the visual impression of another, unrelated character. A more precise way of saying this is that some typographic ligatures can look similar to standalone glyphs. For example, the fi ligature (ﬁ) can look similar to A in some typefaces or fonts. This potential for confusion is sometimes an argument made against the use of ligatures.

Unicode homoglyphs

The Unicode character set contains many strongly homoglyphic characters. These present security risks in a variety of situations (addressed in UTR#36) and have recently been called to particular attention in regard to internationalized domain names. One might deliberately spoof a domain name by substituting one character with its homoglyph, thus creating a second domain name, not readily distinguishable from the first, that can be exploited in phishing (see main article IDN homograph attack). In many fonts the Greek letter 'Α', the Cyrillic letter 'А' and the Latin letter 'A' are visually identical, as are the Latin letter 'a' and the Cyrillic letter 'а'. A domain name can be spoofed simply by substituting one of these forms for another in a separately registered name. There are also many examples of near-homoglyphs within the same script, such as 'í' (i with an acute accent) and 'i'; É (E-acute), Ė (E dot above) and È (E-grave); and Í (I with an acute accent) and ĺ (lowercase L with acute). When discussing this specific security issue, any two sequences of similar characters may be assessed in terms of their potential to be taken as a 'homoglyph pair', or if the sequences clearly appear to be words, as 'pseudo-homographs' (noting again that these terms may themselves cause confusion in other contexts).
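
The Latin/Cyrillic substitution described above can be illustrated with a short Python sketch. It uses only the standard `unicodedata` module. The function names and the sample label are hypothetical, and the first word of a character's Unicode name is used as a crude stand-in for its script, which is an assumption of this sketch and not the Unicode Script property.

```python
import unicodedata


def scripts_in(label):
    """Return crude script tags for the letters in a label.

    The tag is the first word of the Unicode character name
    (e.g. 'LATIN', 'CYRILLIC', 'GREEK'). This is a heuristic,
    not the Unicode Script property.
    """
    tags = set()
    for ch in label:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                tags.add(name.split()[0])
    return tags


def is_mixed_script(label):
    """True if the letters of the label come from more than one script tag."""
    return len(scripts_in(label)) > 1


genuine = "example"
spoofed = "ex\u0430mple"  # third letter is CYRILLIC SMALL LETTER A

print(genuine == spoofed)        # the strings differ despite looking alike
print(is_mixed_script(genuine))
print(is_mixed_script(spoofed))
```

Mixed-script detection of this kind would not catch a label written entirely in look-alike characters from a single script, so it is at best one signal among several.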

Efforts are underway by TLD registries and Web browser designers to minimize the risks of homoglyphic confusion to the fullest extent possible. Relevant documentation will be found both on the developers' Web sites, and on an IDN Forum provided by ICANN.

A historical manifestation of homoglyphic confusion results from the use of a 'y' to represent a 'þ' when setting older English texts in typefaces that do not contain the latter character. This has led in modern times to such phenomena as "Ye olde shoppe", implying incorrectly that the word "the" was formerly written "ye" (and pronounced /jiː/). For further discussion see thorn.

External links

  • homoglyphs.net - reference table on Unicode homoglyphs to Latin characters and online tool for generating homographs from these.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 