IDN homograph attack
The internationalized domain name (IDN) homograph attack is a way a malicious party may deceive computer users about what remote system they are communicating with, by exploiting the fact that many different characters look alike (i.e., they are homographs, hence the term for the attack). For example, a person frequenting citibank.com may be lured to click the link [сitibank.com] (punycode: xn--itibank-xjg.com/) where the Latin C is replaced with the Cyrillic С.
This kind of spoofing attack is also known as script spoofing. Unicode incorporates numerous writing systems, and, for a number of reasons, similar-looking characters such as Greek Ο, Latin O, and Cyrillic О were not assigned the same code. Their incorrect or malicious use opens the door to security attacks.
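As a rough illustration of the two paragraphs above, the following Python sketch uses only the standard library. It shows that the three look-alike capital letters carry distinct code points, and that the spoofed name from the citibank example converts to the Punycode form quoted there. The `idna` codec is Python's built-in IDNA 2003 implementation; browsers and registries may apply newer or stricter rules, so this is an approximation of what they do rather than a description of it.

```python
import unicodedata

# Greek, Latin and Cyrillic capital "O": three different code points.
lookalikes = ["\u039f", "\u004f", "\u041e"]
for ch in lookalikes:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# The spoofed name starts with Cyrillic U+0441, not Latin "c".
spoofed = "\u0441itibank.com"
genuine = "citibank.com"

print(spoofed == genuine)  # False: the strings differ

# Convert the spoofed name to its ASCII-compatible (Punycode) form.
ascii_form = spoofed.encode("idna").decode("ascii")
print(ascii_form)  # xn--itibank-xjg.com
```

The ASCII form makes the substitution visible, which is why several of the browser defenses described later fall back to it.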
The registration of homographic domain names is akin to typosquatting. The major difference is that in typosquatting the perpetrator relies on natural human typos, while in homograph spoofing the perpetrator intentionally deceives the web surfer with visually indistinguishable names. Indeed, it would be a rare accident for a web user to type, e.g., a Cyrillic letter within an otherwise English word such as "citibank". There are cases in which a registration can be both typosquatting and homograph spoofing; the pairs l/I, i/j, and 0/O are close together on keyboards and also bear a certain resemblance to each other.
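Prehistory
An early nuisance of this kind, pre-dating the Internet and even text terminals, was the confusion between "l" (lowercase "ell") / "1" (the number "one") and "O" (capital letter "oh") / "0" (the number "zero"). Some typewriters in the pre-computer era even conflated the ell and the one; users had to type a lowercase L when the number one was needed. The zero/oh confusion gave rise to the tradition of crossing zeros, so that a computer operator would type them correctly. Unicode may add greatly to this confusion with its combining characters, accents, several hyphen look-alikes, and so on, often because of inadequate rendering support, especially at small font sizes and across the wide variety of fonts.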
Even earlier, handwriting provided rich opportunities for confusion. A notable example is the etymology of the word "zenith". The translation from the Arabic "samt" involved a scribe misreading "m" as "ni". This was common in medieval blackletter, which did not connect the vertical strokes of the letters i, m, n, or u, making them difficult to distinguish when several were in a row. The latter, as well as "rn"/"m"/"rri" ("RN"/"M"/"RRI") confusion, is still possible for the human eye even with modern computer technology.
Intentional look-alike character substitution with different alphabets has also been known in various contexts. For example, Faux Cyrillic has been used as an amusement or attention-grabber, and "Volapuk encoding" was used in the early days of the Internet as a way to overcome the lack of support for the Cyrillic alphabet.
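Homographs in ASCII
ASCII has several characters or pairs of characters that look alike and are known as homoglyphs (a subgroup of homographs). Spoofing attacks based on these similarities are known as homograph spoofing attacks. Examples are 0 (the digit) and O (the letter), and "l" (lowercase L) and "I" (uppercase "i").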
In a typical example of a hypothetical attack, someone could register a domain name that appears almost identical to an existing domain but goes somewhere else. For example, the domain "rnicrosoft.com" contains "r" and "n", not "m".
Another example is G00GLE.COM, which looks much like GOOGLE.COM in some fonts.
Using a mix of uppercase and lowercase characters, googIe.com (capital i, not small L) looks much like google.com in some fonts. PayPal was a target of a phishing scam exploiting this, using the domain PayPaI.com. In certain narrow-spaced fonts such as Tahoma (the default address bar font in Windows XP), placing a c in front of a j, l or i will produce homoglyphs such as cl cj ci (d g a).
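One way to reason about the ASCII look-alikes above is to fold each name to a crude "skeleton" and compare skeletons instead of raw strings. The sketch below is a minimal illustration under stated assumptions: the folding table and the function names `ascii_skeleton` and `looks_like` are hypothetical, cover only the pairs mentioned in this section, and compare names as displayed (case-sensitive).

```python
# Hypothetical folding table: each look-alike maps to one representative.
ASCII_FOLD = str.maketrans({"0": "o", "O": "o", "1": "l", "I": "l"})


def ascii_skeleton(name: str) -> str:
    """Reduce a displayed name to a crude look-alike skeleton."""
    folded = name.translate(ASCII_FOLD).lower()
    # "rn" set tightly can pass for "m", so fold that pair as well.
    return folded.replace("rn", "m")


def looks_like(candidate: str, target: str) -> bool:
    """True when two different names share the same skeleton."""
    return candidate != target and ascii_skeleton(candidate) == ascii_skeleton(target)


print(looks_like("rnicrosoft.com", "microsoft.com"))  # True
print(looks_like("G00GLE.COM", "GOOGLE.COM"))         # True
print(looks_like("googIe.com", "google.com"))         # True
print(looks_like("PayPaI.com", "PayPal.com"))         # True
print(looks_like("example.com", "google.com"))        # False
```

Because the folding is deliberately lossy, unrelated legitimate names can share a skeleton (any name containing "rn" collides with the same name spelled with "m"), so a check like this can only flag candidates for review.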
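Homographs in internationalized domain names
In multilingual computer systems, different logical characters may have identical appearances. For example, Unicode character U+0430, Cyrillic small letter a ("а"), can look identical to Unicode character U+0061, Latin small letter a ("a"), which is the lowercase "a" used in English.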
The problem arises from the different treatment of the characters in the user's mind and the computer's programming. From the viewpoint of the user, a Cyrillic "а" within a Latin string is a Latin "a"; there is no difference in the glyphs for these characters in most fonts. However, the computer treats them differently when processing the character string as an identifier. Thus, the user's assumption of a one-to-one correspondence between the visual appearance of a name, and the named entity, breaks down.
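The mismatch between what the user sees and what the computer compares can be made concrete with a short Python sketch. It uses only the standard `unicodedata` module; the variable names are illustrative.

```python
import unicodedata

latin = "paypal"
mixed = "p\u0430ypal"  # second letter is Cyrillic U+0430

# The two strings render alike in most fonts but are not equal.
print(latin == mixed)  # False

# Report every position where the two strings differ.
differences = [
    (i, unicodedata.name(a), unicodedata.name(b))
    for i, (a, b) in enumerate(zip(latin, mixed))
    if a != b
]
print(differences)
# [(1, 'LATIN SMALL LETTER A', 'CYRILLIC SMALL LETTER A')]

# Unicode normalization does not merge the two letters either.
print(unicodedata.normalize("NFKC", mixed) == latin)  # False
```

The last line is the important one for identifier handling: compatibility normalization leaves the Cyrillic letter in place, so it cannot be relied on to collapse cross-script look-alikes.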
Internationalized domain names provide a backward-compatible way for domain names to use the full Unicode character set, and this standard is already widely supported. However, this system expanded the character repertoire from a few dozen characters in a single alphabet to many thousands of characters in many scripts; this greatly increased the scope for homograph attacks.
This opens a rich vein of opportunities for phishing and other varieties of fraud. An attacker could register a domain name that looks just like that of a legitimate website, but in which some of the letters have been replaced by homographs in another alphabet. The attacker could then send e-mail messages purporting to come from the original site, but directing people to the bogus site. The spoof site could then record information such as passwords or account details, while passing traffic through to the real site. The victims may never notice the difference until suspicious or criminal activity occurs with their accounts.
In December 2001 Evgeniy Gabrilovich and Alex Gontmakher, both from Technion, Israel, published a paper titled "The Homograph Attack", which described an attack that used Unicode URLs to spoof a website URL. To prove the feasibility of this kind of attack, the researchers successfully registered a variant of the domain name microsoft.com which incorporated Russian language characters.
These kinds of problems were anticipated before IDN was introduced, and guidelines were issued to registries to try to avoid or reduce the problem. For example, it was advised that registries only accept characters from the Latin alphabet and that of their own country, not all Unicode characters, but this advice was neglected by major TLDs.
On February 7, 2005, Slashdot reported that this exploit was disclosed by 3ric Johanson at the hacker conference Shmoocon. Web browsers supporting IDNA appeared to direct the URL http://www.pаypal.com/, in which the first a character is replaced by a Cyrillic а, to the site of the well-known payment service PayPal, but actually led to a spoofed web site with different content.
The following alphabets have characters that can be used for spoofing attacks (please note, these are only the most obvious and common, given artistic license and how much risk the spoofer will take of getting caught; the possibilities are far more numerous than can be listed here):
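Cyrillic
Cyrillic, by far, is the most commonly used alphabet for homoglyphs, largely because it contains 11 lowercase glyphs that are identical or nearly identical to Latin counterparts. The Russian letters а, с, е, о, р, х and у have optical counterparts in the basic Latin alphabet and look close or identical to a, c, e, o, p, x and y. Cyrillic З, Ч and б resemble the numerals 3, 4 and 6. Italic type generates more homoglyphs: the italic forms of дтпи (shown here in standard type) resemble g, m, n and u (though in most standard fonts д instead resembles a partial differential sign, ∂).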
If capital letters are counted, АВСЕНІЈКМОРЅТХ can substitute for ABCEHIJKMOPSTX, in addition to the capitals for the lowercase Cyrillic homoglyphs. In the Serbian alphabet and handwriting-based fonts, Cyrillic Д and Latin D are homoglyphs.
Cyrillic non-Russian problematic letters are і and i, ј and j, ѕ and s, Ғ and F, Ԍ and G, Ү and Y. Cyrillic ёїӧ can also be used if an IDN itself is being spoofed, to fake ëïö.
While Komi De (ԁ), shha (һ), palochka (Ӏ) and izhitsa (ѵ) bear strong resemblance to Latin d, h, l and v, these letters are either rare or archaic and are not widely supported in most standard fonts (they are not included in the WGL-4). Attempting to use them could cause a ransom note effect.
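The lowercase Cyrillic look-alikes listed in this section can be turned into a small folding table. The Python sketch below is illustrative only: the table holds just the ten lowercase pairs named above (written as escapes so the code points are unambiguous), and `latin_lookalike` and `imitates` are hypothetical helper names, not part of any library.

```python
# Look-alike pairs taken from the lists above (lowercase only).
CYRILLIC_TO_LATIN = {
    "\u0430": "a",  # Cyrillic a
    "\u0441": "c",  # Cyrillic es
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0445": "x",  # Cyrillic ha
    "\u0443": "y",  # Cyrillic u
    "\u0456": "i",  # Byelorussian-Ukrainian i
    "\u0458": "j",  # Cyrillic je
    "\u0455": "s",  # Cyrillic dze
}
FOLD = str.maketrans(CYRILLIC_TO_LATIN)


def latin_lookalike(label: str) -> str:
    """Replace each listed Cyrillic letter with its Latin twin."""
    return label.translate(FOLD)


def imitates(label: str, protected: str) -> bool:
    """True when a different label folds to the protected Latin name."""
    return label != protected and latin_lookalike(label) == protected


print(imitates("\u0441itibank", "citibank"))    # True
print(imitates("p\u0430yp\u0430l", "paypal"))   # True
print(imitates("citibank", "citibank"))         # False
```

A registry or mail filter could run a check of this shape against a list of protected names; a production table would need far more entries, including capitals and the other scripts discussed in this article.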
A good example is the Russian Government website. The following links provide partial screenshots that show how the same domain name looks in:
Greek
From the Greek alphabet, only omicron ο and sometimes nu ν appear identical to a Latin alphabet letter in the lowercase used for URLs. Fonts that are in italic type will feature Greek alpha α looking like a Latin a.
This list increases if close matches are also allowed (such as Greek εικηρτυωχγ for eiknptuwxy). Using capital letters, the list expands greatly. Greek ΑΒΕΗΙΚΜΝΟΡΤΧΥΖ looks identical to Latin ABEHIKMNOPTXYZ. Greek ΑΓΒΕΗΚΜΟΠΡΤΦΧ looks similar to Cyrillic АГВЕНКМОПРТФХ (as do italic Cyrillic Л and Greek Λ in certain geometric sans-serif fonts), and Greek letters κ and о look similar to Cyrillic к and о. Besides this, Greek τ and φ can be similar to Cyrillic т and ф in some fonts, Greek δ resembles Cyrillic б in the Serbian alphabet, and the Cyrillic а italicizes the same as its Latin counterpart, making it possible to substitute it for alpha or vice versa.
If an IDN itself is being spoofed, Greek beta β can be a substitute for German eszett ß in some fonts (and in fact, code page 437 treats them as equivalent), as can Greek sigma ς for ç; accented Greek substitutes όίά can usually be used for óíá in many fonts, with the last of these (alpha) again only resembling a in italic type.
Armenian
The Armenian alphabet can also contribute critical characters: several Armenian characters like օ, ո, ս, as well as capital Տ and Լ, are often completely identical to Latin characters in modern fonts. Symbols like ա can resemble Cyrillic ш. Besides that, there are symbols which look alike: ցհոօզս, which look like ghnoqu; յ, which resembles j (albeit dotless); and ք, which can resemble either p or f depending on the font. However, the use of Armenian is problematic. Not all standard fonts feature the Armenian glyphs (whereas the Greek and Cyrillic scripts are in most standard fonts). Because of this, Windows prior to Windows 7 rendered Armenian in a distinct font, Sylfaen, which supports Armenian, and the mixing of Armenian with Latin would appear obviously different if using a font other than Sylfaen or a Unicode typeface. (This is known as a ransom note effect.) The current version of Tahoma, used in Windows 7, supports Armenian (previous versions did not). Furthermore, this font differentiates Latin g from Armenian ց.
Two letters in Armenian (Ձշ) also can resemble the number 2, Յ resembles 3, while another (վ) sometimes resembles the number 4.
... and not for substitution. Furthermore, the Hebrew alphabet is written from right to left, and trying to mix it with left-to-right glyphs may cause problems.
... Either way, this amounts to abandoning non-ASCII domain names.
One problem with displaying IDNs in Punycode is that then, effectively, every such address is "a homograph" of every other. Since typical users cannot read Punycode, any Chinese site rendered in Punycode would be indistinguishable from any other Chinese site.
Firefox and Opera display Punycode for IDNs unless the top-level domain (TLD, for example, ...
Internet Explorer 7 allows IDNs except for labels that mix scripts for different languages. Labels that mix scripts are displayed in Punycode. Exceptions are made for locales where ASCII characters are commonly mixed with localized scripts.
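A display rule of this kind (show single-script labels as typed, fall back to Punycode when a label mixes scripts) can be approximated in a few lines of Python. This is a minimal sketch, not how any browser implements it: the standard library has no script property, so the first word of each letter's Unicode name is used as a stand-in for its script, and the function names `scripts_in`, `display_label` and `display_host` are hypothetical.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Approximate the scripts of a label's letters.

    Assumption: the first word of a character's Unicode name
    ("LATIN", "CYRILLIC", "GREEK", ...) stands in for its script.
    Digits and hyphens are ignored.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in label
        if ch.isalpha()
    }


def display_label(label: str) -> str:
    """Show single-script labels as typed, mixed-script ones as Punycode."""
    if len(scripts_in(label)) > 1:
        return label.encode("idna").decode("ascii")
    return label


def display_host(host: str) -> str:
    """Apply the display rule to every dot-separated label of a host name."""
    return ".".join(display_label(part) for part in host.split("."))


print(display_host("www.p\u0430ypal.com"))  # www.xn--pypal-4ve.com
print(display_host("www.paypal.com"))       # www.paypal.com
```

A label written entirely in one non-Latin script passes through this check unchanged, so a rule of this shape does nothing against a name spoofed wholly in look-alike letters from a single alphabet.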
As an additional defense, Internet Explorer 7, Firefox 2.0 and above, and Opera 9.10 include phishing filters that attempt to alert users when they visit malicious websites.
Starting with version 7, Internet Explorer can use IDNs, but it imposes restrictions on displaying non-ASCII domain names based on a user-defined list of allowed languages, and it provides an anti-phishing filter that checks suspicious websites against a remote database of known phishing sites.
On February 17, 2005, Mozilla developers announced that the next software version would keep IDN support enabled but display the Punycode URLs instead, thus thwarting some attacks exploiting similarities between ASCII and non-ASCII characters, while still permitting access to web sites in an IDN domain.
Since then, both Mozilla and Opera have announced that they will be using per-domain whitelists to selectively switch on IDN display for domains run by registries which are taking appropriate precautions against homograph spoofing attacks. As of September 9, 2005, the most recent version of Mozilla Firefox as well as the most recent Internet Explorer display the spoofed PayPal URL as "http://www.xn--pypal-4ve.com/", clearly different from the original.
Safari's approach is to render problematic character sets as Punycode. This can be changed by altering the settings in Mac OS X's system files.
Google Chrome displays an IDN only if all of its characters belong to one (and only one) of the user's preferred languages.
With the advent of internationalized country codes, spoofing will be minimized. For example, the Russian TLD .рф (a domain specifically chosen to avoid resembling a Latin TLD) only accepts Cyrillic names, forbidding the mix with Latin or Greek characters. However, the problem in .com and other gTLDs remains open.
These methods of defense only extend to within a browser. Homographic URLs that house malicious software can still be distributed, without being displayed as Punycode, through e-mail, social networking or other Web sites without being detected until the user actually clicks the link. While the fake link will show in Punycode when it is clicked, by this point the page has already begun loading into the browser and the malicious software may have already been downloaded onto the computer. Television station KBOI-TV raised these concerns when an unknown source (registering under the name "Completely Anonymous") registered a domain name homographic to its own to spread an April Fools' Day joke regarding the Governor of Idaho issuing a supposed ban on the sale of music by Justin Bieber.
Aside from its better-known, more malicious purposes, homograph spoofing can be put to benign use, such as address munging to thwart spam bots.
The following alphabets have characters that can be used for spoofing attacks (please note, these are only the most obvious and common, given artistic license and how much risk the spoofer will take of getting caught; the possibilities are far more numerous than can be listed here):
Cyrillic
Cyrillic is by far the most commonly used alphabet for homoglyphs, largely because it contains 11 lowercase glyphs that are identical or nearly identical to Latin counterparts. The Russian letters а, с, е, о, р, х and у have optical counterparts in the basic Latin alphabet and look close or identical to a, c, e, o, p, x and y. Cyrillic З, Ч and б resemble the numerals 3, 4 and 6. Italic type generates more homoglyphs: дтпи (дтпи in standard type), resembling g, m, n and u (though in most standard fonts д instead resembles a partial differential sign, ∂).
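The substitution of the seven Russian letters listed above can be illustrated with a minimal Python sketch. The table and the function names `latin_skeleton` and `looks_like` are hypothetical and introduced only for illustration; the table covers just those seven letters and is not a complete confusables list.

```python
# Hypothetical lookup table: only the seven Russian letters named in the text.
CYRILLIC_TO_LATIN = {
    "\u0430": "a",  # Cyrillic а
    "\u0441": "c",  # Cyrillic с
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0445": "x",  # Cyrillic х
    "\u0443": "y",  # Cyrillic у
}


def latin_skeleton(label: str) -> str:
    """Replace each listed Cyrillic letter with its Latin lookalike."""
    return "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in label)


def looks_like(candidate: str, target: str) -> bool:
    """True if candidate is a different string that renders like target."""
    return candidate != target and latin_skeleton(candidate) == target


print(looks_like("p\u0430ypal", "paypal"))  # True: second letter is Cyrillic а
print(looks_like("paypal", "paypal"))       # False: identical strings
```

A real checker would need a far larger table, since the other scripts discussed in this article contribute lookalikes as well.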
If capital letters are counted, АВСЕНІЈКМОРЅТХ can substitute A, B, C, E, H, I, J, K, M, O, P, S, T and X, in addition to the capitals for the lowercase Cyrillic homoglyphs. In the Serbian alphabet and handwriting-based fonts, Cyrillic Д and Latin D are homoglyphs.
Problematic non-Russian Cyrillic letters are і and i, ј and j, ѕ and s, Ғ and F, Ԍ and G, Ү and Y. Cyrillic ёїӧ can also be used if an IDN itself is being spoofed, to fake ë, ï and ö.
While Komi De (ԁ), shha (һ), palochka (Ӏ) and izhitsa (ѵ) bear strong resemblance to Latin d, h, l and v, these letters are either rare or archaic and are not widely supported in most standard fonts (they are not included in the WGL-4). Attempting to use them could cause a ransom note effect.
A good example is the Russian Government website. The following link provides partial screenshots that show how the same domain name looks in:
- (a) An IDN-enabled browser (SUNDIAL) in Russian
- (b) A non-IDN-enabled browser (CHROME) in Russian translated to Punycode
- (c) A non-IDN-enabled browser (CHROME) in English
Greek
From the Greek alphabet, only omicron ο and sometimes nu ν appear identical to a Latin alphabet letter in the lowercase used for URLs. Fonts that are in italic type will feature Greek alpha α looking like a Latin a.
This list increases if close matches are also allowed (such as Greek εικηρτυωχγ for eiknptuwxy). Using capital letters, the list expands greatly. Greek ΑΒΕΗΙΚΜΝΟΡΤΧΥΖ looks identical to Latin ABEHIKMNOPTXYZ. Greek ΑΓΒΕΗΚΜΟΠΡΤΦΧ looks similar to Cyrillic АГВЕНКМОПРТФХ (as do Cyrillic Л and Greek Λ in certain geometric sans-serif fonts), and Greek κ and ο look similar to Cyrillic к and о. In addition, Greek τ and φ can resemble Cyrillic т and ф in some fonts, Greek δ resembles Cyrillic б in the Serbian alphabet, and Cyrillic а italicizes the same way as its Latin counterpart, making it possible to substitute it for alpha or vice versa.
If an IDN itself is being spoofed, Greek beta β can be a substitute for German Eszett ß in some fonts (and in fact, code page 437 treats them as equivalent), as can Greek sigma ς for ç; accented Greek substitutes όίά can usually be used for óíá in many fonts, with the last of these (alpha) again only resembling a in italic type.
Armenian
The Armenian alphabet can also contribute critical characters: several Armenian characters like օ, ո, ս, as well as capital Տ and Լ, are often completely identical to Latin characters in modern fonts. Symbols like ա can resemble Cyrillic ш. Besides that, there are symbols which look alike: ցհոօզս, which look like ghnoqu; յ, which resembles j (albeit dotless); and ք, which can resemble either p or f depending on the font. However, the use of Armenian is problematic. Not all standard fonts feature the Armenian glyphs (whereas the Greek and Cyrillic scripts are in most standard fonts). Because of this, Windows prior to Windows 7 rendered Armenian in a distinct font, Sylfaen, which supports Armenian, and the mixing of Armenian with Latin would appear obviously different if using a font other than Sylfaen or a Unicode typeface. (This is known as a ransom note effect.) The current version of Tahoma, used in Windows 7, supports Armenian (previous versions did not). Furthermore, this font differentiates Latin g from Armenian ց.
Two letters in Armenian (Ձշ) also can resemble the number 2, Յ resembles 3, while another (վ) sometimes resembles the number 4.
Hebrew
Hebrew spoofing is generally rare. Only three letters from that alphabet can reliably be used: samekh (ס), which sometimes resembles o; vav with diacritic (וֹ), which resembles an i; and heth (ח), which resembles the letter n. Less accurate approximants for some other alphanumerics can also be found, but these are usually only accurate enough to use for the purposes of foreign branding and not for substitution. Furthermore, the Hebrew alphabet is written from right to left, and trying to mix it with left-to-right glyphs may cause problems.
Defending against the attack
The simplest defense is for web browsers not to support IDNA or other similar mechanisms, or for users to turn off whatever support their browsers have. That could mean blocking access to IDNA sites, but generally browsers permit access and just display IDNs in Punycode. Either way, this amounts to abandoning non-ASCII domain names.
One problem with displaying IDNs in Punycode is that then, effectively, every such address is "a homograph" of every other. Since typical users cannot read Punycode, any Chinese site rendered in Punycode would be indistinguishable from any other Chinese site.
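To see why Punycode output is hard for a person to tell apart, the following short sketch encodes two labels with Python's built-in `idna` codec. The two Chinese labels are arbitrary examples chosen for illustration, not names taken from the article.

```python
# Two arbitrary Chinese labels (assumed examples, not from the article).
labels = ["\u4e2d\u56fd", "\u4e2d\u6587"]

# The built-in "idna" codec converts each label to its ASCII-compatible form.
ascii_forms = [label.encode("idna").decode("ascii") for label in labels]

for label, ascii_form in zip(labels, ascii_forms):
    # Each encoded form starts with "xn--" followed by opaque letters and digits.
    print(label, "->", ascii_form)

# Decoding restores the original label, so no information is lost,
# but the ASCII form carries no meaning a reader can recognize.
restored = [form.encode("ascii").decode("idna") for form in ascii_forms]
```

Both results are short strings of ASCII letters and digits behind the `xn--` prefix, which is what makes one encoded name look much like another to a reader.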
Firefox and Opera display Punycode for IDNs unless the top-level domain (TLD, for example, .ac or .museum) prevents homograph attacks by restricting which characters can be used in domain names. They both also allow users to manually add TLDs to the allowed list. Internet Explorer 7 allows IDNs except for labels that mix scripts for different languages. Labels that mix scripts are displayed in Punycode. There are exceptions for locales where ASCII characters are commonly mixed with localized scripts.
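The mixed-script rule described above can be sketched roughly as follows. This is not how any browser actually implements it: Python's standard library has no script property, so the sketch assumes that the first word of a character's Unicode name (for example `LATIN` or `CYRILLIC`) is a good enough stand-in for its script. The function names are hypothetical, and the locale exceptions mentioned above are not modelled.

```python
import unicodedata


def scripts_in(label: str) -> set:
    """Approximate the set of scripts used by the letters of a label.

    Assumption: the first word of the Unicode character name
    (e.g. 'LATIN', 'CYRILLIC', 'GREEK') stands in for the script.
    Digits and hyphens are ignored.
    """
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}


def display_label(label: str) -> str:
    """Show single-script labels as-is and mixed-script labels in Punycode."""
    if len(scripts_in(label)) > 1:
        return "xn--" + label.encode("punycode").decode("ascii")
    return label


print(display_label("paypal"))        # paypal
print(display_label("p\u0430ypal"))   # xn--pypal-4ve (Latin mixed with Cyrillic а)
```

A label written entirely in one script passes through unchanged, so a whole-script spoof (every letter replaced by a lookalike) is not caught by this rule alone.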
As an additional defense, Internet Explorer 7, Firefox 2.0 and above, and Opera 9.10 include phishing filters that attempt to alert users when they visit malicious websites.
Starting with version 7, Internet Explorer has been capable of using IDNs, but it imposes restrictions on displaying non-ASCII domain names based on a user-defined list of allowed languages and provides an anti-phishing filter that checks suspicious Web sites against a remote database of known phishing sites.
On February 17, 2005, Mozilla developers announced that the next software version would still have IDN support enabled but would display Punycode URLs instead, thus thwarting some attacks exploiting similarities between ASCII and non-ASCII characters, while still permitting access to web sites in an IDN domain.
Since then, both Mozilla and Opera have announced that they will be using per-domain whitelists to selectively switch on IDN display for domains run by registries which are taking appropriate precautions against homograph spoofing attacks. As of September 9, 2005, the most recent version of Mozilla Firefox as well as the most recent Internet Explorer display the spoofed PayPal URL as "http://www.xn--pypal-4ve.com/", clearly different from the original.
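A per-TLD whitelist of this kind might look like the sketch below. The whitelist contents and the function name `display_host` are assumptions made for illustration; real browsers shipped their own lists. The conversion uses Python's built-in `idna` codec.

```python
# Hypothetical whitelist of TLDs whose registries restrict confusable characters.
TRUSTED_TLDS = {"museum", "ac"}


def display_host(host: str) -> str:
    """Return the host as it would be shown in the address bar.

    Hosts under a whitelisted TLD keep their Unicode form; every other
    host is converted to its ASCII (Punycode) form.
    """
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in TRUSTED_TLDS:
        return host
    return host.encode("idna").decode("ascii")


# The spoofed name uses Cyrillic а (U+0430) as its second letter.
print(display_host("www.p\u0430ypal.com"))  # www.xn--pypal-4ve.com
```

Plain ASCII hosts are unaffected by the conversion, so only names that contain non-ASCII characters outside the whitelisted TLDs change appearance.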
Safari's approach is to render problematic character sets as Punycode. This can be changed by altering the settings in Mac OS X's system files.
Google Chrome displays an IDN only if all of its characters belong to one (and only one) of the user's preferred languages.
With the advent of internationalized country codes, spoofing will be minimized. For example, the Russian TLD .рф (a domain specifically chosen to avoid resembling a Latin TLD) only accepts Cyrillic names, forbidding the mix with Latin or Greek characters. However, the problem in .com and other gTLDs remains open.
These methods of defense only extend to within a browser. Homographic URLs that house malicious software can still be distributed, without being displayed as Punycode, through e-mail, social networking or other Web sites without being detected until the user actually clicks the link. While the fake link will show in Punycode when it is clicked, by this point the page has already begun loading into the browser and the malicious software may have already been downloaded onto the computer. Television station KBOI-TV raised these concerns when an unknown source (registering under the name "Completely Anonymous") registered a domain name homographic to its own to spread an April Fool's Day joke regarding the Governor of Idaho issuing a supposed ban on the sale of music by Justin Bieber.
Aside from its better known, and more malicious, purposes, homograph spoofing can be used for benign purposes, such as address munging to thwart spam bots.