Unicode input
Unicode input is the insertion of a specific Unicode character on a computer. Unicode characters can be inserted in two ways: from the screen, by means of an applet from which one can select the character, or from the keyboard, by input of the Unicode character. Many systems provide support for Unicode input in some form.
Unicode numbers
Each Unicode character is mapped to a code point, which is represented by a Unicode number. In general, a Unicode number consists of "U+" followed by four to six hexadecimal digits, for example U+00AE or U+1D310. Characters in the Basic Multilingual Plane (BMP), which contains modern scripts (including many Chinese and Japanese characters) and many symbols, have a 4-digit code. Historic scripts, but also many modern symbols and pictographs (such as emoticons and playing cards) and many CJK characters, have 5-digit codes.
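As a small illustration of this notation, the following Python sketch formats a character as a Unicode number and checks whether it lies in the BMP. The function names are hypothetical and chosen only for this example.

```python
def to_unicode_number(ch: str) -> str:
    """Return the U+XXXX notation for a single character.

    The hexadecimal value is padded to at least four digits, which is
    why BMP characters get 4-digit codes and others get five or six.
    """
    return f"U+{ord(ch):04X}"


def in_bmp(ch: str) -> bool:
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(ch) <= 0xFFFF


registered = "\u00ae"        # U+00AE, the registered sign
tetragram = "\U0001D310"     # U+1D310, outside the BMP
```

Here to_unicode_number(registered) gives "U+00AE" and to_unicode_number(tetragram) gives "U+1D310"; only the first is in the BMP.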
Decimal input
In some applications on Microsoft Windows, particularly those using the RichEdit control, decimal Unicode code points (for example, 256 for U+0100) are supported with Alt codes.
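Because Unicode numbers are normally quoted in hexadecimal, decimal input requires a conversion first. A minimal Python sketch of that arithmetic follows; the helper name is hypothetical.

```python
def decimal_code(unicode_number: str) -> int:
    """Convert a Unicode number such as 'U+0100' to its decimal value."""
    if not unicode_number.upper().startswith("U+"):
        raise ValueError("expected a number of the form U+XXXX")
    # Everything after the "U+" prefix is hexadecimal.
    return int(unicode_number[2:], 16)
```

For the example in the text, decimal_code("U+0100") returns 256.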
Unicode in HTML
In HTML, a numeric character reference starts with "&#" in place of "U+" and ends with a semicolon. The number can be either decimal or hexadecimal. Leading zeros may be omitted. If the number is hexadecimal, it is preceded by an "x". Some characters can also be referenced by an entity name.
Example: The HTML code of the copyright sign U+00A9 (©) can be &#169; (decimal input) or &#xA9; (hexadecimal input) or &copy; (entity name).
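The three forms can be generated and decoded with Python's standard html module. This is a minimal sketch: html_references is a hypothetical helper name, and the named-entity lookup uses the HTML 4 table in html.entities, which does not cover every character.

```python
import html
from html.entities import codepoint2name


def html_references(ch: str) -> dict:
    """Return the decimal, hexadecimal and (if any) named reference."""
    cp = ord(ch)
    refs = {
        "decimal": f"&#{cp};",
        "hexadecimal": f"&#x{cp:X};",
    }
    name = codepoint2name.get(cp)  # only HTML 4 entity names are listed
    if name is not None:
        refs["entity"] = f"&{name};"
    return refs


copyright_refs = html_references("\u00a9")
# html.unescape turns any of the three forms back into the character.
decoded = [html.unescape(ref) for ref in copyright_refs.values()]
```

All entries of decoded are the copyright sign itself, and html.unescape also accepts leading zeros, as in "&#x00A9;".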
Availability
For a Unicode character to be displayed, it must be present in the chosen font. The availability of a specific character depends on its presence in the specified font; every font covers its own selection of characters, and any single font lacks most of them. If a desired character is not present in the available fonts, a suitable font should be installed on the system. The character can then be displayed on that system, but if it is a more or less exotic one, it will not be visible on many other systems. An empty box, a question mark or another replacement will be shown instead: ꠩, �.
Older web browsers can only display text supported by the current font associated with the character encoding of the page. Some modern browsers, such as Mozilla Firefox, Opera, Safari and Internet Explorer (version ≥ 7), are able to display multilingual web pages by intelligently choosing a font to display each individual character on the page. They will correctly display any mix of Unicode blocks, as long as appropriate fonts are present in the operating system. After installation of a missing font, the browser will display the character correctly once it has been restarted.
Selection from a screen
Many systems provide a way to select Unicode characters visually. ISO 14755 refers to this as a screen-selection entry method.
Microsoft Windows has provided a Unicode version of the Character Map program since version NT 4.0, appearing in the consumer edition since XP. This is limited to characters in the Basic Multilingual Plane (BMP). Characters are searchable by Unicode character name, and the table can be limited to a particular code block.
Mac OS X provides a "character palette" with much the same functionality, along with searching by related characters, glyph tables in a font, etc. It must first be enabled under System Preferences -> International -> Input Menu (or System Preferences -> Language and Text -> Input Sources).
Equivalent tools, such as gucharmap (GNOME) or kcharselect (KDE), exist on most Linux desktop environments.
Hexadecimal code input
Clause 5.1 of ISO 14755 describes a Basic method whereby a beginning sequence is followed by the hexadecimal representation of the code point and the ending sequence. On some systems, this is limited to the BMP (characters up to U+FFFF).
Different fonts have different glyphs for the same Unicode character, so the appearance of the character will depend on the font which is defined in the web browser or application. Also, not every Unicode character is available in every font.
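The Basic method (beginning sequence, hexadecimal digits, ending sequence) can be modelled as a small state machine. The Python sketch below is an illustration only, not an implementation of ISO 14755. The BEGIN and END tokens are hypothetical placeholders for whatever key combinations a system uses, and discarding the pending digits when a non-hexadecimal key arrives is an assumption.

```python
BEGIN = "<BEGIN>"   # hypothetical token for the beginning sequence
END = "<END>"       # hypothetical token for the ending sequence
HEX_DIGITS = set("0123456789abcdefABCDEF")


def basic_method(keys, bmp_only=False):
    """Turn a stream of key tokens into text.

    Keys typed between BEGIN and END are collected as hexadecimal
    digits and replaced by the character with that code point.
    """
    limit = 0xFFFF if bmp_only else 0x10FFFF
    out = []
    digits = None  # None means "not collecting a code"
    for key in keys:
        if key == BEGIN:
            digits = []
        elif key == END:
            if digits:
                cp = int("".join(digits), 16)
                if cp <= limit:
                    out.append(chr(cp))
            digits = None
        elif digits is None:
            out.append(key)      # ordinary typing
        elif key in HEX_DIGITS:
            digits.append(key)
        else:
            digits = None        # assumption: a non-hex key aborts the entry
    return "".join(out)
```

For example, typing a, BEGIN, 2, 0, a, c, END, b yields "a€b". With bmp_only=True, a code such as 1d310 is rejected, mirroring systems limited to the BMP.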
In Microsoft Windows
In Microsoft Windows, if the registry key HKEY_CURRENT_USER\Control Panel\Input Method\EnableHexNumpad has a string value of "1", a character can be entered by holding down Alt, pressing the "plus" key on the numeric keypad, typing the hex code (using the main letter keys and any of the number keys), and then releasing Alt. (A reboot is required after setting this registry key before this input method starts working.)
The RichEdit control on Microsoft Windows (as used in, for example, WordPad) supports the following input method: one first enters the character's hexadecimal code (between two and six hexadecimal digits), then immediately presses Alt + x. For example, entering f1 and then pressing the combination will produce the character ñ. The code must not be preceded by any digit or letter a-f, as these would be treated as part of the code to be converted. This also works in Microsoft Word 2002/2003 for Windows.
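The Alt + x behaviour described above, including the caveat about preceding digits, can be imitated in a few lines of Python. This is a rough model under stated assumptions, not the actual RichEdit logic. The function name is hypothetical, and how the real control picks among more than six trailing hex digits is not covered by the description, so this sketch simply takes the last six.

```python
import re

# Two to six hexadecimal digits at the very end of the text.
_TRAILING_HEX = re.compile(r"[0-9A-Fa-f]{2,6}\Z")


def alt_x(text_before_cursor: str) -> str:
    """Replace a trailing hex code with the character it denotes."""
    match = _TRAILING_HEX.search(text_before_cursor)
    if match is None:
        return text_before_cursor
    cp = int(match.group(), 16)
    if cp > 0x10FFFF:            # beyond the Unicode code space
        return text_before_cursor
    return text_before_cursor[:match.start()] + chr(cp)
```

alt_x("f1") gives "ñ", while alt_x("1f1") gives U+01F1 instead, because the leading 1 is read as part of the code. A non-hex character such as a space in between avoids this: alt_x("1 f1") gives "1 ñ".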
In Mac OS
In Mac OS X, select "Edit" then "Special Characters..." to open a pane for selecting characters. Select the desired character or enter the Unicode code point in the text field at the bottom. In Mac OS 8.5 and later, one chooses the Unicode Hex Input keyboard layout. Holding down the Option key, one then types the four-digit hexadecimal Unicode code point. On releasing the Option key, the equivalent character will appear.
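A method that accepts only four hexadecimal digits at a time covers the BMP directly. For characters beyond U+FFFF, one possibility is to type the two four-digit UTF-16 surrogate code units in succession. Whether a given Mac OS version accepts this is an assumption here and is not stated above. The Python sketch below only computes the four-digit groups; the function name is hypothetical.

```python
def utf16_hex_units(ch: str) -> list:
    """Return the UTF-16 code units of a character as 4-digit hex strings."""
    data = ch.encode("utf-16-be")  # big-endian, without a byte order mark
    return [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]
```

A BMP character yields one group, for example ["20AC"] for the euro sign, whereas U+1D310 yields the pair ["D834", "DF10"].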
In Linux
The possibility of hexadecimal code input in Linux depends on the system and applications. In the X.Org Server, hex input is still not implemented. KDE/Qt uses the standard Xorg input method (XIM) and therefore does not implement its own solution.
GTK+ is an ISO 14755-conformant system. The beginning sequence is Ctrl+Shift+U and the ending sequence is Enter. Programs based on GTK+, such as GNOME applications, support Unicode input. OpenOffice.org and the web browser Firefox also allow input of Unicode characters.
There are two methods for direct input of Unicode characters on Linux:
- Hold Ctrl+Shift and type u followed by the four digits of the hex code, for example u20ac. Then release all keys. This should give the euro sign (€). (GTK+ versions before 2.10 do not use Ctrl-Shift-U, only Ctrl-Shift-[hex number].) This method does not work in all applications.
- Hold Ctrl+Shift and type u. Then release all keys. Then type the four digits of the hex code and finish with Enter.
In OpenOffice.org and Inkscape, for example, only the second method works.
In the terminal it depends on the actual shell, but usually escape sequences can be used as a workaround.
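Escape syntax differs from shell to shell, so the same idea is illustrated here with Python string literals instead. This is a stand-in chosen for the example, not a shell recipe. Python provides a four-digit escape, an eight-digit escape for code points beyond the BMP, and an escape by character name.

```python
# \u takes exactly four hex digits (BMP only).
euro_short = "\u20ac"

# \U takes exactly eight hex digits and reaches every code point.
tetragram = "\U0001D310"

# \N{...} looks the character up by its Unicode character name.
euro_named = "\N{EURO SIGN}"
```

Both euro_short and euro_named are the euro sign, and ord(tetragram) equals 0x1D310.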
In platform independent applications
- In the Vim editor, the user first types Ctrl-V u, then types in the hexadecimal number of the symbol or character desired, and it will be converted into the symbol. (On Microsoft Windows, Ctrl-Q may be required instead of Ctrl-V.) The capability of Vim to create custom digraphs, as described below, which could be employed on an ad-hoc basis, requires the decimal code point.
- In Emacs, use M-x ucs-insert.
Character mnemonics
RFC 1345 defines a large number (1,893) of suggested mnemonics for code points in Unicode 1.0 (as well as characters in ISO 2DIS 10646 and many other character sets in use at the time of publication). Although the document does not restrict the length of a mnemonic (for example, "10000R" for U+2821), most of the mnemonics (1,338) are digraphs, that is, two characters long, and most of the remainder (416) are trigraphs. While never complete, and targeting obsolescent set definitions, the mnemonics themselves can still be used.
- Vim allows digraph entry in insert mode (the regular mode for typing text) with Ctrl-K followed by a two-keystroke RFC 1345 mnemonic; or, in addition, if the 'digraph' option is set, by entering the first character followed by a backspace followed by the second character. Custom digraphs can also be defined for arbitrary code points. (For example, "dig Gr 9881" associates "Gr" with the character whose decimal code point is 9881, that is, U+2699, ⚙.)
- GNU Emacs allows digraph entry by switching to rfc1345 input mode (by default C-x C-\).
- GNU Screen allows digraph entry with (by default) Ctrl-A Ctrl-V.
- Zsh allows digraph entry using the insert-composed-char widget.
RFC 1345 predates the introduction of the euro sign (€, U+20AC), but the above applications included it as digraph "Eu".
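Since Vim's custom digraph definitions take the decimal code point, a short Python helper can produce the command text for a given character. This is a minimal sketch: the function name is hypothetical, and the output follows the "dig Gr 9881" form quoted above.

```python
def vim_digraph_command(mnemonic: str, ch: str) -> str:
    """Build a Vim digraph definition for a two-character mnemonic."""
    if len(mnemonic) != 2:
        raise ValueError("a digraph mnemonic has exactly two characters")
    # Vim expects the code point in decimal, not hexadecimal.
    return f"dig {mnemonic} {ord(ch)}"
```

For the character U+2699, vim_digraph_command("Gr", "\u2699") returns "dig Gr 9881".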
Specialized tools
There are several tools that allow quick input of Unicode characters in applications. The input method of the free tool ЮNICODE Keyboard Enhancer uses hotkeys: to type a Unicode character, one presses and holds the modifier key and then presses the selected symbol key. Which physical key acts as the modifier key, and which keys act as symbol keys, is defined by the user.
Another free tool is UnicodeIt. It converts LaTeX expressions like \alpha into Unicode. On the Mac, this works in most programs (including Keynote and Mail) using a keyboard shortcut. This tool also has an online version at http://www.unicodeit.net which works on most platforms including smartphones.
There is also a free web tool called Shapecatcher that can be used to find Unicode characters by drawing them.
See also
- Numeric character reference
- Percent encoding
- Alt codes
- Compose key
- Predictive text
External links
- Unicode Code Converter
- Interactive Unicode Converter
- ЮNICODE Keyboard Enhancer - type Unicode characters in (almost) any Unicode-compatible application
- How to enter Unicode characters in Microsoft Windows