C character classification
C character classification is an operation provided by a group of functions in the ANSI C Standard Library for the C programming language. These functions test characters for membership in a particular class of characters, such as alphabetic characters or control characters. Both single-byte and wide characters are supported.

History

Early toolsmiths writing in C under Unix began developing idioms at a rapid rate to classify characters into different types. For example, in the ASCII character set, the following test identifies a letter:

if ('A' <= c && c <= 'Z' || 'a' <= c && c <= 'z')

However, this idiom does not necessarily work for other character sets such as EBCDIC, in which the letters do not occupy one contiguous range of codes.

Programs soon became thick with tests such as the one above or, worse, tests almost like it. A programmer can write the same idiom in several different ways, which slows comprehension and increases the chance of errors.

Before long, the idioms were replaced by the functions in <ctype.h>.
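
As a rough illustration of that replacement, the hand-written range test can be swapped for isalpha from ctype.h. The helper name count_letters is hypothetical and exists only for this sketch. The cast to unsigned char is there because passing a negative char value to these functions is undefined.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the alphabetic characters in a string. */
size_t count_letters(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        /* isalpha stands in for the hand-written range test on c */
        if (isalpha((unsigned char)*s))
            n++;
    }
    return n;
}
```

Because the test is delegated to the library, the same source should behave sensibly on a platform whose character set is not ASCII.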

Implementation

Unlike the example above, the character classification routines are usually not written as comparison tests. In most C libraries, they are implemented as lookups in a static table instead.

For example, an array of 256 eight-bit integers is created and each integer is treated as a set of bit flags, where each bit corresponds to a particular property of the character, e.g., isdigit or isalpha. If the lowest-order bit of each integer corresponds to the isdigit property, the code could be written thus:

#define isdigit(x) (TABLE[x] & 1)
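
A fuller sketch of the table-based approach might look like the following. The names ct_table, CT_DIGIT, CT_XDIGIT, CT_SPACE and the ct_ macros are hypothetical, the bit layout is an arbitrary choice, and only a few properties are filled in. Real libraries use their own tables, which may also depend on the locale.

```c
#include <limits.h>

/* Hypothetical property bits; real libraries choose their own layout. */
enum { CT_DIGIT = 1 << 0, CT_XDIGIT = 1 << 1, CT_SPACE = 1 << 2 };

/* One entry per unsigned char value; entries not listed are zero. */
const unsigned char ct_table[UCHAR_MAX + 1] = {
    ['0'] = CT_DIGIT | CT_XDIGIT, ['1'] = CT_DIGIT | CT_XDIGIT,
    ['2'] = CT_DIGIT | CT_XDIGIT, ['3'] = CT_DIGIT | CT_XDIGIT,
    ['4'] = CT_DIGIT | CT_XDIGIT, ['5'] = CT_DIGIT | CT_XDIGIT,
    ['6'] = CT_DIGIT | CT_XDIGIT, ['7'] = CT_DIGIT | CT_XDIGIT,
    ['8'] = CT_DIGIT | CT_XDIGIT, ['9'] = CT_DIGIT | CT_XDIGIT,
    ['a'] = CT_XDIGIT, ['b'] = CT_XDIGIT, ['c'] = CT_XDIGIT,
    ['d'] = CT_XDIGIT, ['e'] = CT_XDIGIT, ['f'] = CT_XDIGIT,
    ['A'] = CT_XDIGIT, ['B'] = CT_XDIGIT, ['C'] = CT_XDIGIT,
    ['D'] = CT_XDIGIT, ['E'] = CT_XDIGIT, ['F'] = CT_XDIGIT,
    [' '] = CT_SPACE, ['\t'] = CT_SPACE, ['\n'] = CT_SPACE,
    ['\v'] = CT_SPACE, ['\f'] = CT_SPACE, ['\r'] = CT_SPACE
};

/* Each macro uses its argument exactly once, so it is evaluated once. */
#define ct_isdigit(c)  (ct_table[(unsigned char)(c)] & CT_DIGIT)
#define ct_isxdigit(c) (ct_table[(unsigned char)(c)] & CT_XDIGIT)
#define ct_isspace(c)  (ct_table[(unsigned char)(c)] & CT_SPACE)
```

Each macro yields a nonzero value (not necessarily 1) when the property holds, which matches the convention of the standard functions.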

Early versions of Linux used a potentially faulty method similar to the first code sample:

#define isdigit(x) ((x) >= '0' && (x) <= '9')

This can cause problems if x has a side effect, for instance if one calls isdigit(x++) or isdigit(run_some_program()). It would not be immediately evident that the argument to isdigit may be evaluated twice. For this reason, the table-based approach is generally used.
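
The double evaluation can be made visible with a small experiment. In this sketch, RANGE_ISDIGIT is a hypothetical copy of the range-test macro above, and next_char and range_isdigit_via_reader are made-up helpers that count how often the macro argument runs.

```c
/* Hypothetical range-test macro, in the style of the sample above. */
#define RANGE_ISDIGIT(x) ((x) >= '0' && (x) <= '9')

static int calls; /* number of times next_char() has run */

/* Reads one character and advances the pointer: a side effect. */
int next_char(const char **p)
{
    calls++;
    return (unsigned char)*(*p)++;
}

/* Applies the macro to next_char() and reports how many times the
   argument expression was evaluated. */
int range_isdigit_via_reader(const char *text, int *evaluations)
{
    const char *p = text;
    int result;

    calls = 0;
    result = RANGE_ISDIGIT(next_char(&p));
    *evaluations = calls;
    return result;
}
```

For the input "7x", the macro compares '7' with the lower bound and then 'x' with the upper bound, so it consumes two characters and reports a non-digit even though the first character is a digit.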

The difference between these two methods became a point of interest during the SCO v. IBM case.

Overview of functions

The functions that operate on single-byte characters are declared in the ctype.h header (the cctype header in C++).
The functions that operate on wide characters are declared in the wctype.h header (the cwctype header in C++).

The classification is done according to the current locale.
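
The locale dependence can be sketched as follows. The helper is_alpha_in is hypothetical. It switches the LC_CTYPE category with setlocale and does not restore the previous locale, which a real program would probably want to do. Which locale names are available other than "C" depends on the system.

```c
#include <ctype.h>
#include <locale.h>

/* Hypothetical helper: returns 1 if byte value c is alphabetic under the
   named locale, 0 if it is not, and -1 if the locale is unavailable. */
int is_alpha_in(const char *locale_name, int c)
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;
    return isalpha((unsigned char)c) != 0;
}
```

In the "C" locale only the 52 basic letters are alphabetic, so is_alpha_in("C", 0xE9) yields 0. Under a single-byte locale in which 0xE9 encodes an accented letter, the same call may yield 1.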

Single-byte character manipulation

isalnum - checks if a character is alphanumeric
isalpha - checks if a character is alphabetic
islower - checks if a character is lowercase
isupper - checks if a character is an uppercase character
isdigit - checks if a character is a digit
isxdigit - checks if a character is a hexadecimal character
iscntrl - checks if a character is a control character
isgraph - checks if a character is a graphical character
isspace - checks if a character is a space character
isblank - (C99/C++11) - checks if a character is a blank character
isprint - checks if a character is a printing character
ispunct - checks if a character is a punctuation character
tolower - converts a character to lowercase
toupper - converts a character to uppercase
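
A minimal usage sketch for the conversion functions in this group follows. The helper name upcase is hypothetical. The argument is cast to unsigned char before the call, since these functions are only defined for values representable as unsigned char and for EOF.

```c
#include <ctype.h>

/* Hypothetical helper: converts a string to uppercase in place. */
void upcase(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)toupper((unsigned char)*s);
}
```

Characters that have no uppercase form, such as digits and punctuation, are returned unchanged by toupper.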

Wide character manipulation

iswalnum - checks if a wide character is alphanumeric
iswalpha - checks if a wide character is alphabetic
iswlower - checks if a wide character is lowercase
iswupper - checks if a wide character is an uppercase character
iswdigit - checks if a wide character is a digit
iswxdigit - checks if a wide character is a hexadecimal character
iswcntrl - checks if a wide character is a control character
iswgraph - checks if a wide character is a graphical character
iswspace - checks if a wide character is a space character
iswblank - (C99/C++11) - checks if a wide character is a blank character
iswprint - checks if a wide character is a printing character
iswpunct - checks if a wide character is a punctuation character
towlower - converts a wide character to lowercase
towupper - converts a wide character to uppercase
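
The wide-character functions are used in the same way on wchar_t strings. In this sketch the helper name count_wide_letters is hypothetical.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: counts the alphabetic wide characters in a string. */
size_t count_wide_letters(const wchar_t *s)
{
    size_t n = 0;
    for (; *s != L'\0'; s++) {
        if (iswalpha((wint_t)*s))
            n++;
    }
    return n;
}
```

Which characters beyond the basic letters count as alphabetic depends on the current locale.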

Custom wide character manipulation

iswctype - checks if a wide character falls into specific class
towctrans - converts a wide character using a specific mapping
wctype - returns a character class to be used with iswctype
wctrans - returns a transformation mapping to be used with towctrans
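
These four functions work in pairs: wctype and wctrans look up a descriptor by name, and iswctype and towctrans apply it. The sketch below uses the standard names "digit" and "toupper". The helpers in_class and map_char are hypothetical wrappers written for this example.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: returns 1 if wc belongs to the named class,
   0 if it does not, and -1 if the class name is not recognized. */
int in_class(wint_t wc, const char *class_name)
{
    wctype_t desc = wctype(class_name);
    if (desc == 0)
        return -1;
    return iswctype(wc, desc) != 0;
}

/* Hypothetical helper: applies the named mapping to wc, or returns wc
   unchanged if the mapping name is not recognized. */
wint_t map_char(wint_t wc, const char *mapping_name)
{
    wctrans_t desc = wctrans(mapping_name);
    if (desc == 0)
        return wc;
    return towctrans(wc, desc);
}
```

A locale may define class and mapping names beyond the standard ones, and that is the main reason to go through a name lookup instead of calling iswdigit or towupper directly.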
The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 