C string handling
C string handling refers to a group of functions implementing operations on strings in the C Standard Library. Various operations, such as copying, concatenation, tokenization and searching are supported.

The only support in the C programming language itself for strings is that the compiler will translate a quoted string constant in the source into a null-terminated string stored in static memory.

However, the C standard library provides a large number of functions designed to manipulate these null-terminated strings. These functions are used so widely that they are usually considered part of the definition of C.
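
The compiler's treatment of a quoted string constant can be observed directly. The following is a minimal sketch; the helper names literal_size and literal_last_byte are hypothetical and exist only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: size in bytes of the array the compiler creates
   for the literal "abc". Three visible characters plus the terminating
   null byte. */
size_t literal_size(void)
{
    return sizeof "abc";
}

/* Hypothetical helper: the byte stored after the last visible character. */
char literal_last_byte(void)
{
    const char *p = "abc";
    return p[3];
}
```

Here literal_size() yields 4 rather than 3, because the terminating null byte is part of the stored array.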

Definitions

The term string is used in C to describe a contiguous sequence of characters terminated by and including the first null byte. A common misconception is that a string is an array, because string literals are converted to arrays during the compilation (or translation) phase. It is important to remember that a string ends at the first null byte. An array or string literal that contains a null byte before the last byte therefore contains a string, or possibly several strings, but is not itself a string.
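
The distinction between an array and the strings it contains can be sketched as follows. The array two_strings and the helper names are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* An array initialised from a literal with an embedded null byte.
   It holds 8 bytes: 'f','o','o','\0','b','a','r','\0'. */
static const char two_strings[] = "foo\0bar";

/* Size of the whole array, including both null bytes. */
size_t array_size(void)
{
    return sizeof two_strings;
}

/* Length of the first string: it ends at the first null byte. */
size_t first_length(void)
{
    return strlen(two_strings);
}

/* The second string starts just after the first null byte. */
const char *second_string(void)
{
    return two_strings + 4;
}
```

The array occupies 8 bytes, yet the string that starts at its first byte has length 3; a second string, "bar", starts at offset 4.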

The term pointer to a string is used in C to describe a pointer to the initial (lowest-addressed) byte of a string. As pointers are used to pass a reference to a string to functions in C, documentation (including this page) will often use the term string when the correct term would be pointer to a string.

The term length of a string is used in C to describe the number of bytes preceding the null character. strlen is a standardised function commonly used to determine the length of a string.
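
The definition of length can be written out as a short loop. This is only a sketch of the idea, not how any particular library implements strlen; the name byte_length is hypothetical.

```c
#include <stddef.h>

/* Counts the bytes that precede the first null byte. */
size_t byte_length(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```

Because the count is in bytes, a single character that is encoded as two bytes (for example φ in UTF-8, the bytes 0xCF 0x86) contributes 2 to the length.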

Character encodings

Being null-terminated, C strings require that the 0x00 byte is not used for any character. This allows any ASCII extension to be used.

UTF-8 works well with char strings: using a UTF-8 editor, one can write char foo[512] = "φωωβαρ"; directly in source code.
Truncating strings that contain variable-length characters (UTF-8, Shift JIS, etc.) using functions like strncpy or strncat can produce invalid characters at the end. This can be unsafe if the truncated parts are not fixed before being interpreted by code that assumes the input is valid.
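
One way to avoid leaving half a character at the end is to move the cut point back to a character boundary. The sketch below assumes the input is valid UTF-8; the function name utf8_truncate is hypothetical and not part of any standard library.

```c
#include <stddef.h>
#include <string.h>

/* Shortens s in place to at most max_bytes bytes without cutting a
   multi-byte UTF-8 sequence in half. Assumes s holds valid UTF-8.
   Returns the new length in bytes. */
size_t utf8_truncate(char *s, size_t max_bytes)
{
    size_t len = strlen(s);
    if (len <= max_bytes)
        return len;

    size_t cut = max_bytes;
    /* Continuation bytes have the bit pattern 10xxxxxx. If the byte at
       the cut position is one, the cut falls inside a sequence, so back
       up to that sequence's lead byte. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;

    s[cut] = '\0';
    return cut;
}
```

For the 4-byte string "aφb", a limit of 2 bytes would split φ, so the result is shortened to "a"; a limit of 3 bytes keeps "aφ" intact.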

However, char strings cannot contain UTF-16, since UTF-16 uses 16-bit code units, so that a space, for example, is stored as 0x0020 and therefore contains a zero byte. Equivalent null-terminated strings made of wchar_t can be used instead, ending with a 0-valued wchar_t. C provides a source code syntax for such wide strings, using an L prefix: wchar_t foo[512] = L"φωωβαρ";. The width of wchar_t is implementation-defined (commonly 16 bits on Windows and 32 bits on Unix-like systems), so a wide string is not necessarily UTF-16. There is a set of functions available for wchar_t strings, declared in wchar.h.
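
A wide string behaves like a char string with a larger element type. The following sketch uses wcslen from wchar.h; the helper names wide_length and wide_storage_bytes are hypothetical.

```c
#include <stddef.h>
#include <wchar.h>

/* Number of wchar_t elements before the terminating zero element. */
size_t wide_length(void)
{
    static const wchar_t text[] = L"abc";
    return wcslen(text);
}

/* Storage taken by the literal: 4 elements (3 characters plus the
   terminator), each sizeof(wchar_t) bytes wide. */
size_t wide_storage_bytes(void)
{
    return sizeof L"abc";
}
```

The length is counted in wchar_t elements, not bytes, so wide_length() is 3 regardless of how wide wchar_t is on the platform.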

Overview of functions

Most of the functions that operate on C strings are declared in the string.h header (cstring header in C++). This header contains declarations of functions and types used for handling both C strings and memory buffers; the name is thus something of a misnomer.

Functions declared in string.h are extremely popular since, as a part of the C standard library, they are guaranteed to work on any platform which supports C. However, some security issues exist with these functions, such as buffer overflows, leading programmers to prefer safer, possibly less portable variants. Also, the string functions only work with character encodings made of bytes, such as ASCII and UTF-8. In historical documentation the term "character" was often used instead of "byte", which if followed literally would mean that multi-byte encodings such as UTF-8 were not supported. The BSD documentation has been fixed to make this clear, but POSIX, Linux, and Windows documentation still uses "character" in many places. Functions to handle character encodings made up of code units larger than bytes, such as UTF-16, are generally provided through wchar.h.

Constants and types

Name - Notes
NULL - macro expanding to the null pointer constant; that is, a constant representing a pointer value which is guaranteed not to be a valid address of an object in memory.
size_t - an unsigned integer type which is the type of the result of the sizeof operator. It may be 32 or 64 bits wide, depending on the platform.

Functions

String manipulation
strcpy - copies one string to another
strncpy - writes exactly n bytes to a string, copying from src or adding nulls
strcat - appends one string to another
strncat - appends no more than n bytes from one string to another
strxfrm - transforms a string according to the current locale
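
The "exactly n bytes" behaviour of strncpy and the "no more than n bytes" behaviour of strncat are easy to confuse. The sketch below illustrates both; the helper names terminated_after_strncpy and append_bounded are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Reports whether a 4-byte buffer holds a null byte after
   strncpy(buf, src, 4). Short sources are padded with nulls; sources of
   4 bytes or more fill the buffer and leave it unterminated. */
int terminated_after_strncpy(const char *src)
{
    char buf[4];
    memset(buf, 'x', sizeof buf);      /* make stale contents visible */
    strncpy(buf, src, sizeof buf);     /* writes exactly 4 bytes */
    return memchr(buf, '\0', sizeof buf) != NULL;
}

/* Appends src to the string in dst without letting the result exceed
   dst_size bytes in total. strncat always adds a terminating null, so
   one byte is reserved for it. */
void append_bounded(char *dst, size_t dst_size, const char *src)
{
    size_t used = strlen(dst);
    if (used + 1 < dst_size)
        strncat(dst, src, dst_size - used - 1);
}
```

Some compilers warn about the strncpy call precisely because it may leave the destination unterminated, which is the behaviour being demonstrated.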

String examination
strlen - returns the length of a string
strcmp - compares two strings
strncmp - compares a specific number of bytes in two strings
strcoll - compares two strings according to the current locale
strchr - finds the first occurrence of a byte in a string
strrchr - finds the last occurrence of a byte in a string
strspn - returns the length of the initial part of a string consisting only of bytes in a set of bytes (the offset of the first byte not in the set)
strcspn - returns the length of the initial part of a string consisting only of bytes not in a set of bytes (the offset of the first byte in the set)
strpbrk - finds in a string the first occurrence of a byte in a set of bytes
strstr - finds in a string the first occurrence of a substring
strtok - finds in a string the next occurrence of a token
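
As an illustration of the examination functions, the sketch below picks apart a "key=value" string using strcspn, strchr and strncmp. The entry format and the helper names are assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>

/* Length of the key in a "key=value" string: the number of leading
   bytes that are not '='. Without an '=', this is the whole length. */
size_t key_length(const char *entry)
{
    return strcspn(entry, "=");
}

/* Pointer to the value part, or NULL when the entry has no '='. */
const char *value_part(const char *entry)
{
    const char *eq = strchr(entry, '=');
    return eq ? eq + 1 : NULL;
}

/* Non-zero when s begins with prefix. */
int has_prefix(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}
```

None of these helpers copies or modifies the input; they only return offsets or pointers into the original string.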

Miscellaneous
strerror - returns a string containing an error message derived from the given error code. The strerror function is not guaranteed to be reentrant.

Memory manipulation
memset - fills a buffer with a repeated byte
memcpy - copies one buffer to another; the buffers must not overlap
memmove - copies one buffer to another, possibly overlapping, buffer
memcmp - compares two buffers
memchr - finds the first occurrence of a byte in a buffer
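
The difference between the two copying functions matters whenever source and destination share memory. A minimal sketch, with hypothetical helper names drop_prefix and same_bytes:

```c
#include <stddef.h>
#include <string.h>

/* Removes the first n bytes of the string in buf by sliding the rest,
   including the terminating null, to the front. Source and destination
   overlap, so memmove is used rather than memcpy. */
void drop_prefix(char *buf, size_t n)
{
    size_t len = strlen(buf);
    if (n > len)
        n = len;
    memmove(buf, buf + n, len - n + 1);
}

/* Non-zero when the first n bytes of a and b are identical. */
int same_bytes(const void *a, const void *b, size_t n)
{
    return memcmp(a, b, n) == 0;
}
```

Unlike the string functions, memcmp and memmove take an explicit byte count and do not stop at a null byte.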

Numeric conversions

The C standard library contains several functions for numeric conversions. They are all declared in the stdlib.h header (cstdlib header in C++).
atof - converts a string to a floating-point value
atoi, atol, atoll (C99/C++11) - converts a string to an integer
strtof (C99/C++11), strtod, strtold (C99/C++11) - converts a string to a floating-point value (float, double and long double respectively)
strtol, strtoll - converts a string to a signed integer
strtoul, strtoull - converts a string to an unsigned integer
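
The strto* functions report where parsing stopped and whether the value was out of range, which the ato* functions do not. The sketch below shows one common way to check both with strtol; the wrapper name parse_long and its return convention are assumptions made for this example.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses a base-10 long from text. Returns 1 and stores the value in
   *out on success; returns 0 if no digits were found, if unparsed bytes
   remain, or if the value does not fit in a long. */
int parse_long(const char *text, long *out)
{
    char *end;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (end == text)        /* no digits were consumed */
        return 0;
    if (*end != '\0')       /* unparsed bytes remain */
        return 0;
    if (errno == ERANGE)    /* value did not fit in a long */
        return 0;
    *out = value;
    return 1;
}
```

Note that strtol itself skips leading white space, so this wrapper accepts " 42" as well as "42".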

Popular extensions

memccpy - SVID, POSIX - copies up to a specified number of bytes between two memory areas, which must not overlap, stopping when a given byte is found
mempcpy - GNU - a variant of memcpy returning a pointer to the byte following the last written byte
strcat_s - ISO/IEC WDTR 24731 - a variant of strcat that checks for errors, such as the destination buffer being too small, before copying
strcpy_s - ISO/IEC WDTR 24731 - a variant of strcpy that checks for errors, such as the destination buffer being too small, before copying
strdup - POSIX - allocates and duplicates a string
strerror_r - POSIX 1 - a variant of strerror that is thread-safe
strlcpy - a variant of strcpy that truncates the copied string if the destination buffer is too small
strlcat - a variant of strcat that truncates the appended string if the destination buffer is too small
strsignal - POSIX:2008 - returns a string representation of a signal code; not thread-safe
strtok_r - POSIX - a variant of strtok that is thread-safe
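
The truncating behaviour of strlcpy can be approximated in standard C. The following is a sketch under that assumption, not the BSD implementation; the name copy_truncating is hypothetical and was chosen to avoid clashing with a system-provided strlcpy.

```c
#include <stddef.h>
#include <string.h>

/* Copies src into dst, which has room for dst_size bytes, truncating if
   necessary and always null-terminating when dst_size is non-zero.
   Returns the length of src, so a result >= dst_size means the copy
   was truncated. */
size_t copy_truncating(char *dst, const char *src, size_t dst_size)
{
    size_t src_len = strlen(src);
    if (dst_size > 0) {
        size_t n = src_len < dst_size - 1 ? src_len : dst_size - 1;
        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return src_len;
}
```

A caller can detect truncation by comparing the return value with the buffer size, for example: if (copy_truncating(buf, input, sizeof buf) >= sizeof buf) { /* input did not fit */ }.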

Criticism

strcat_s and strcpy_s have attracted considerable criticism because, even though they are defined in ISO/IEC WDTR 24731, they are currently supported only by Microsoft Visual C++. Some have speculated that the warning messages produced by Microsoft's compilers, which suggest that programmers use these functions instead of the standard ones, are an attempt by Microsoft to lock developers into its platform.

strlcpy and strlcat have been criticised on the grounds that they create more problems than they solve and that they lack documentation. Consequently, they have not been included in the GNU C library on Linux, even though OpenBSD, FreeBSD, Solaris, Mac OS X, and even the Linux kernel itself implement them.

See also

  • C syntax#Strings for source code syntax, including backslash escape sequences.
  • String functions
  • Null-terminated string

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 