String intern pool
Encyclopedia
In computer science, string interning is a method of storing only one copy of each distinct string
value, which must be immutable
. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
String interning is supported by some modern object-oriented programming language
s, including Python
, Ruby
(with its symbols), Java
and .NET languages.
The single copy of each string is called its 'intern' and is typically looked up by a method of the string class, for example in Java. All compile-time constant strings in Java are automatically interned using this method.
Lisp, Ruby
, and Smalltalk
are among the languages with a symbol type that are basically interned strings.
Interned strings speed up string comparisons, which are sometimes a performance bottleneck in applications (such as compiler
s and dynamic programming language
runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fills up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test
suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.
String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network
or from storage
. Such strings may include magic numbers
or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory.
Objects other than strings can be interned. For example, in Java, when primitive values are boxed into a wrapper object, certain values (any
Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose name is that string. Symbols often have additional properties that strings do not (such as storage for associated values, or namespacing): the distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.
String (computer science)
In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set or alphabet....
value, which must be immutable
Immutable object
In object-oriented and functional programming, an immutable object is an object whose state cannot be modified after it is created. This is in contrast to a mutable object, which can be modified after it is created...
. Interning strings makes some string processing tasks more time- or space-efficient at the cost of requiring more time when the string is created or interned. The distinct values are stored in a string intern pool.
String interning is supported by some modern object-oriented programming language
Programming language
A programming language is an artificial language designed to communicate instructions to a machine, particularly a computer. Programming languages can be used to create programs that control the behavior of a machine and/or to express algorithms precisely....
s, including Python
Python (programming language)
Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...
, Ruby
Ruby (programming language)
Ruby is a dynamic, reflective, general-purpose object-oriented programming language that combines syntax inspired by Perl with Smalltalk-like features. Ruby originated in Japan during the mid-1990s and was first developed and designed by Yukihiro "Matz" Matsumoto...
(with its symbols), Java
Java (programming language)
Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...
and .NET languages.
The single copy of each string is called its 'intern' and is typically looked up by a method of the string class, for example in Java. All compile-time constant strings in Java are automatically interned using this method.
Lisp, Ruby
Ruby (programming language)
Ruby is a dynamic, reflective, general-purpose object-oriented programming language that combines syntax inspired by Perl with Smalltalk-like features. Ruby originated in Japan during the mid-1990s and was first developed and designed by Yukihiro "Matz" Matsumoto...
, and Smalltalk
Smalltalk
Smalltalk is an object-oriented, dynamically typed, reflective programming language. Smalltalk was created as the language to underpin the "new world" of computing exemplified by "human–computer symbiosis." It was designed and created in part for educational use, more so for constructionist...
are among the languages with a symbol type that are basically interned strings.
Interned strings speed up string comparisons, which are sometimes a performance bottleneck in applications (such as compiler
Compiler
A compiler is a computer program that transforms source code written in a programming language into another computer language...
s and dynamic programming language
Dynamic programming language
Dynamic programming language is a term used broadly in computer science to describe a class of high-level programming languages that execute at runtime many common behaviors that other languages might perform during compilation, if at all...
runtimes) that rely heavily on hash tables with string keys. Without interning, checking that two different strings are equal involves examining every character of both strings. This is slow for several reasons: it is inherently O(n) in the length of the strings; it typically requires reads from several regions of memory, which take time; and the reads fills up the processor cache, meaning there is less cache available for other needs. With interned strings, a simple object identity test
Identity (object-oriented programming)
An identity in object-oriented programming, object-oriented design and object-oriented analysis describes the property of objects that distinguishes them from other objects. This is closely related to the philosophical concept of identity....
suffices after the original intern operation; this is typically implemented as a pointer equality test, normally just a single machine instruction with no memory reference at all.
String interning also reduces memory usage if there are many instances of the same string value; for instance, it is read from a network
Computer network
A computer network, often simply referred to as a network, is a collection of hardware components and computers interconnected by communication channels that allow sharing of resources and information....
or from storage
Computer storage
Computer data storage, often called storage or memory, refers to computer components and recording media that retain digital data. Data storage is one of the core functions and fundamental components of computers....
. Such strings may include magic numbers
Magic number (programming)
In computer programming, the term magic number has multiple meanings. It could refer to one or more of the following:* A constant numerical or text value used to identify a file format or protocol; for files, see List of file signatures...
or network protocol information. For example, XML parsers may intern names of tags and attributes to save memory.
Objects other than strings can be interned. For example, in Java, when primitive values are boxed into a wrapper object, certain values (any
boolean
, any byte
, any char
from 0 to 127, and any short
or int
between -128 and 127) are interned, and any two boxing conversions of one of these values are guaranteed to result in the same object.History
Lisp introduced the notion of interned strings for its symbols. Historically, the data structure used as a string intern pool was called an 'oblist' (when it was implemented as a linked list) or an 'obarray' (when it was implemented as an array).Modern Lisp dialects typically distinguish symbols from strings; interning a given string returns an existing symbol or creates a new one, whose name is that string. Symbols often have additional properties that strings do not (such as storage for associated values, or namespacing): the distinction is also useful to prevent accidentally comparing an interned string with a not-necessarily-interned string, which could lead to intermittent failures depending on usage patterns.