Simultaneous multithreading
Encyclopedia
Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar
CPUs
with hardware multithreading. SMT permits multiple independent thread
s of execution to better utilize the resources provided by modern processor architecture
s.
level of execution in modern superscalar
processors.
Simultaneous multithreading (SMT) is one of the two main implementations of multithreading, the other form being temporal multithreading
. In temporal multithreading, only one thread of instructions can execute in any given pipeline stage at a time. In simultaneous multithreading, instructions from more than one thread can be executing in any given pipeline stage at a time. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads can be decided by the chip designers, but practical restrictions on chip complexity have limited the number to two for most SMT implementations, though there have been as many as 8 threads per core in, for example, the UltraSPARC T2
.
Because the technique is really an efficiency solution and there is inevitable increased conflict on shared resources, measuring or agreeing on the effectiveness of the solution can be difficult. Some researchers have shown that the extra threads can be used to proactively seed a shared resource like a cache, to improve the performance of another single thread, and claim this shows that SMT is not just an efficiency solution. Others use SMT to provide redundant computation, for some level of error detection and recovery.
However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing throughput of computations per amount of hardware used.
technique which tries to increase instruction level parallelism (ILP), the other is multithreading approach exploiting thread level parallelism (TLP).
Superscalar means executing multiple instructions at the same time while chip-level multithreading (CMT) executes instructions from multiple threads within one processor chip at the same time. There are many ways to support more than one thread within a chip, namely:
The key factor to distinguish them is to look at how many instructions the processor can issue in one cycle and how many threads from which the instructions come. For example, Sun Microsystems' UltraSPARC T1
(known as "Niagara" until its November 14, 2005 release) is a multicore processor combined with fine-grain multithreading technique instead of simultaneous multithreading because each core can only issue one instruction at a time.
(EV8). This microprocessor was developed by DEC
in coordination with Dean Tullsen of the University of California, San Diego, and Susan Eggers and Hank Levy of the University of Washington. The microprocessor was never released, since the Alpha line of microprocessors was discontinued shortly before HP
acquired Compaq
which had in turn acquired DEC
. Dean Tullsen's work was also used to develop the "Hyper-threading
" (or "HTT") versions of the Intel Pentium 4 microprocessors, such as the "Northwood" and "Prescott".
was the first modern desktop processor to implement simultaneous multithreading, starting from the 3.06 GHz model released in 2002, and since introduced into a number of their processors. Intel calls the functionality Hyper-Threading Technology
(HTT), and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application dependent, and some programs actually slow down slightly when HTT is turned on due to increased contention for resources such as bandwidth, caches, TLB
s, re-order buffer
entries, etc. This is generally the case for poorly written data access routines that cause high latency intercache transactions (cache thrashing) on multi-processor systems. Programs written before multiprocessor and multicore designs were prevalent commonly did not optimize cache access because on a single CPU system there is only a single cache which is always coherent with itself. On a multiprocessor system each CPU or core will typically have its own cache, which is interlinked with the cache of other CPU/cores in the system to maintain cache coherency. If thread A accesses a memory location [00] and thread B then accesses memory location [01] it can cause an intercache transaction particularly where the cache line fill exceeds 2 bytes, as is the case for all modern processors.
The latest MIPS architecture
designs include an SMT system known as "MIPS MT" . MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI
, a Cupertino-based startup, is the first MIPS vendor to provide a processor SOC
based on 8 cores, each of which runs 4 threads. The threads can be run in fine-grain mode where a different thread can be executed each cycle. The threads can also be assigned priorities.
The IBM
POWER5
, announced in May 2004, comes as either a dual core DCM, or quad-core or oct-core MCM, with each core including a two-thread SMT engine. IBM's implementation is more sophisticated than the previous ones, because it can assign a different priority to the various threads, is more fine-grained, and the SMT engine can be turned on and off dynamically, to better execute those workloads where an SMT processor would not increase performance. This is IBM's second implementation of generally available hardware multithreading. In 2010, IBM released systems based on the POWER7 processor with 8 cores with each having four Simultaneous Intelligent Threads. This switches the threading mode between one thread, two threads or four threads depending on the number of process threads being scheduled at the time. This optimizes the use of the core for minimum response time or maximum throughput.
Although many people reported that Sun Microsystems
' UltraSPARC T1
(known as "Niagara" until its 14 November 2005 release) and the now defunct processor codenamed "Rock
" (originally announced in 2005, but after many delays cancelled in 2009) are implementations of SPARC
focused almost entirely on exploiting SMT and CMP
techniques, Niagara is not actually using SMT. Sun refers to these combined approaches as "CMT", and the overall concept as "Throughput Computing". The Niagara has 8 cores, but each core has only one pipeline, so actually it uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round robin policy to issue instructions from the next active thread each cycle. This makes it more similar to a barrel processor
. Sun Microsystems
' Rock processor
is different, it has more complex cores that have more than one pipeline.
The Intel Atom
, released in 2008, is the first Intel product to feature SMT (marketed as Hyper-Threading) without supporting instruction reordering, speculative execution, or register renaming. Intel reintroduced Hyper-Threading with the Nehalem microarchitecture, after its absence on the Core microarchitecture.
calls for this purpose and for preventing processes with different priority from taking resources from each other.
There is also a security concern with certain simultaneous multithreading implementations. Intel's hyperthreading implementation has a vulnerability through which it is possible for one application to steal a cryptographic key
from another application running in the same processor by monitoring its cache use.
There is also a disadvantage if you want to use a PC for 100% with maximum performance, solving a single problem.
Superscalar
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate...
CPUs
Central processing unit
The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...
with hardware multithreading. SMT permits multiple independent thread
Thread (computer science)
In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...
s of execution to better utilize the resources provided by modern processor architecture
CPU design
CPU design is the design engineering task of creating a central processing unit , a component of computer hardware. It is a subfield of electronics engineering and computer engineering.- Overview :CPU design focuses on these areas:...
s.
Details
Multithreading is similar in concept to preemptive multitasking but is implemented at the threadThread (computer science)
In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...
level of execution in modern superscalar
Superscalar
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate...
processors.
Simultaneous multithreading (SMT) is one of the two main implementations of multithreading, the other form being temporal multithreading
Temporal multithreading
Temporal multithreading is one of the two main forms of multithreading that can be implemented on computer processor hardware, the other being simultaneous multithreading. The distinguishing difference between the two forms is the maximum number of concurrent threads that can execute in any given...
. In temporal multithreading, only one thread of instructions can execute in any given pipeline stage at a time. In simultaneous multithreading, instructions from more than one thread can be executing in any given pipeline stage at a time. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads can be decided by the chip designers, but practical restrictions on chip complexity have limited the number to two for most SMT implementations, though there have been as many as 8 threads per core in, for example, the UltraSPARC T2
UltraSPARC T2
Sun Microsystems' UltraSPARC T2 microprocessor is a multithreading, multi-core CPU. It is a member of the SPARC family, and the successor to the UltraSPARC T1. The chip is sometimes referred to by its codename, Niagara 2...
.
Because the technique is really an efficiency solution and there is inevitable increased conflict on shared resources, measuring or agreeing on the effectiveness of the solution can be difficult. Some researchers have shown that the extra threads can be used to proactively seed a shared resource like a cache, to improve the performance of another single thread, and claim this shows that SMT is not just an efficiency solution. Others use SMT to provide redundant computation, for some level of error detection and recovery.
However, in most current cases, SMT is about hiding memory latency, increasing efficiency, and increasing throughput of computations per amount of hardware used.
Taxonomy
In processor design, there are two ways to increase on-chip parallelism with less resource requirements: one is superscalarSuperscalar
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate...
technique which tries to increase instruction level parallelism (ILP), the other is multithreading approach exploiting thread level parallelism (TLP).
Superscalar means executing multiple instructions at the same time while chip-level multithreading (CMT) executes instructions from multiple threads within one processor chip at the same time. There are many ways to support more than one thread within a chip, namely:
- Interleaved multithreading: Interleaved issue of multiple instructions from different threads, also referred to as Temporal multithreadingTemporal multithreadingTemporal multithreading is one of the two main forms of multithreading that can be implemented on computer processor hardware, the other being simultaneous multithreading. The distinguishing difference between the two forms is the maximum number of concurrent threads that can execute in any given...
. It can be further divided into fine-grain multithreading or coarse-grain multithreading depending on the frequency of interleaved issues. Fine-grain multithreading—such as in a barrel processorBarrel processorA barrel processor is a CPU that switches between threads of execution on every cycle. This CPU design technique is also known as "interleaved" or "fine-grained" temporal multithreading...
-- issues instructions for different threads after every cycle, while coarse-grain multithreading only switches to issue instructions from another thread when the current executing thread causes some long latency events (like page fault etc.). Coarse-grain multithreading is more common for less context switch between threads. For example, Intel's MontecitoMontecito (processor)Montecito is the code-name of a major release of Intel's Itanium 2 Processor Family , which implements the Intel Itanium architecture on a dual-core processor. It was officially launched by Intel on July 18, 2006 as the "Dual-Core Intel Itanium 2 processor"...
processor uses coarse-grain multithreading, while Sun's UltraSPARC T1UltraSPARC T1|right|262px|UltraSPARC T1 processorSun Microsystems' UltraSPARC T1 microprocessor, known until its 14 November 2005 announcement by its development codename "Niagara", is a multithreading, multicore CPU...
uses fine-grain multithreading. For those processors that have only one pipeline per core, interleaved multithreading is the only possible way, because it can issue at most one instruction per cycle. - Simultaneous multithreading (SMT): Issue multiple instructions from multiple threads in one cycle. The processor must be superscalar to do so.
- Chip-level multiprocessing (CMP or multicoreMulti-core (computing)A multi-core processor is a single computing component with two or more independent actual processors , which are the units that read and execute program instructions...
): integrates two or more processors into one chip, each executing threads independently - Any combination of multithreaded/SMT/CMP
The key factor to distinguish them is to look at how many instructions the processor can issue in one cycle and how many threads from which the instructions come. For example, Sun Microsystems' UltraSPARC T1
UltraSPARC T1
|right|262px|UltraSPARC T1 processorSun Microsystems' UltraSPARC T1 microprocessor, known until its 14 November 2005 announcement by its development codename "Niagara", is a multithreading, multicore CPU...
(known as "Niagara" until its November 14, 2005 release) is a multicore processor combined with fine-grain multithreading technique instead of simultaneous multithreading because each core can only issue one instruction at a time.
Historical implementations
While multithreading CPUs have been around since the 1950s, simultaneous multithreading was first researched by IBM in 1968. The first major commercial microprocessor developed with SMT was the Alpha 21464Alpha 21464
The Alpha 21464 is an unfinished microprocessor that implements the Alpha instruction set architecture developed by Digital Equipment Corporation and later by Compaq after it acquired Digital. The microprocessor was also known as EV8 or Araña, the latter being its code-name...
(EV8). This microprocessor was developed by DEC
Digital Equipment Corporation
Digital Equipment Corporation was a major American company in the computer industry and a leading vendor of computer systems, software and peripherals from the 1960s to the 1990s...
in coordination with Dean Tullsen of the University of California, San Diego, and Susan Eggers and Hank Levy of the University of Washington. The microprocessor was never released, since the Alpha line of microprocessors was discontinued shortly before HP
Hewlett-Packard
Hewlett-Packard Company or HP is an American multinational information technology corporation headquartered in Palo Alto, California, USA that provides products, technologies, softwares, solutions and services to consumers, small- and medium-sized businesses and large enterprises, including...
acquired Compaq
Compaq
Compaq Computer Corporation is a personal computer company founded in 1982. Once the largest supplier of personal computing systems in the world, Compaq existed as an independent corporation until 2002, when it was acquired for US$25 billion by Hewlett-Packard....
which had in turn acquired DEC
Digital Equipment Corporation
Digital Equipment Corporation was a major American company in the computer industry and a leading vendor of computer systems, software and peripherals from the 1960s to the 1990s...
. Dean Tullsen's work was also used to develop the "Hyper-threading
Hyper-threading
Hyper-threading is Intel's term for its simultaneous multithreading implementation in its Atom, Intel Core i3/i5/i7, Itanium, Pentium 4 and Xeon CPUs....
" (or "HTT") versions of the Intel Pentium 4 microprocessors, such as the "Northwood" and "Prescott".
Modern commercial implementations
The Intel Pentium 4Pentium 4
Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the...
was the first modern desktop processor to implement simultaneous multithreading, starting from the 3.06 GHz model released in 2002, and since introduced into a number of their processors. Intel calls the functionality Hyper-Threading Technology
Hyper-threading
Hyper-threading is Intel's term for its simultaneous multithreading implementation in its Atom, Intel Core i3/i5/i7, Itanium, Pentium 4 and Xeon CPUs....
(HTT), and provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement compared against an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application dependent, and some programs actually slow down slightly when HTT is turned on due to increased contention for resources such as bandwidth, caches, TLB
Translation Lookaside Buffer
A translation lookaside buffer is a CPU cache that memory management hardware uses to improve virtual address translation speed. All current desktop and server processors use a TLB to map virtual and physical address spaces, and it is ubiquitous in any hardware which utilizes virtual memory.The...
s, re-order buffer
Re-order buffer
A re-order buffer is used in a Tomasulo algorithm for out-of-order instruction execution. It allows instructions to be committed in-order....
entries, etc. This is generally the case for poorly written data access routines that cause high latency intercache transactions (cache thrashing) on multi-processor systems. Programs written before multiprocessor and multicore designs were prevalent commonly did not optimize cache access because on a single CPU system there is only a single cache which is always coherent with itself. On a multiprocessor system each CPU or core will typically have its own cache, which is interlinked with the cache of other CPU/cores in the system to maintain cache coherency. If thread A accesses a memory location [00] and thread B then accesses memory location [01] it can cause an intercache transaction particularly where the cache line fill exceeds 2 bytes, as is the case for all modern processors.
The latest MIPS architecture
MIPS architecture
MIPS is a reduced instruction set computer instruction set architecture developed by MIPS Technologies . The early MIPS architectures were 32-bit, and later versions were 64-bit...
designs include an SMT system known as "MIPS MT" . MIPS MT provides for both heavyweight virtual processing elements and lighter-weight hardware microthreads. RMI
RMI Corporation
RMI Corporation, also known as RMI, formerly known as Raza Microelectronics, Inc., is a privately held Fabless semiconductor company headquartered in Cupertino, California, which specializes in designing System-on-a-chip processors for networking and consumer media applications.- History :RMI was...
, a Cupertino-based startup, is the first MIPS vendor to provide a processor SOC
System-on-a-chip
A system on a chip or system on chip is an integrated circuit that integrates all components of a computer or other electronic system into a single chip. It may contain digital, analog, mixed-signal, and often radio-frequency functions—all on a single chip substrate...
based on 8 cores, each of which runs 4 threads. The threads can be run in fine-grain mode where a different thread can be executed each cycle. The threads can also be assigned priorities.
The IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
POWER5
POWER5
The POWER5 is a microprocessor developed and fabricated by IBM. It is an improved version of the highly successful POWER4. The principal improvements are support for simultaneous multithreading and an on-die memory controller...
, announced in May 2004, comes as either a dual core DCM, or quad-core or oct-core MCM, with each core including a two-thread SMT engine. IBM's implementation is more sophisticated than the previous ones, because it can assign a different priority to the various threads, is more fine-grained, and the SMT engine can be turned on and off dynamically, to better execute those workloads where an SMT processor would not increase performance. This is IBM's second implementation of generally available hardware multithreading. In 2010, IBM released systems based on the POWER7 processor with 8 cores with each having four Simultaneous Intelligent Threads. This switches the threading mode between one thread, two threads or four threads depending on the number of process threads being scheduled at the time. This optimizes the use of the core for minimum response time or maximum throughput.
Although many people reported that Sun Microsystems
Sun Microsystems
Sun Microsystems, Inc. was a company that sold :computers, computer components, :computer software, and :information technology services. Sun was founded on February 24, 1982...
' UltraSPARC T1
UltraSPARC T1
|right|262px|UltraSPARC T1 processorSun Microsystems' UltraSPARC T1 microprocessor, known until its 14 November 2005 announcement by its development codename "Niagara", is a multithreading, multicore CPU...
(known as "Niagara" until its 14 November 2005 release) and the now defunct processor codenamed "Rock
Rock processor
Rock was a multithreading, multicore, SPARC microprocessor developed at Sun Microsystems. Now canceled, it was a separate development from the CoolThreads/Niagara family of processors....
" (originally announced in 2005, but after many delays cancelled in 2009) are implementations of SPARC
SPARC
SPARC is a RISC instruction set architecture developed by Sun Microsystems and introduced in mid-1987....
focused almost entirely on exploiting SMT and CMP
Multi-core (computing)
A multi-core processor is a single computing component with two or more independent actual processors , which are the units that read and execute program instructions...
techniques, Niagara is not actually using SMT. Sun refers to these combined approaches as "CMT", and the overall concept as "Throughput Computing". The Niagara has 8 cores, but each core has only one pipeline, so actually it uses fine-grained multithreading. Unlike SMT, where instructions from multiple threads share the issue window each cycle, the processor uses a round robin policy to issue instructions from the next active thread each cycle. This makes it more similar to a barrel processor
Barrel processor
A barrel processor is a CPU that switches between threads of execution on every cycle. This CPU design technique is also known as "interleaved" or "fine-grained" temporal multithreading...
. Sun Microsystems
Sun Microsystems
Sun Microsystems, Inc. was a company that sold :computers, computer components, :computer software, and :information technology services. Sun was founded on February 24, 1982...
' Rock processor
Rock processor
Rock was a multithreading, multicore, SPARC microprocessor developed at Sun Microsystems. Now canceled, it was a separate development from the CoolThreads/Niagara family of processors....
is different, it has more complex cores that have more than one pipeline.
The Intel Atom
Intel Atom
Intel Atom is the brand name for a line of ultra-low-voltage x86 and x86-64 CPUs from Intel, designed in 45 nm CMOS and used mainly in netbooks, nettops, embedded application ranging from health care to advanced robotics and Mobile Internet devices...
, released in 2008, is the first Intel product to feature SMT (marketed as Hyper-Threading) without supporting instruction reordering, speculative execution, or register renaming. Intel reintroduced Hyper-Threading with the Nehalem microarchitecture, after its absence on the Core microarchitecture.
Disadvantages
Simultaneous multithreading cannot improve performance if any of the shared resources are limiting bottlenecks for the performance. In fact, some applications run slower when simultaneous multithreading is enabled. Critics argue that it is a considerable burden to put on software developers that they have to test whether simultaneous multithreading is good or bad for their application in various situations and insert extra logic to turn it off if it decreases performance. Current operating systems lack convenient APIApplication programming interface
An application programming interface is a source code based specification intended to be used as an interface by software components to communicate with each other...
calls for this purpose and for preventing processes with different priority from taking resources from each other.
There is also a security concern with certain simultaneous multithreading implementations. Intel's hyperthreading implementation has a vulnerability through which it is possible for one application to steal a cryptographic key
Key (cryptography)
In cryptography, a key is a piece of information that determines the functional output of a cryptographic algorithm or cipher. Without a key, the algorithm would produce no useful result. In encryption, a key specifies the particular transformation of plaintext into ciphertext, or vice versa...
from another application running in the same processor by monitoring its cache use.
There is also a disadvantage if you want to use a PC for 100% with maximum performance, solving a single problem.
See also
- Temporal multithreadingTemporal multithreadingTemporal multithreading is one of the two main forms of multithreading that can be implemented on computer processor hardware, the other being simultaneous multithreading. The distinguishing difference between the two forms is the maximum number of concurrent threads that can execute in any given...
, another implementation of hardware multithreading - Thread (computer science)Thread (computer science)In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...
, the fundamental software entity scheduled by the operating system kernel to execute on a CPU or processor (core) - Symmetric multiprocessingSymmetric multiprocessingIn computing, symmetric multiprocessing involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared main memory and are controlled by a single OS instance. Most common multiprocessor systems today use an SMP architecture...
, where the system (or partition of a larger computer hardware platform) contains more than one CPU or processor (core) and where the operating system kernel is not limited to which of the available CPUs (cores) a given thread can be scheduled to execute on