Streaming SIMD Extensions
Encyclopedia
In computing
Computing
Computing is usually defined as the activity of using and improving computer hardware and software. It is the computer-specific part of information technology...

, Streaming SIMD Extensions (SSE) is a SIMD
SIMD
Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...

 instruction set
Instruction set
An instruction set, or instruction set architecture , is the part of the computer architecture related to programming, including the native data types, instructions, registers, addressing modes, memory architecture, interrupt and exception handling, and external I/O...

 extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III
Pentium III
The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture introduced on February 26, 1999. The brand's initial processors were very similar to the earlier Pentium II-branded microprocessors...

 series processors as a reply to AMD's 3DNow!
3DNow!
3DNow! is an extension to the x86 instruction set developed by Advanced Micro Devices . It adds single instruction multiple data instructions to the base x86 instruction set, enabling it to perform simple vector processing, which improves the performance of many graphic-intensive applications...

 (which had debuted a year earlier). SSE contains 70 new instructions, most of which work on single precision floating point
Floating point
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16...

 data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing
Digital signal processing
Digital signal processing is concerned with the representation of discrete time signals by a sequence of numbers or symbols and the processing of these signals. Digital signal processing and analog signal processing are subfields of signal processing...

 and graphics processing.

Intel's first IA-32
IA-32
IA-32 , also known as x86-32, i386 or x86, is the CISC instruction-set architecture of Intel's most commercially successful microprocessors, and was first implemented in the Intel 80386 as a 32-bit extension of x86 architecture...

 SIMD effort was the MMX instruction set. MMX had two main problems: it re-used existing floating point
Floating point
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16...

 registers making the CPU
Central processing unit
The central processing unit is the portion of a computer system that carries out the instructions of a computer program, to perform the basic arithmetical, logical, and input/output operations of the system. The CPU plays a role somewhat analogous to the brain in the computer. The term has been in...

 unable to work on both floating point
Floating point
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16...

 and SIMD data at the same time, and it only worked on integers. SSE floating point instructions operate on a new independent register set (the XMM registers), and it adds a few integer instructions that work on MMX registers.

SSE was subsequently expanded by Intel to SSE2
SSE2
SSE2, Streaming SIMD Extensions 2, is one of the Intel SIMD processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2001. It extends the earlier SSE instruction set, and is intended to fully supplant MMX. Intel extended SSE2 to create SSE3...

, SSE3
SSE3
SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions , is the third iteration of the SSE instruction set for the IA-32 architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU...

, SSSE3
SSSE3
Supplemental Streaming SIMD Extensions 3 is a SIMD instruction set created by Intel and is the fourth iteration of the SSE technology.- History :...

, and SSE4
SSE4
SSE4 is a CPU instruction set used in the Intel Core microarchitecture and AMD K10 . It was announced on 27 September 2006 at the Fall 2006 Intel Developer Forum, with vague details in a white paper; more precise details of 47 instructions became available at the Spring 2007 Intel Developer Forum...

. Because it supports floating point math, it had a wider application than MMX and became more popular. The addition of integer support in SSE2 made MMX largely redundant, though further performance increases can be attained in some situations by using MMX in parallel with SSE operations.

SSE was originally known as KNI for Katmai New Instructions (Katmai being the code name for the first Pentium III
Pentium III
The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture introduced on February 26, 1999. The brand's initial processors were very similar to the earlier Pentium II-branded microprocessors...

 core revision). During the Katmai project Intel was looking to distinguish it from their earlier product line, particularly their flagship Pentium II
Pentium II
The Pentium II brand refers to Intel's sixth-generation microarchitecture and x86-compatible microprocessors introduced on May 7, 1997. Containing 7.5 million transistors, the Pentium II featured an improved version of the first P6-generation core of the Pentium Pro, which contained 5.5 million...

. It was later renamed ISSE, for Internet Streaming SIMD Extensions, then SSE. AMD eventually added support for SSE instructions, starting with its Athlon XP and Duron
Duron
The AMD Duron was an x86-compatible microprocessor manufactured by AMD. It was released on June 19, 2000 as a low-cost alternative to AMD's own Athlon processor and the Pentium III and Celeron processor lines from rival Intel...

 (Morgan core) processors.

Registers

SSE originally added eight new 128-bit registers known as XMM0 through XMM7. The AMD64 extensions from AMD (originally called x86-64) added a further eight registers XMM8 through XMM15, and this extension is duplicated in the Intel 64 architecture. There is also a new 32-bit control/status register, MXCSR. The registers XMM8 through XMM15 are accessible only in 64-bit operating mode.
SSE used only a single data type for XMM registers:
  • four 32-bit single-precision floating point numbers


SSE2
SSE2
SSE2, Streaming SIMD Extensions 2, is one of the Intel SIMD processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2001. It extends the earlier SSE instruction set, and is intended to fully supplant MMX. Intel extended SSE2 to create SSE3...

 would later expand the usage of the XMM registers to include:
  • two 64-bit double-precision floating point numbers or
  • two 64-bit integers or
  • four 32-bit integers or
  • eight 16-bit short integers or
  • sixteen 8-bit bytes or characters.


Because these 128-bit registers are additional program states that the operating system
Operating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...

 must preserve across task switches, they are disabled by default until the operating system explicitly enables them. This means that the OS must know how to use the FXSAVE and FXRSTOR instructions, which is the extended pair of instructions which can save all x86 and SSE register states all at once. This support was quickly added to all major IA-32 operating systems.

The first CPU to support SSE, the Pentium III, shared execution resources between SSE and the FPU
Floating point unit
A floating-point unit is a part of a computer system specially designed to carry out operations on floating point numbers. Typical operations are addition, subtraction, multiplication, division, and square root...

. While a compiled application can interleave FPU and SSE instructions side-by-side, the Pentium III will not issue an FPU and an SSE instruction in the same clock-cycle. This limitation reduces the effectiveness of pipelining
Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput ....

, but the separate XMM registers do allow SIMD and scalar floating point operations to be mixed without the performance hit from explicit MMX/floating point mode switching.

Floating point instructions

  • Memory-to-register/register-to-memory/register-to-register data movement
    • Scalar
      Scalar (computing)
      In computing, a scalar variable or field is one that can hold only one value at a time; as opposed to composite variables like array, list, hash, record, etc. In some contexts, a scalar value may be understood to be numeric. A scalar data type is the type of a scalar variable...

      – MOVSS
    • Packed
      Data structure alignment
      Data structure alignment is the way data is arranged and accessed in computer memory. It consists of two separate but related issues: data alignment and data structure padding. When a modern computer reads from or writes to a memory address, it will do this in word sized chunks...

       – MOVAPS, MOVUPS, MOVLPS, MOVHPS, MOVLHPS, MOVHLPS
  • Arithmetic
    • Scalar – ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, MAXSS, MINSS, RSQRTSS
    • Packed – ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, MAXPS, MINPS, RSQRTPS
  • Compare
    • Scalar – CMPSS, COMISS, UCOMISS
    • Packed – CMPPS
  • Data shuffle and unpacking
    • Packed – SHUFPS, UNPCKHPS, UNPCKLPS
  • Data-type conversion
    • Scalar – CVTSI2SS, CVTSS2SI, CVTTSS2SI
    • Packed – CVTPI2PS, CVTPS2PI, CVTTPS2PI
  • Bitwise logical operations
    • Packed – ANDPS, ORPS, XORPS, ANDNPS

Integer instructions

  • Arithmetic
    • PMULHUW, PSADBW, PAVGB, PAVGW, PMAXUB, PMINUB, PMAXSW, PMINSW
  • Data movement
    • PEXTRW, PINSRW
  • Other
    • PMOVMSKB, PSHUFW

Other instructions

  • MXCSR management
    • LDMXCSR, STMXCSR
  • Cache and Memory management
    • MOVNTQ, MOVNTPS, MASKMOVQ, PREFETCH0, PREFETCH1, PREFETCH2, PREFETCHNTA, SFENCE

Example

The following simple example demonstrates the advantage of using SSE. Consider an operation like vector addition, which is used very often in computer graphics applications. To add two single precision, four-component vectors together using x86 requires four floating-point addition instructions

vec_res.x = v1.x + v2.x;

vec_res.y = v1.y + v2.y;

vec_res.z = v1.z + v2.z;

vec_res.w = v1.w + v2.w;

This would correspond to four x86 FADD instructions in the object code. On the other hand, as the following pseudo-code shows, a single 128-bit 'packed-add' instruction can replace the four scalar addition instructions.

movaps xmm0, [v1] ;xmm0 = v1.w | v1.z | v1.y | v1.x

addps xmm0, [v2] ;xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
movaps [vec_res], xmm0

Later versions

  • SSE2
    SSE2
    SSE2, Streaming SIMD Extensions 2, is one of the Intel SIMD processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2001. It extends the earlier SSE instruction set, and is intended to fully supplant MMX. Intel extended SSE2 to create SSE3...

    , introduced with the Pentium 4
    Pentium 4
    Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the...

    , is a major enhancement to SSE. SSE2 adds new math instructions for double-precision (64-bit) floating point and also extends MMX integer instructions to operate on 128-bit XMM registers. Until SSE2, SSE integer instructions introduced with later SSE extensions could still operate on 64-bit MMX registers because the new XMM registers require operating system support. SSE2 enables the programmer to perform SIMD math on any data type (from 8-bit integer to 64-bit float) entirely with the XMM vector-register file, without the need to use the legacy MMX or FPU registers. Many programmers consider SSE2 to be "everything SSE should have been", as SSE2 offers an orthogonal set of instructions
    Orthogonal instruction set
    Orthogonal instruction set is a term used in computer engineering. A computer's instruction set is said to be orthogonal if any instruction can use data of any type via any addressing mode...

     for dealing with common data types.
  • SSE3
    SSE3
    SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions , is the third iteration of the SSE instruction set for the IA-32 architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU...

    , also called Prescott New Instructions (PNI), is an incremental upgrade to SSE2, adding a handful of DSP-oriented mathematics instructions and some process (thread) management instructions.
  • SSSE3
    SSSE3
    Supplemental Streaming SIMD Extensions 3 is a SIMD instruction set created by Intel and is the fourth iteration of the SSE technology.- History :...

     is an incremental upgrade to SSE3, adding 16 new instructions which include permuting the bytes in a word, multiplying 16-bit fixed-point numbers with correct rounding, and within-word accumulate instructions. SSSE3 is often mistaken for SSE4
    SSE4
    SSE4 is a CPU instruction set used in the Intel Core microarchitecture and AMD K10 . It was announced on 27 September 2006 at the Fall 2006 Intel Developer Forum, with vague details in a white paper; more precise details of 47 instructions became available at the Spring 2007 Intel Developer Forum...

     as this term was used during the development of the Core microarchitecture
    Microarchitecture
    In computer engineering, microarchitecture , also called computer organization, is the way a given instruction set architecture is implemented on a processor. A given ISA may be implemented with different microarchitectures. Implementations might vary due to different goals of a given design or...

    .
  • SSE4
    SSE4
    SSE4 is a CPU instruction set used in the Intel Core microarchitecture and AMD K10 . It was announced on 27 September 2006 at the Fall 2006 Intel Developer Forum, with vague details in a white paper; more precise details of 47 instructions became available at the Spring 2007 Intel Developer Forum...

     is another major enhancement, adding a dot product instruction, additional integer instructions, a popcnt instruction, and more.
  • XOP
    XOP instruction set
    The XOP instruction set, announced by AMD on May 1, 2009, is an extension to the 128-bit SSE core instructions in the x86 and AMD64 instruction set for the Bulldozer processor core, which was released on October 12th, 2011....

    , FMA4
    FMA instruction set
    The FMA instruction set is the name of a future extension to the 128-bit SIMD instructions in the X86 microprocessor instruction set to perform fused multiply–add operations...

     and CVT16
    CVT16 instruction set
    The CVT16 instruction set, announced by AMD on May 1, 2009, is an extension to the 128-bit SSE core instructions in the x86 and AMD64 instruction set.CVT16 is a revision of part of the SSE5 instruction set proposal announced on August 30, 2007...

     are new iterations announced by AMD in August 2007 and revised in May 2009.
  • AVX
    Advanced Vector Extensions
    Advanced Vector Extensions is an extension to the x86 instruction set architecture for microprocessors from Intel and AMD proposed by Intel in March 2008 and first supported by Intel with the Westmere processor shipping in Q1 2011 and now by AMD with the Bulldozer processor shipping in Q3 2011.AVX...

     (Advanced Vector Extensions) is an advanced version of SSE announced by Intel featuring a widened data path from 128 bits to 256 bits and 3-operand instructions (up from 2). Intel microprocessors supporting AVX are expected in 2010. AMD microprocessors supporting AVX are expected in 2011.

Software and hardware issues

With all x86 instruction set extensions, it is up to the BIOS
BIOS
In IBM PC compatible computers, the basic input/output system , also known as the System BIOS or ROM BIOS , is a de facto standard defining a firmware interface....

, operating system and application programmer to test and detect their existence and proper operation.
  • Intel and AMD offer applications to detect what extensions your CPU supports.
  • The CPUID
    CPUID
    The CPUID opcode is a processor supplementary instruction for the x86 architecture. It was introduced by Intel in 1993 when it introduced the Pentium and SL-Enhanced 486 processors....

     opcode is a processor supplementary instruction (its name derived from CPU IDentification) for the x86 architecture. It was introduced by Intel in 1993 when it introduced the Pentium and SL-Enhanced 486 processors.


User application uptake of the x86 extensions has been slow with even bare minimum baseline MMX
MMX
MMX is a single instruction, multiple data instruction set designed by Intel, introduced in 1996 with their P5-based Pentium line of microprocessors, designated as "Pentium with MMX Technology". It developed out of a similar unit introduced on the Intel i860, and earlier the Intel i750 video pixel...

 and SSE support (in some cases) not being supported by applications some 10 years after these extensions became commonly available. Distributed computing has accelerated the use of these extensions in the scientific community—and many scientific applications refuse to run unless the CPU supports SSE2
SSE2
SSE2, Streaming SIMD Extensions 2, is one of the Intel SIMD processor supplementary instruction sets first introduced by Intel with the initial version of the Pentium 4 in 2001. It extends the earlier SSE instruction set, and is intended to fully supplant MMX. Intel extended SSE2 to create SSE3...

 or SSE3
SSE3
SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions , is the third iteration of the SSE instruction set for the IA-32 architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU...

.

The use of multiple revisions of an application to cope with the many different sets of extensions available is the simplest way around the x86 extension optimization problem. Software libraries and some applications have begun to support multiple extension types hinting that full use of available x86 instructions may finally become common some 5 to 15 years after the instructions were initially introduced.

Related links

Processor ID applications
  • Intel Processor Identification Utility
  • CPU-Z
    CPU-Z
    CPU-Z is a freeware system profiler application for Microsoft Windows that detects the central processing unit, RAM, motherboard chipset, and other hardware features of a modern personal computer, and presents the information in one window.CPU-Z is more in-depth in almost all areas than the tools...

     CPU, motherboard and memory identifier utility
  • BOINC that identifies AMD and Intel x86 features.
The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK