Cell microprocessor
Encyclopedia
Cell is a microprocessor
architecture jointly developed by Sony
, Sony Computer Entertainment
, Toshiba
, and IBM
, an alliance known as "STI". The architectural design and first implementation were carried out at the STI Design Center in Austin, Texas
over a four-year period beginning March 2001 on a budget reported by Sony as approaching US$400 million. Cell is shorthand for Cell Broadband Engine Architecture, commonly abbreviated CBEA in full or Cell BE in part. Cell combines a general-purpose Power Architecture
core
of modest performance with streamlined coprocessing
elements which greatly accelerate multimedia
and vector processing applications, as well as many other forms of dedicated computation.
The first major commercial application of Cell was in Sony's PlayStation 3
game console
. Mercury Computer Systems
has a dual Cell server, a dual Cell blade
configuration, a rugged computer, and a PCI Express accelerator board available in different stages of production. Toshiba had announced plans to incorporate Cell in high definition
television sets, but have seemed to abandon the idea. Exotic features such as the XDR
memory subsystem and coherent Element Interconnect Bus (EIB) interconnect appear to position Cell for future applications in the supercomputing space to exploit the Cell processor's prowess in floating point
kernels.
The Cell architecture includes a memory coherence
architecture that emphasizes efficiency/watt, prioritizes bandwidth
over latency
, and favors peak computational throughput
over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development
. IBM provides a comprehensive Linux
-based Cell development platform to assist developers in confronting these challenges. Software adoption remains a key issue in whether Cell ultimately delivers on its performance potential. Despite those challenges, research has indicated that Cell excels at several types of scientific computation.
Microprocessor
A microprocessor incorporates the functions of a computer's central processing unit on a single integrated circuit, or at most a few integrated circuits. It is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and...
architecture jointly developed by Sony
Sony
, commonly referred to as Sony, is a Japanese multinational conglomerate corporation headquartered in Minato, Tokyo, Japan and the world's fifth largest media conglomerate measured by revenues....
, Sony Computer Entertainment
Sony Computer Entertainment
Sony Computer Entertainment, Inc. is a major video game company specializing in a variety of areas in the video game industry, and is a wholly owned subsidiary and part of the Consumer Products & Services Group of Sony...
, Toshiba
Toshiba
is a multinational electronics and electrical equipment corporation headquartered in Tokyo, Japan. It is a diversified manufacturer and marketer of electrical products, spanning information & communications equipment and systems, Internet-based solutions and services, electronic components and...
, and IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
, an alliance known as "STI". The architectural design and first implementation were carried out at the STI Design Center in Austin, Texas
Austin, Texas
Austin is the capital city of the U.S. state of :Texas and the seat of Travis County. Located in Central Texas on the eastern edge of the American Southwest, it is the fourth-largest city in Texas and the 14th most populous city in the United States. It was the third-fastest-growing large city in...
over a four-year period beginning March 2001 on a budget reported by Sony as approaching US$400 million. Cell is shorthand for Cell Broadband Engine Architecture, commonly abbreviated CBEA in full or Cell BE in part. Cell combines a general-purpose Power Architecture
Power Architecture
Power Architecture is a broad term to describe similar RISC instruction sets for microprocessors developed and manufactured by such companies as IBM, Freescale, AMCC, Tundra and P.A. Semi...
core
Multi-core (computing)
A multi-core processor is a single computing component with two or more independent actual processors , which are the units that read and execute program instructions...
of modest performance with streamlined coprocessing
Coprocessor
A coprocessor is a computer processor used to supplement the functions of the primary processor . Operations performed by the coprocessor may be floating point arithmetic, graphics, signal processing, string processing, or encryption. By offloading processor-intensive tasks from the main processor,...
elements which greatly accelerate multimedia
Multimedia
Multimedia is media and content that uses a combination of different content forms. The term can be used as a noun or as an adjective describing a medium as having multiple content forms. The term is used in contrast to media which use only rudimentary computer display such as text-only, or...
and vector processing applications, as well as many other forms of dedicated computation.
The first major commercial application of Cell was in Sony's PlayStation 3
PlayStation 3
The is the third home video game console produced by Sony Computer Entertainment and the successor to the PlayStation 2 as part of the PlayStation series. The PlayStation 3 competes with Microsoft's Xbox 360 and Nintendo's Wii as part of the seventh generation of video game consoles...
game console
Video game console
A video game console is an interactive entertainment computer or customized computer system that produces a video display signal which can be used with a display device to display a video game...
. Mercury Computer Systems
Mercury Computer Systems
Mercury Computer Systems, Inc. provides high-performance embedded, real-time digital signal and image processing solutions.Mercury designs and builds embedded multicomputers, which may be considered to be either loosely coupled NUMA computers or tightly coupled clusters. Despite being marketed as...
has a dual Cell server, a dual Cell blade
Blade server
A blade server is a stripped down server computer with a modular design optimized to minimize the use of physical space and energy. Whereas a standard rack-mount server can function with a power cord and network cable, blade servers have many components removed to save space, minimize power...
configuration, a rugged computer, and a PCI Express accelerator board available in different stages of production. Toshiba had announced plans to incorporate Cell in high definition
High-definition television
High-definition television is video that has resolution substantially higher than that of traditional television systems . HDTV has one or two million pixels per frame, roughly five times that of SD...
television sets, but have seemed to abandon the idea. Exotic features such as the XDR
XDR DRAM
XDR DRAM or extreme data rate dynamic random access memory is a high-performance RAM interface and successor to the Rambus RDRAM it is based on, competing with the rival DDR2 SDRAM and GDDR4 technology. XDR was designed to be effective in small, high-bandwidth consumer systems, high-performance...
memory subsystem and coherent Element Interconnect Bus (EIB) interconnect appear to position Cell for future applications in the supercomputing space to exploit the Cell processor's prowess in floating point
Floating point
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16...
kernels.
The Cell architecture includes a memory coherence
Memory coherence
Memory coherence is an issue that affects the design of computer systems in which two or more processors or cores share a common area of memory....
architecture that emphasizes efficiency/watt, prioritizes bandwidth
Bandwidth (computing)
In computer networking and computer science, bandwidth, network bandwidth, data bandwidth, or digital bandwidth is a measure of available or consumed data communication resources expressed in bits/second or multiples of it .Note that in textbooks on wireless communications, modem data transmission,...
over latency
Latency (engineering)
Latency is a measure of time delay experienced in a system, the precise definition of which depends on the system and the time being measured. Latencies may have different meaning in different contexts.-Packet-switched networks:...
, and favors peak computational throughput
Throughput
In communication networks, such as Ethernet or packet radio, throughput or network throughput is the average rate of successful message delivery over a communication channel. This data may be delivered over a physical or logical link, or pass through a certain network node...
over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development
Software engineering
Software Engineering is the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software, and the study of these approaches; that is, the application of engineering to software...
. IBM provides a comprehensive Linux
Linux
Linux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution. The defining component of any Linux system is the Linux kernel, an operating system kernel first released October 5, 1991 by Linus Torvalds...
-based Cell development platform to assist developers in confronting these challenges. Software adoption remains a key issue in whether Cell ultimately delivers on its performance potential. Despite those challenges, research has indicated that Cell excels at several types of scientific computation.
Cell is a microprocessor
Microprocessor
A microprocessor incorporates the functions of a computer's central processing unit on a single integrated circuit, or at most a few integrated circuits. It is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and...
architecture jointly developed by Sony
Sony
, commonly referred to as Sony, is a Japanese multinational conglomerate corporation headquartered in Minato, Tokyo, Japan and the world's fifth largest media conglomerate measured by revenues....
, Sony Computer Entertainment
Sony Computer Entertainment
Sony Computer Entertainment, Inc. is a major video game company specializing in a variety of areas in the video game industry, and is a wholly owned subsidiary and part of the Consumer Products & Services Group of Sony...
, Toshiba
Toshiba
is a multinational electronics and electrical equipment corporation headquartered in Tokyo, Japan. It is a diversified manufacturer and marketer of electrical products, spanning information & communications equipment and systems, Internet-based solutions and services, electronic components and...
, and IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
, an alliance known as "STI". The architectural design and first implementation were carried out at the STI Design Center in Austin, Texas
Austin, Texas
Austin is the capital city of the U.S. state of :Texas and the seat of Travis County. Located in Central Texas on the eastern edge of the American Southwest, it is the fourth-largest city in Texas and the 14th most populous city in the United States. It was the third-fastest-growing large city in...
over a four-year period beginning March 2001 on a budget reported by Sony as approaching US$400 million. Cell is shorthand for Cell Broadband Engine Architecture, commonly abbreviated CBEA in full or Cell BE in part. Cell combines a general-purpose Power Architecture
Power Architecture
Power Architecture is a broad term to describe similar RISC instruction sets for microprocessors developed and manufactured by such companies as IBM, Freescale, AMCC, Tundra and P.A. Semi...
core
Multi-core (computing)
A multi-core processor is a single computing component with two or more independent actual processors , which are the units that read and execute program instructions...
of modest performance with streamlined coprocessing
Coprocessor
A coprocessor is a computer processor used to supplement the functions of the primary processor . Operations performed by the coprocessor may be floating point arithmetic, graphics, signal processing, string processing, or encryption. By offloading processor-intensive tasks from the main processor,...
elements which greatly accelerate multimedia
Multimedia
Multimedia is media and content that uses a combination of different content forms. The term can be used as a noun or as an adjective describing a medium as having multiple content forms. The term is used in contrast to media which use only rudimentary computer display such as text-only, or...
and vector processing applications, as well as many other forms of dedicated computation.
The first major commercial application of Cell was in Sony's PlayStation 3
PlayStation 3
The is the third home video game console produced by Sony Computer Entertainment and the successor to the PlayStation 2 as part of the PlayStation series. The PlayStation 3 competes with Microsoft's Xbox 360 and Nintendo's Wii as part of the seventh generation of video game consoles...
game console
Video game console
A video game console is an interactive entertainment computer or customized computer system that produces a video display signal which can be used with a display device to display a video game...
. Mercury Computer Systems
Mercury Computer Systems
Mercury Computer Systems, Inc. provides high-performance embedded, real-time digital signal and image processing solutions.Mercury designs and builds embedded multicomputers, which may be considered to be either loosely coupled NUMA computers or tightly coupled clusters. Despite being marketed as...
has a dual Cell server, a dual Cell blade
Blade server
A blade server is a stripped down server computer with a modular design optimized to minimize the use of physical space and energy. Whereas a standard rack-mount server can function with a power cord and network cable, blade servers have many components removed to save space, minimize power...
configuration, a rugged computer, and a PCI Express accelerator board available in different stages of production. Toshiba had announced plans to incorporate Cell in high definition
High-definition television
High-definition television is video that has resolution substantially higher than that of traditional television systems . HDTV has one or two million pixels per frame, roughly five times that of SD...
television sets, but have seemed to abandon the idea. Exotic features such as the XDR
XDR DRAM
XDR DRAM or extreme data rate dynamic random access memory is a high-performance RAM interface and successor to the Rambus RDRAM it is based on, competing with the rival DDR2 SDRAM and GDDR4 technology. XDR was designed to be effective in small, high-bandwidth consumer systems, high-performance...
memory subsystem and coherent Element Interconnect Bus (EIB) interconnect appear to position Cell for future applications in the supercomputing space to exploit the Cell processor's prowess in floating point
Floating point
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16...
kernels.
The Cell architecture includes a memory coherence
Memory coherence
Memory coherence is an issue that affects the design of computer systems in which two or more processors or cores share a common area of memory....
architecture that emphasizes efficiency/watt, prioritizes bandwidth
Bandwidth (computing)
In computer networking and computer science, bandwidth, network bandwidth, data bandwidth, or digital bandwidth is a measure of available or consumed data communication resources expressed in bits/second or multiples of it .Note that in textbooks on wireless communications, modem data transmission,...
over latency
Latency (engineering)
Latency is a measure of time delay experienced in a system, the precise definition of which depends on the system and the time being measured. Latencies may have different meaning in different contexts.-Packet-switched networks:...
, and favors peak computational throughput
Throughput
In communication networks, such as Ethernet or packet radio, throughput or network throughput is the average rate of successful message delivery over a communication channel. This data may be delivered over a physical or logical link, or pass through a certain network node...
over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development
Software engineering
Software Engineering is the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software, and the study of these approaches; that is, the application of engineering to software...
. IBM provides a comprehensive Linux
Linux
Linux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution. The defining component of any Linux system is the Linux kernel, an operating system kernel first released October 5, 1991 by Linus Torvalds...
-based Cell development platform to assist developers in confronting these challenges. Software adoption remains a key issue in whether Cell ultimately delivers on its performance potential. Despite those challenges, research has indicated that Cell excels at several types of scientific computation.
Cell is a microprocessor
Microprocessor
A microprocessor incorporates the functions of a computer's central processing unit on a single integrated circuit, or at most a few integrated circuits. It is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and...
architecture jointly developed by Sony
Sony
, commonly referred to as Sony, is a Japanese multinational conglomerate corporation headquartered in Minato, Tokyo, Japan and the world's fifth largest media conglomerate measured by revenues....
, Sony Computer Entertainment
Sony Computer Entertainment
Sony Computer Entertainment, Inc. is a major video game company specializing in a variety of areas in the video game industry, and is a wholly owned subsidiary and part of the Consumer Products & Services Group of Sony...
, Toshiba
Toshiba
is a multinational electronics and electrical equipment corporation headquartered in Tokyo, Japan. It is a diversified manufacturer and marketer of electrical products, spanning information & communications equipment and systems, Internet-based solutions and services, electronic components and...
, and IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
, an alliance known as "STI". The architectural design and first implementation were carried out at the STI Design Center in Austin, Texas
Austin, Texas
Austin is the capital city of the U.S. state of :Texas and the seat of Travis County. Located in Central Texas on the eastern edge of the American Southwest, it is the fourth-largest city in Texas and the 14th most populous city in the United States. It was the third-fastest-growing large city in...
over a four-year period beginning March 2001 on a budget reported by Sony as approaching US$400 million. Cell is shorthand for Cell Broadband Engine Architecture, commonly abbreviated CBEA in full or Cell BE in part. Cell combines a general-purpose Power Architecture
Power Architecture
Power Architecture is a broad term to describe similar RISC instruction sets for microprocessors developed and manufactured by such companies as IBM, Freescale, AMCC, Tundra and P.A. Semi...
core
Multi-core (computing)
A multi-core processor is a single computing component with two or more independent actual processors , which are the units that read and execute program instructions...
of modest performance with streamlined coprocessing
Coprocessor
A coprocessor is a computer processor used to supplement the functions of the primary processor . Operations performed by the coprocessor may be floating point arithmetic, graphics, signal processing, string processing, or encryption. By offloading processor-intensive tasks from the main processor,...
elements which greatly accelerate multimedia
Multimedia
Multimedia is media and content that uses a combination of different content forms. The term can be used as a noun or as an adjective describing a medium as having multiple content forms. The term is used in contrast to media which use only rudimentary computer display such as text-only, or...
and vector processing applications, as well as many other forms of dedicated computation.
The first major commercial application of Cell was in Sony's PlayStation 3
PlayStation 3
The is the third home video game console produced by Sony Computer Entertainment and the successor to the PlayStation 2 as part of the PlayStation series. The PlayStation 3 competes with Microsoft's Xbox 360 and Nintendo's Wii as part of the seventh generation of video game consoles...
game console
Video game console
A video game console is an interactive entertainment computer or customized computer system that produces a video display signal which can be used with a display device to display a video game...
. Mercury Computer Systems
Mercury Computer Systems
Mercury Computer Systems, Inc. provides high-performance embedded, real-time digital signal and image processing solutions.Mercury designs and builds embedded multicomputers, which may be considered to be either loosely coupled NUMA computers or tightly coupled clusters. Despite being marketed as...
has a dual Cell server, a dual Cell blade
Blade server
A blade server is a stripped down server computer with a modular design optimized to minimize the use of physical space and energy. Whereas a standard rack-mount server can function with a power cord and network cable, blade servers have many components removed to save space, minimize power...
configuration, a rugged computer, and a PCI Express accelerator board available in different stages of production. Toshiba had announced plans to incorporate Cell in high definition
High-definition television
High-definition television is video that has resolution substantially higher than that of traditional television systems . HDTV has one or two million pixels per frame, roughly five times that of SD...
television sets, but have seemed to abandon the idea. Exotic features such as the XDR
XDR DRAM
XDR DRAM or extreme data rate dynamic random access memory is a high-performance RAM interface and successor to the Rambus RDRAM it is based on, competing with the rival DDR2 SDRAM and GDDR4 technology. XDR was designed to be effective in small, high-bandwidth consumer systems, high-performance...
memory subsystem and coherent Element Interconnect Bus (EIB) interconnect appear to position Cell for future applications in the supercomputing space to exploit the Cell processor's prowess in floating point
Floating point
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16...
kernels.
The Cell architecture includes a memory coherence
Memory coherence
Memory coherence is an issue that affects the design of computer systems in which two or more processors or cores share a common area of memory....
architecture that emphasizes efficiency/watt, prioritizes bandwidth
Bandwidth (computing)
In computer networking and computer science, bandwidth, network bandwidth, data bandwidth, or digital bandwidth is a measure of available or consumed data communication resources expressed in bits/second or multiples of it .Note that in textbooks on wireless communications, modem data transmission,...
over latency
Latency (engineering)
Latency is a measure of time delay experienced in a system, the precise definition of which depends on the system and the time being measured. Latencies may have different meaning in different contexts.-Packet-switched networks:...
, and favors peak computational throughput
Throughput
In communication networks, such as Ethernet or packet radio, throughput or network throughput is the average rate of successful message delivery over a communication channel. This data may be delivered over a physical or logical link, or pass through a certain network node...
over simplicity of program code. For these reasons, Cell is widely regarded as a challenging environment for software development
Software engineering
Software Engineering is the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software, and the study of these approaches; that is, the application of engineering to software...
. IBM provides a comprehensive Linux
Linux
Linux is a Unix-like computer operating system assembled under the model of free and open source software development and distribution. The defining component of any Linux system is the Linux kernel, an operating system kernel first released October 5, 1991 by Linus Torvalds...
-based Cell development platform to assist developers in confronting these challenges. Software adoption remains a key issue in whether Cell ultimately delivers on its performance potential. Despite those challenges, research has indicated that Cell excels at several types of scientific computation.
History
In mid-2000, Sony Computer EntertainmentSony Computer Entertainment
Sony Computer Entertainment, Inc. is a major video game company specializing in a variety of areas in the video game industry, and is a wholly owned subsidiary and part of the Consumer Products & Services Group of Sony...
, Toshiba Corporation, and IBM
IBM
International Business Machines Corporation or IBM is an American multinational technology and consulting corporation headquartered in Armonk, New York, United States. IBM manufactures and sells computer hardware and software, and it offers infrastructure, hosting and consulting services in areas...
formed an alliance known as "STI" to design and manufacture the processor.
The STI Design Center opened in March 2001. The Cell was designed over a period of four years, using enhanced versions of the design tools for the POWER4
POWER4
The POWER4 is a microprocessor developed by International Business Machines that implemented the 64-bit PowerPC and PowerPC AS instruction set architectures. Released in 2001, the POWER4 succeeded the POWER3 and RS64 microprocessors, and was used in RS/6000 and AS/400 computers, ending a separate...
processor. Over 400 engineers from the three companies worked together in Austin, with critical support from eleven of IBM's design centers.
During this period, IBM filed many patents pertaining to the Cell architecture, manufacturing process, and software environment. An early patent version of the Broadband Engine was shown to be a chip package comprising four "Processing Elements", which was the patent's description for what is now known as the Power Processing Element. Each Processing Element contained 8 APUs, which are now referred to as SPEs on the current Broadband Engine chip. Said chip package was widely regarded to run at a clock speed of 4 GHz and with 32 APUs providing 32 GFLOPS
FLOPS
In computing, FLOPS is a measure of a computer's performance, especially in fields of scientific calculations that make heavy use of floating-point calculations, similar to the older, simpler, instructions per second...
each, the Broadband Engine was shown to have 1 teraflop of raw computing power. This design was fabricated using a 90 nm SOI
Silicon on insulator
Silicon on insulator technology refers to the use of a layered silicon-insulator-silicon substrate in place of conventional silicon substrates in semiconductor manufacturing, especially microelectronics, to reduce parasitic device capacitance and thereby improving performance...
process.
In March 2007 IBM announced that the 65 nm version of Cell BE is in production at its plant in East Fishkill, New York
East Fishkill, New York
East Fishkill is a town on the southern border of Dutchess County, New York, United States. The population was 25,589 at the 2000 census. The town name is derived from its formation from Fishkill, NY....
.
In February 2008, IBM announced that it will begin to fabricate Cell processors with the 45 nm process.
In May 2008, IBM introduced the high-performance double-precision floating-point version of the Cell processor, the PowerXCell 8i, at the 65 nm feature size.
In May 2008, an Opteron
Opteron
Opteron is AMD's x86 server and workstation processor line, and was the first processor which supported the AMD64 instruction set architecture . It was released on April 22, 2003 with the SledgeHammer core and was intended to compete in the server and workstation markets, particularly in the same...
- and PowerXCell 8i-based supercomputer, the IBM Roadrunner system, became the world's first system to achieve one petaFLOPS, and was the fastest computer in the world until third quarter 2009. The world's three most energy efficient supercomputers, as represented by the Green500 list, are similarly based on the PowerXCell 8i.
The 45 nm Cell processor was introduced in concert with Sony's PlayStation 3 Slim in August 2009.
In November 2009, an IBM representative said that it has discontinued the development of a Cell processor with 32 SPUs but they have not halted development of other future products in the Cell family.
Commercialization
On May 17, 2005, Sony Computer Entertainment confirmed some specifications of the Cell processor that would be shipping in the then-forthcoming PlayStation 3PlayStation 3
The is the third home video game console produced by Sony Computer Entertainment and the successor to the PlayStation 2 as part of the PlayStation series. The PlayStation 3 competes with Microsoft's Xbox 360 and Nintendo's Wii as part of the seventh generation of video game consoles...
console. This Cell configuration has one Power processing element (PPE) on the core, with eight physical SPEs in silicon. In the PlayStation 3, one SPE is locked-out during the test process, a practice which helps to improve manufacturing yields, and another one is reserved for the OS, leaving 6 free SPEs to be used by games' code. The target clock-frequency at introduction is 3.2 GHz
GHZ
GHZ or GHz may refer to:# Gigahertz .# Greenberger-Horne-Zeilinger state — a quantum entanglement of three particles.# Galactic Habitable Zone — the region of a galaxy that is favorable to the formation of life....
. The introductory design is fabricated using a 90-nanometer SOI
Silicon on insulator
Silicon on insulator technology refers to the use of a layered silicon-insulator-silicon substrate in place of conventional silicon substrates in semiconductor manufacturing, especially microelectronics, to reduce parasitic device capacitance and thereby improving performance...
process, with initial volume production slated for IBM's facility in East Fishkill, New York
East Fishkill, New York
East Fishkill is a town on the southern border of Dutchess County, New York, United States. The population was 25,589 at the 2000 census. The town name is derived from its formation from Fishkill, NY....
.
Note that the relationship between cores
Multi-core (computing)
A multi-core processor is a single computing component with two or more independent actual processors , which are the units that read and execute program instructions...
and threads
Thread (computer science)
In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...
is a common source of confusion. The PPE core is dual threaded
Simultaneous multithreading
Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading...
and manifests in software as two independent threads of execution while each active SPE manifests as a single thread. In the PlayStation 3 configuration as described by Sony, the Cell processor provides nine independent threads of execution.
On June 28, 2005, IBM and Mercury Computer Systems
Mercury Computer Systems
Mercury Computer Systems, Inc. provides high-performance embedded, real-time digital signal and image processing solutions.Mercury designs and builds embedded multicomputers, which may be considered to be either loosely coupled NUMA computers or tightly coupled clusters. Despite being marketed as...
announced a partnership agreement to build Cell-based computer systems for embedded
Embedded system
An embedded system is a computer system designed for specific control functions within a larger system. often with real-time computing constraints. It is embedded as part of a complete device often including hardware and mechanical parts. By contrast, a general-purpose computer, such as a personal...
applications such as medical imaging
Medical imaging
Medical imaging is the technique and process used to create images of the human body for clinical purposes or medical science...
, industrial inspection, aerospace
Aerospace
Aerospace comprises the atmosphere of Earth and surrounding space. Typically the term is used to refer to the industry that researches, designs, manufactures, operates, and maintains vehicles moving through air and space...
and defense
Defense (military)
Defense has several uses in the sphere of military application.Personal defense implies measures taken by individual soldiers in protecting themselves whether by use of protective materials such as armor, or field construction of trenches or a bunker, or by using weapons that prevent the enemy...
, seismic processing
Reflection seismology
Reflection seismology is a method of exploration geophysics that uses the principles of seismology to estimate the properties of the Earth's subsurface from reflected seismic waves. The method requires a controlled seismic source of energy, such as dynamite/Tovex, a specialized air gun or a...
, and telecommunications. Mercury has since then released blades
Blade server
A blade server is a stripped down server computer with a modular design optimized to minimize the use of physical space and energy. Whereas a standard rack-mount server can function with a power cord and network cable, blade servers have many components removed to save space, minimize power...
, conventional rack servers
19-inch rack
A 19-inch rack is a standardized frame or enclosure for mounting multiple equipment modules. Each module has a front panel that is wide, including edges or ears that protrude on each side which allow the module to be fastened to the rack frame with screws.-Overview and history:Equipment designed...
and PCI Express
PCI Express
PCI Express , officially abbreviated as PCIe, is a computer expansion card standard designed to replace the older PCI, PCI-X, and AGP bus standards...
accelerator boards with Cell processors.
In the fall of 2006, IBM released the QS20 blade module using double Cell BE processors for tremendous performance in certain applications, reaching a peak of 410 gigaFLOPS
FLOPS
In computing, FLOPS is a measure of a computer's performance, especially in fields of scientific calculations that make heavy use of floating-point calculations, similar to the older, simpler, instructions per second...
per module. The QS22 based on the PowerXCell 8i processor is used for the IBM Roadrunner supercomputer. Mercury and IBM uses the fully utilized Cell processor with 8 active SPEs. On April 8, 2008, Fixstars Corporation released a PCI Express
PCI Express
PCI Express , officially abbreviated as PCIe, is a computer expansion card standard designed to replace the older PCI, PCI-X, and AGP bus standards...
accelerator board based on the PowerXCell 8i processor.
Sony's high performance media computing server ZEGO
Zego
The ZEGO is a rackmount server platform built by Sony, targeted for the video postproduction and broadcast markets. The platform is based on Sony's PlayStation 3 as it features both the Cell Processor as well as the RSX 'Reality Synthesizer'...
uses a 3.2 GHz Cell/B.E processor.
Overview
The Cell Broadband Engine—or Cell as it is more commonly known—is a microprocessor designed to bridge the gap between conventional desktop processors (such as the Athlon 64Athlon 64
The Athlon 64 is an eighth-generation, AMD64-architecture microprocessor produced by AMD, released on September 23, 2003. It is the third processor to bear the name Athlon, and the immediate successor to the Athlon XP...
, and Core 2 families) and more specialized high-performance processors, such as the NVIDIA
NVIDIA
Nvidia is an American global technology company based in Santa Clara, California. Nvidia is best known for its graphics processors . Nvidia and chief rival AMD Graphics Techonologies have dominated the high performance GPU market, pushing other manufacturers to smaller, niche roles...
and ATI
Ati
As a word, Ati may refer to:* Ati, a town in Chad* Ati, a Negrito ethnic group in the Philippines* Ati-Atihan Festival, an annual celebration held in the Philippines* Ati, a queen of the fabled Land of Punt in Africa...
graphics-processors (GPU
Graphics processing unit
A graphics processing unit or GPU is a specialized circuit designed to rapidly manipulate and alter memory in such a way so as to accelerate the building of images in a frame buffer intended for output to a display...
s). The longer name indicates its intended use, namely as a component in current and future digital distribution
Digital distribution
Online distribution, digital distribution, or electronic software distribution is the practice of delivering content without the use of physical media, typically by downloading via the internet directly to a consumer's device. Online distribution bypasses conventional physical distribution media,...
systems; as such it may be utilized in high-definition displays and recording equipment, as well as computer entertainment systems for the HDTV
High-definition television
High-definition television is video that has resolution substantially higher than that of traditional television systems . HDTV has one or two million pixels per frame, roughly five times that of SD...
era. Additionally the processor may be suited to digital imaging
Digital imaging
Digital imaging or digital image acquisition is the creation of digital images, typically from a physical scene. The term is often assumed to imply or include the processing, compression, storage, printing, and display of such images...
systems (medical, scientific, etc.) as well as physical simulation (e.g., scientific and structural engineering
Structural engineering
Structural engineering is a field of engineering dealing with the analysis and design of structures that support or resist loads. Structural engineering is usually considered a specialty within civil engineering, but it can also be studied in its own right....
modeling).
In a simple analysis, the Cell processor can be split into four components: external input and output structures, the main processor called the Power Processing Element (PPE) (a two-way simultaneous multithreaded
Simultaneous multithreading
Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading...
Power ISA v.2.03 compliant core), eight fully functional co-processors called the Synergistic Processing Elements, or SPEs, and a specialized high-bandwidth circular data bus connecting the PPE, input/output elements and the SPEs, called the Element Interconnect Bus or EIB.
To achieve the high performance needed for mathematically intensive tasks, such as decoding/encoding MPEG streams, generating or transforming three-dimensional data, or undertaking Fourier analysis of data, the Cell processor marries the SPEs and the PPE via EIB to give access, via fully cache coherent DMA (direct memory access)
Direct memory access
Direct memory access is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory independently of the central processing unit ....
, to both main memory and to other external data storage. To make the best of EIB, and to overlap computation and data transfer, each of the nine processing elements (PPE and SPEs) is equipped with a DMA engine. Since the SPE's load/store instructions can only access its own local memory, each SPE entirely depends on DMAs to transfer data to and from the main memory and other SPEs' local memories. A DMA operation can transfer either a single block area of size up to 16KB, or a list of 2 to 2048 such blocks. One of the major design decisions in the architecture of Cell is the use of DMAs as a central means of intra-chip data transfer, with a view to enabling maximal asynchrony and concurrency in data processing inside a chip.
The PPE, which is capable of running a conventional operating system, has control over the SPEs and can start, stop, interrupt, and schedule processes running on the SPEs. To this end the PPE has additional instructions relating to control of the SPEs. Unlike SPEs, the PPE can read and write the main memory and the local memories of SPEs through the standard load/store instructions. Despite having Turing complete architectures, the SPEs are not fully autonomous and require the PPE to prime them before they can do any useful work. Though most of the "horsepower" of the system comes from the synergistic processing elements, the use of DMA
DMA
DMA can refer to:* DMA , a defunct dance music magazine* Dallas Museum of Art, an art museum in Texas, USA* Danish Music Awards, an award show held in Denmark since 1989...
as a method of data transfer and the limited local memory footprint of each SPE pose a major challenge to software developers who wish to make the most of this horsepower, demanding careful hand-tuning of programs to extract maximal performance from this CPU.
The PPE and bus architecture includes various modes of operation giving different levels of memory protection
Memory protection
Memory protection is a way to control memory access rights on a computer, and is a part of most modern operating systems. The main purpose of memory protection is to prevent a process from accessing memory that has not been allocated to it. This prevents a bug within a process from affecting...
, allowing areas of memory to be protected from access by specific processes running on the SPEs or the PPE.
Both the PPE and SPE are RISC architectures with a fixed-width 32-bit instruction format. The PPE contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit Altivec
AltiVec
AltiVec is a floating point and integer SIMD instruction set designed and owned by Apple, IBM and Freescale Semiconductor, formerly the Semiconductor Products Sector of Motorola, , and implemented on versions of the PowerPC including Motorola's G4, IBM's G5 and POWER6 processors, and P.A. Semi's...
register set. The SPE contains 128-bit registers only. These can be used for scalar data types ranging from 8-bits to 128-bits in size or for SIMD
SIMD
Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...
computations on a variety of integer and floating point formats. System memory addresses for both the PPE and SPE are expressed as 64-bit values for a theoretic address range of 264 bytes (16 exabytes or 16,777,216 terabytes). In practice, not all of these bits are implemented in hardware. Local store addresses internal to the SPU processor are expressed as a 32-bit word. In documentation relating to Cell a word is always taken to mean 32 bits, a doubleword means 64 bits, and a quadword means 128 bits.
PowerXCell 8i
In 2008, IBM announced a revised variant of the Cell called the PowerXCell 8i, which is available in QS22 Blade Servers from IBM. The PowerXCell is manufactured on a 65 nm process, and adds support for up to 32 GB of slotted DDR2 memory, as well as dramatically improving double-precision floating-point performance on the SPEs from a peak of about 12.8 GFLOPS to 102.4 GFLOPS total for eight SPEs, which, coincidently, is the same peak performance as the NEC SX-9NEC SX-9
The SX-9 is a supercomputer built by NEC Corporation. The SX-9 Series implements an SMP system in a compact node module and uses an enhanced version of the single chip vector processor that was introduced with the SX-6...
vector processor released around the same time. The IBM Roadrunner supercomputer, the world's fastest 2008-2009, consists of 12,240 PowerXCell 8i processors, along with 6,562 AMD Opteron processors. The PowerXCell 8i powered super computers also dominated the all top 6 "greenest" systems in the Green500 list, with highest MFLOPS/Watt ratio supercomputers in the world. Beside the QS22 and supercomputers, the PowerXCell processor is also available as an accelerator on a PCI Express card and is used as the core processor in the QPACE
QPACE
QPACE is pursuing the development of a massive parallel, scalable supercomputer for applications in lattice quantum chromodynamics . The machine structure is a three-dimensional torus of identical processing nodes, based on IBM's PowerXCell 8i processors...
project.
Architecture
While the Cell chip can have a number of different configurations, the basic configuration is a multi-coreMulti-core (computing)
A multi-core processor is a single computing component with two or more independent actual processors , which are the units that read and execute program instructions...
chip composed of one "Power Processor Element" ("PPE") (sometimes called "Processing Element", or "PE"), and multiple "Synergistic Processing Elements" ("SPE"). The PPE and SPEs are linked together by an internal high speed bus dubbed "Element Interconnect Bus" ("EIB"). Due to the nature of its applications, Cell is optimized towards single precision floating point
Floating point
In computing, floating point describes a method of representing real numbers in a way that can support a wide range of values. Numbers are, in general, represented approximately to a fixed number of significant digits and scaled using an exponent. The base for the scaling is normally 2, 10 or 16...
computation. The SPEs are capable of performing double precision
Double precision
In computing, double precision is a computer number format that occupies two adjacent storage locations in computer memory. A double-precision number, sometimes simply called a double, may be defined to be an integer, fixed point, or floating point .Modern computers with 32-bit storage locations...
calculations, albeit with an order of magnitude performance penalty. New chips expected mid-2008 are rumored to boost SPE double precision performance as high as 5x over pre-2008 designs. In the meantime, there are ways to circumvent this in software using iterative refinement, which means values are calculated in double precision only when necessary. Jack Dongarra
Jack Dongarra
Jack J. Dongarra is a University Distinguished Professor of Computer Sciencein the Electrical Engineering and Computer Science Department at the University of Tennessee...
and his team demonstrated a 3.2 GHz Cell with 8 SPEs delivering a performance equal to 100 GFLOPS on an average double precision Linpack
LINPACK
LINPACK is a software library for performing numerical linear algebra on digital computers. It was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Gilbert Stewart, and was intended for use on supercomputers in the 1970s and early 1980s...
4096x4096 matrix.
Power Processor Element (PPE)
The PPE is the Power ArchitecturePower Architecture
Power Architecture is a broad term to describe similar RISC instruction sets for microprocessors developed and manufactured by such companies as IBM, Freescale, AMCC, Tundra and P.A. Semi...
based, two-way multithreaded
Simultaneous multithreading
Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading...
core acting as the controller for the eight SPEs, which handle most of the computational workload. The PPE will work with conventional operating systems due to its similarity to other 64-bit PowerPC processors, while the SPEs are designed for vectorized floating point code execution. The PPE contains a 64 KiB level 1 cache
Cache
In computer engineering, a cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere...
(32 KiB instruction and a 32 KiB data) and a 512 KiB Level 2 cache. The size of a cache line is 128 bytes. Additionally, IBM has included an AltiVec
AltiVec
AltiVec is a floating point and integer SIMD instruction set designed and owned by Apple, IBM and Freescale Semiconductor, formerly the Semiconductor Products Sector of Motorola, , and implemented on versions of the PowerPC including Motorola's G4, IBM's G5 and POWER6 processors, and P.A. Semi's...
unit which is fully pipelined for single precision floating point. (Altivec does not support double precision
Double precision
In computing, double precision is a computer number format that occupies two adjacent storage locations in computer memory. A double-precision number, sometimes simply called a double, may be defined to be an integer, fixed point, or floating point .Modern computers with 32-bit storage locations...
floating-point vectors.) Each PPE can complete two double precision operations per clock cycle using a scalar-fused multiply-add instruction, which translates to 6.4 GFLOPS at 3.2 GHz; or eight single precision operations per clock cycle with a vector fused-multiply-add instruction, which translates to 25.6 GFLOPS at 3.2 GHz.
Xenon in Xbox 360
The PPE was designed specifically for the Cell processor but during development, MicrosoftMicrosoft
Microsoft Corporation is an American public multinational corporation headquartered in Redmond, Washington, USA that develops, manufactures, licenses, and supports a wide range of products and services predominantly related to computing through its various product divisions...
approached IBM wanting a high performance processor core for its Xbox 360
Xbox 360
The Xbox 360 is the second video game console produced by Microsoft and the successor to the Xbox. The Xbox 360 competes with Sony's PlayStation 3 and Nintendo's Wii as part of the seventh generation of video game consoles...
. IBM complied and made the tri-core Xenon processor
Xenon (processor)
Xenon is a CPU that is used in the Xbox 360 game console. The processor, internally codenamed "Waternoose", which was named after Henry J. Waternoose III in Monsters Inc. by IBM and XCPU by Microsoft, is based on IBM's PowerPC instruction set architecture, consisting of three independent processor...
, based on a slightly modified version of the PPE.
Synergistic Processing Elements (SPE)
Each SPE is composed of a "Synergistic Processing Unit", SPU, and a "Memory Flow Controller", MFC (DMADirect memory access
Direct memory access is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory independently of the central processing unit ....
, MMU
Memory management unit
A memory management unit , sometimes called paged memory management unit , is a computer hardware component responsible for handling accesses to memory requested by the CPU...
, and bus interface). An SPE is a RISC processor with 128-bit
128-bit
There are currently no mainstream general-purpose processors built to operate on 128-bit integers or addresses, though a number of processors do operate on 128-bit data. The IBM System/370 could be considered the first rudimentary 128-bit computer as it used 128-bit floating point registers...
SIMD
SIMD
Single instruction, multiple data , is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data simultaneously...
organization for single and double precision instructions. With the current generation of the Cell, each SPE contains a 256 KiB embedded SRAM
1T-SRAM
1T-SRAM is a pseudostatic random-access memory technology introduced by MoSys, Inc., which offers a high-density alternative to traditional static random access memory in embedded memory applications...
for instruction and data, called "Local Storage" (not to be mistaken for "Local Memory" in Sony's documents that refer to the VRAM) which is visible to the PPE and can be addressed directly by software. Each SPE can support up to 4 GiB
Gib
Gib may refer to:* A castrated male cat or ferret* Gibibit , measurement unit of digitally stored computer information* Gibraltar, British overseas territory* Drywall, construction material...
of local store memory. The local store does not operate like a conventional CPU cache
Cache
In computer engineering, a cache is a component that transparently stores data so that future requests for that data can be served faster. The data that is stored within a cache might be values that have been computed earlier or duplicates of original values that are stored elsewhere...
since it is neither transparent to software nor does it contain hardware structures that predict which data to load. The SPEs contain a 128-bit, 128-entry register file
Register file
A register file is an array of processor registers in a central processing unit . Modern integrated circuit-based register files are usually implemented by way of fast static RAMs with multiple ports...
and measures 14.5 mm2 on a 90 nm process. An SPE can operate on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers in a single clock cycle, as well as a memory operation. Note that the SPU cannot directly access system memory; the 64-bit virtual memory addresses formed by the SPU must be passed from the SPU to the SPE memory flow controller (MFC) to set up a DMA operation within the system address space.
In one typical usage scenario, the system will load the SPEs with small programs (similar to threads
Thread (computer science)
In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...
), chaining the SPEs together to handle each step in a complex operation. For instance, a set-top box
Set-top box
A set-top box or set-top unit is an information appliance device that generally contains a tuner and connects to a television set and an external source of signal, turning the signal into content which is then displayed on the television screen or other display device.-History:Before the...
might load programs for reading a DVD, video and audio decoding, and display, and the data would be passed off from SPE to SPE until finally ending up on the TV. Another possibility is to partition the input data set and have several SPEs performing the same kind of operation in parallel. At 3.2 GHz, each SPE gives a theoretical 25.6 GFLOPS of single precision performance.
Compared to a modern personal computer
Personal computer
A personal computer is any general-purpose computer whose size, capabilities, and original sales price make it useful for individuals, and which is intended to be operated directly by an end-user with no intervening computer operator...
, the relatively high overall floating point performance of a Cell processor seemingly dwarfs the abilities of the SIMD unit in desktop CPUs like the Pentium 4
Pentium 4
Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the...
and the Athlon 64
Athlon 64
The Athlon 64 is an eighth-generation, AMD64-architecture microprocessor produced by AMD, released on September 23, 2003. It is the third processor to bear the name Athlon, and the immediate successor to the Athlon XP...
. However, comparing only floating point abilities of a system is a one-dimensional and application-specific metric. Unlike a Cell processor, such desktop CPUs are more suited to the general purpose software usually run on personal computers. In addition to executing multiple instructions per clock, processors from Intel and AMD feature branch predictor
Branch predictor
In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch will go before this is known for sure. The purpose of the branch predictor is to improve the flow in the instruction pipeline...
s. The Cell is designed to compensate for this with compiler assistance, in which prepare-to-branch instructions are created. For double-precision floating point operations, as sometimes used in personal computers and often used in scientific computing, Cell performance drops by an order of magnitude, but still reaches 20.8 GFLOPS (1.8 GFLOPS per SPE, 6.4 GFLOPS per PPE). The PowerXCell 8i variant, which was specifically designed for double-precision, reaches 102.4 GFLOPS in double-precision calculations.
Recent tests by IBM show that the SPEs can reach 98% of their theoretical peak performance using optimized parallel Matrix Multiplication.
Toshiba
Toshiba
is a multinational electronics and electrical equipment corporation headquartered in Tokyo, Japan. It is a diversified manufacturer and marketer of electrical products, spanning information & communications equipment and systems, Internet-based solutions and services, electronic components and...
has developed a co-processor powered by four SPEs, but no PPE, called the SpursEngine
SpursEngine
SpursEngine is a microprocessor from Toshiba built as a media oriented coprocessor, designed for 3D- and video processing in consumer electronics such as set-top boxes and computers. The SpursEngine processor is also known as the Quad Core HD processor...
designed to accelerate 3D and movie effects in consumer electronics.
Element Interconnect Bus (EIB)
The EIB is a communication bus internal to the Cell processor which connects the various on-chip system elements: the PPE processor, the memory controller (MIC), the eight SPE coprocessors, and two off-chip I/O interfaces, for a total of 12 participants in the PS3 (the number of SPU can vary in industrial applications). The EIB also includes an arbitration unit which functions as a set of traffic lights. In some documents IBM refers to EIB bus participants as 'units'.The EIB is presently implemented as a circular ring consisting of four 16B-wide unidirectional channels which counter-rotate in pairs. When traffic patterns permit, each channel can convey up to three transactions concurrently. As the EIB runs at half the system clock rate the effective channel rate is 16 bytes every two system clocks. At maximum concurrency
Concurrency
Concurrency, concurrent, or concurrence may refer to:* Concurrence, a legal term referring to the need to prove both actus reus and mens rea...
, with three active transactions on each of the four rings, the peak instantaneous EIB bandwidth is 96B per clock (12 concurrent transactions * 16 bytes wide / 2 system clocks per transfer). While this figure is often quoted in IBM literature it is unrealistic to simply scale this number by processor clock speed. The arbitration unit imposes additional constraints which are discussed in the Bandwidth Assessment section below.
IBM Senior Engineer David Krolak, EIB lead designer, explains the concurrency model:
- A ring can start a new op every three cycles. Each transfer always takes eight beats. That was one of the simplifications we made, it's optimized for streaming a lot of data. If you do small ops, it does not work quite as well. If you think of eight-car trains running around this track, as long as the trains aren't running into each other, they can coexist on the track.
Each participant on the EIB has one 16B read port and one 16B write port. The limit for a single participant is to read and write at a rate of 16B per EIB clock (for simplicity often regarded 8B per system clock). Note that each SPU processor contains a dedicated DMA
Direct memory access
Direct memory access is a feature of modern computers that allows certain hardware subsystems within the computer to access system memory independently of the central processing unit ....
management queue capable of scheduling long sequences of transactions to various endpoints without interfering with the SPU's ongoing computations; these DMA queues can be managed locally or remotely as well, providing additional flexibility in the control model.
Data flows on an EIB channel stepwise around the ring. Since there are twelve participants, the total number of steps around the channel back to the point of origin is twelve. Six steps is the longest distance between any pair of participants. An EIB channel is not permitted to convey data requiring more than six steps; such data must take the shorter route around the circle in the other direction. The number of steps involved in sending the packet has very little impact on transfer latency: the clock speed driving the steps is very fast relative to other considerations. However, longer communication distances are detrimental to the overall performance of the EIB as they reduce available concurrency.
Despite IBM's original desire to implement the EIB as a more powerful cross-bar, the circular configuration they adopted to spare resources rarely represents a limiting factor on the performance of the Cell chip as a whole. In the worst case, the programmer must take extra care to schedule communication patterns where the EIB is able to function at high concurrency levels.
David Krolak explains:
- Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is designed, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.
Bandwidth assessment
For the sake of quoting performance numbers, we will assume a Cell processor running at 3.2 GHz, the clock speed most often cited.At this clock frequency each channel flows at a rate of 25.6 GB/s. Viewing the EIB in isolation from the system elements it connects, achieving twelve concurrent transactions at this flow rate works out to an abstract EIB bandwidth of 307.2 GB/s. Based on this view many IBM publications depict available EIB bandwidth as "greater than 300 GB/s". This number reflects the peak instantaneous EIB bandwidth scaled by processor frequency.
However, other technical restrictions are involved in the arbitration mechanism for packets accepted onto the bus. The IBM Systems Performance group explains:
- Each unit on the EIB can simultaneously send and receive 16B of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system, which is one per bus cycle. Since each snooped address request can potentially transfer up to 128B, the theoretical peak data bandwidth on the EIB at 3.2 GHz is 128Bx1.6 GHz = 204.8 GB/s.
This quote apparently represents the full extent of IBM's public disclosure of this mechanism and its impact. The EIB arbitration unit, the snooping mechanism, and interrupt generation on segment or page translation faults are not well described in the documentation set as yet made public by IBM.
In practice effective EIB bandwidth can also be limited by the ring participants involved. While each of the nine processing cores can sustain 25.6 GB/s read and write concurrently, the memory interface controller (MIC) is tied to a pair of XDR memory channels permitting a maximum flow of 25.6 GB/s for reads and writes combined and the two IO controllers are documented as supporting a peak combined input speed of 25.6 GB/s and a peak combined output speed of 35 GB/s.
To add further to the confusion, some older publications cite EIB bandwidth assuming a 4 GHz system clock. This reference frame results in an instantaneous EIB bandwidth figure of 384 GB/s and an arbitration-limited bandwidth figure of 256 GB/s.
All things considered the theoretic 204.8 GB/s number most often cited is the best one to bear in mind. The IBM Systems Performance group has demonstrated SPU-centric data flows achieving 197 GB/s on a Cell processor running at 3.2 GHz so this number is a fair reflection on practice as well.
Optical interconnect
Sony is currently working on the development of an optical interconnection technology for use in the device-to-device or internal interface of various types of cell-based digital consumer electronics and game systems.Memory and I/O Controllers
Cell contains a dual channel RambusRambus
Rambus Incorporated , founded in 1990, is a technology licensing company. The company became well known for its intellectual property based litigation following the introduction of DDR-SDRAM memory.- History :...
XIO macro which interfaces to Rambus XDR memory
XDR DRAM
XDR DRAM or extreme data rate dynamic random access memory is a high-performance RAM interface and successor to the Rambus RDRAM it is based on, competing with the rival DDR2 SDRAM and GDDR4 technology. XDR was designed to be effective in small, high-bandwidth consumer systems, high-performance...
. The memory interface controller (MIC) is separate from the XIO macro and is designed by IBM. The XIO-XDR link runs at 3.2 Gbit/s per pin. Two 32-bit channels can provide a theoretical maximum of 25.6 GB/s.
The I/O interface, also a Rambus design, is known as FlexIO. The FlexIO interface is organized into 12 lanes, each lane being a unidirectional 8-bit wide point-to-point path. Five 8-bit wide point-to-point paths are inbound lanes to Cell, while the remaining seven are outbound. This provides a theoretical peak bandwidth of 62.4 GB/s (36.4 GB/s outbound, 26 GB/s inbound) at 2.6 GHz. The FlexIO interface can be clocked independently, typ. at 3.2 GHz. 4 inbound + 4 outbound lanes are supporting memory coherency.
Video processing card
Some companies, such as LeadtekLeadtek
Leadtek Research, Inc. is a Taiwanese company, founded in 1986, which focuses on research and development that is specialized in the design and manufacture of graphics cards.- Products :...
, have released PCI-E cards based upon the Cell to allow for "faster than real time" transcoding of H.264, MPEG-2
MPEG-2
MPEG-2 is a standard for "the generic coding of moving pictures and associated audio information". It describes a combination of lossy video compression and lossy audio data compression methods which permit storage and transmission of movies using currently available storage media and transmission...
and MPEG-4
MPEG-4
MPEG-4 is a method of defining compression of audio and visual digital data. It was introduced in late 1998 and designated a standard for a group of audio and video coding formats and related technology agreed upon by the ISO/IEC Moving Picture Experts Group under the formal standard ISO/IEC...
video.
Blade server
On 29 August 2007, IBM announced the BladeCenter QS21. Generating a measured 1.05 giga–floating point operations per second (gigaFLOPS) per watt, with peak performance of approximately 460 GFLOPS it is one of the most power efficient computing platforms to date. A single BladeCenter chassis can achieve 6.4 tera–floating point operations per second (teraFLOPS) and over 25.8 teraFLOPS in a standard 42U rack.IBM Press Release
On 13 May 2008, IBM announced the BladeCenter QS22. The QS22 introduces the PowerXCell 8i processor with five times the double-precision floating point performance of the QS21, and the capacity for up to 32 GB of DDR2 memory on-blade.
IBM Press Release
PCI Express Board
Several companies provide PCI-e boards utilising the IBM PowerXCell 8i. The performance is reported as 179.2 GFlops (SP), 89.6 GFlops (DP) at 2.8 GHz.Console video games
SonySony
, commonly referred to as Sony, is a Japanese multinational conglomerate corporation headquartered in Minato, Tokyo, Japan and the world's fifth largest media conglomerate measured by revenues....
's PlayStation 3
PlayStation 3
The is the third home video game console produced by Sony Computer Entertainment and the successor to the PlayStation 2 as part of the PlayStation series. The PlayStation 3 competes with Microsoft's Xbox 360 and Nintendo's Wii as part of the seventh generation of video game consoles...
video game console
Video game console
A video game console is an interactive entertainment computer or customized computer system that produces a video display signal which can be used with a display device to display a video game...
contains the first production application of the Cell processor, clocked at 3.2 GHz
GHZ
GHZ or GHz may refer to:# Gigahertz .# Greenberger-Horne-Zeilinger state — a quantum entanglement of three particles.# Galactic Habitable Zone — the region of a galaxy that is favorable to the formation of life....
and containing seven out of eight operational SPEs, to allow Sony to increase the yield on the processor manufacture. Only six of the seven SPEs are accessible to developers as one is reserved by the OS.
Home cinema
Toshiba has produced HDTVsHigh-definition television
High-definition television is video that has resolution substantially higher than that of traditional television systems . HDTV has one or two million pixels per frame, roughly five times that of SD...
using Cell. They have already presented a system to decode 48 standard definition
Standard-definition television
Sorete-definition television is a television system that uses a resolution that is not considered to be either enhanced-definition television or high-definition television . The term is usually used in reference to digital television, in particular when broadcasting at the same resolution as...
MPEG-2
MPEG-2
MPEG-2 is a standard for "the generic coding of moving pictures and associated audio information". It describes a combination of lossy video compression and lossy audio data compression methods which permit storage and transmission of movies using currently available storage media and transmission...
streams simultaneously on a 1920×1080
1080i
1080i is the shorthand name for a high-definition television mode. The i means interlaced video; 1080i differs from 1080p, in which the p stands for progressive scan. The term 1080i assumes a widescreen aspect ratio of 16:9, implying a frame size of 1920×1080 pixels...
screen. This can enable a viewer to choose a channel based on dozens of thumbnail videos displayed simultaneously on the screen.
Supercomputing
IBM's latest supercomputer, IBM Roadrunner, is a hybrid of General Purpose CISC Opteron as well as Cell processors. This system assumed the #1 spot on the June 2008 Top 500 list as the first supercomputer to run at petaFLOPSFLOPS
In computing, FLOPS is a measure of a computer's performance, especially in fields of scientific calculations that make heavy use of floating-point calculations, similar to the older, simpler, instructions per second...
speeds, having gained a sustained 1.026 petaFLOPS speed using the standard Linpack
LINPACK
LINPACK is a software library for performing numerical linear algebra on digital computers. It was written in Fortran by Jack Dongarra, Jim Bunch, Cleve Moler, and Gilbert Stewart, and was intended for use on supercomputers in the 1970s and early 1980s...
benchmark. IBM Roadrunner uses the PowerXCell 8i version of the Cell processor, manufactured using 65 nm technology and enhanced SPUs that can handle double precision calculations in the 128-bit registers, reaching double precision 102 GFLOPs per chip.
Cluster computing
Clusters of PlayStation 3PlayStation 3
The is the third home video game console produced by Sony Computer Entertainment and the successor to the PlayStation 2 as part of the PlayStation series. The PlayStation 3 competes with Microsoft's Xbox 360 and Nintendo's Wii as part of the seventh generation of video game consoles...
consoles are an attractive alternative to high-end systems based on Cell blades. Innovative Computing Laboratory, a group led by Jack Dongarra
Jack Dongarra
Jack J. Dongarra is a University Distinguished Professor of Computer Sciencein the Electrical Engineering and Computer Science Department at the University of Tennessee...
, in the Computer Science Department at the University of Tennessee, investigated such an application in depth. Terrasoft Solutions is selling 8-node and 32-node PS3 clusters with Yellow Dog Linux
Yellow Dog Linux
Yellow Dog Linux, also known as YDL, is a free and open source operating system for high performance computing on multicore architectures. It focuses on GPU systems and computers using the Power Architecture . YDL is currently developed by Fixstars...
pre-installed, an implementation of Dongarra's research.
As first reported by Wired Magazine on October 17, 2007, an interesting application of using PlayStation 3 in a cluster configuration was implemented by Astrophysicist Dr. Gaurav Khanna, from the Physics department of University of Massachusetts Dartmouth
University of Massachusetts Dartmouth
The University of Massachusetts Dartmouth is one of five campuses and operating subdivisions of the University of Massachusetts . It is located in North Dartmouth, Massachusetts, United States, in the center of the South Coast region, between the cities of New Bedford to the east and Fall River...
, who replaced time used on supercomputers with a cluster of eight PlayStation 3s. Subsequently, the next generation of this machine, now called the PlayStation 3
PlayStation 3
The is the third home video game console produced by Sony Computer Entertainment and the successor to the PlayStation 2 as part of the PlayStation series. The PlayStation 3 competes with Microsoft's Xbox 360 and Nintendo's Wii as part of the seventh generation of video game consoles...
Gravity Grid, uses a network of 16 machines, and exploits the Cell processor for the intended application which is binary black hole
Black hole
A black hole is a region of spacetime from which nothing, not even light, can escape. The theory of general relativity predicts that a sufficiently compact mass will deform spacetime to form a black hole. Around a black hole there is a mathematically defined surface called an event horizon that...
coalescence using perturbation theory
Perturbation theory
Perturbation theory comprises mathematical methods that are used to find an approximate solution to a problem which cannot be solved exactly, by starting from the exact solution of a related problem...
. In particular, the cluster performs astrophysical simulations of large supermassive black holes capturing smaller compact objects and has generated numerical data that has been published multiple times in the relevant scientific research literature. The Cell processor version used by the PlayStation 3 has a main CPU and 6 floating-point vector processors, giving the Gravity Grid machine a net of 16 general-purpose processors and 96 vector processors. The machine has a one-time cost of over $9,000 to build and is adequate for black-hole simulations which would otherwise cost $6,000 per run on a conventional supercomputer. The black hole calculations are not memory-intensive and are highly localizable, and so are well-suited to this architecture. Khanna claims that the cluster's performance exceeds that of a 100+ Intel Xeon core based traditional Linux cluster on his simulations. The PS3 Gravity Grid gathered significant media attention through 2007, 2008, 2009 and 2010.
The computational Biochemistry and Biophysics lab at the Universitat Pompeu Fabra, in Barcelona
Barcelona
Barcelona is the second largest city in Spain after Madrid, and the capital of Catalonia, with a population of 1,621,537 within its administrative limits on a land area of...
, deployed in 2007 a BOINC system called PS3GRID
GPUGRID.net
GPUGRID is a distributed computing project hosted by Pompeu Fabra University and running on the Berkeley Open Infrastructure for Network Computing software platform...
for collaborative computing based on the CellMD software, the first one designed specifically for the Cell processor.
The United States Air Force Research Laboratory
Air Force Research Laboratory
The Air Force Research Laboratory is a scientific research organization operated by the United States Air Force Materiel Command dedicated to leading the discovery, development, and integration of affordable aerospace warfighting technologies; planning and executing the Air Force science and...
has announced plans to deploy a PlayStation 3 cluster of 1760 units, nicknamed the "Condor Cluster", for analyzing high-resolution satellite imagery
Satellite imagery
Satellite imagery consists of photographs of Earth or other planets made by means of artificial satellites.- History :The first images from space were taken on sub-orbital flights. The U.S-launched V-2 flight on October 24, 1946 took one image every 1.5 seconds...
. The Air Force claims the Condor Cluster would be the 33rd largest supercomputer in the world in terms of capacity.
Distributed computing
With the help of the computing power of over half a million PlayStation 3 consoles, the distributed computing project Folding@HomeFolding@home
Folding@home is a distributed computing project designed to use spare processing power on personal computers to perform simulations of disease-relevant protein folding and other molecular dynamics, and to improve on the methods of doing so...
has been recognized by Guinness World Records
Guinness World Records
Guinness World Records, known until 2000 as The Guinness Book of Records , is a reference book published annually, containing a collection of world records, both human achievements and the extremes of the natural world...
as the most powerful distributed network in the world. The first record was achieved on September 16, 2007, as the project surpassed one petaFLOPS
FLOPS
In computing, FLOPS is a measure of a computer's performance, especially in fields of scientific calculations that make heavy use of floating-point calculations, similar to the older, simpler, instructions per second...
, which had never been reached before by a distributed computing network. Additionally, the collective efforts enabled PS3 alone to reach the petaFLOPS mark on September 23, 2007. In comparison, the world's second most powerful supercomputer at the time, IBM's BlueGene/L, performed at around 478.2 teraFLOPS. This means Folding@Home's computing power is approximately twice BlueGene/L's (although the CPU interconnect in BlueGene/L is more than one million times faster than the mean network speed in Folding@Home.). As of May 7th 2011, Folding@home runs at about 9.3 x86 petaFLOPS, with 1.6 petaFLOPS generated by 26,000 active PS3s alone. In late 2008, A cluster of 200 PlayStation 3 consoles was used to generate a rogue SSL
Transport Layer Security
Transport Layer Security and its predecessor, Secure Sockets Layer , are cryptographic protocols that provide communication security over the Internet...
certificate, effectively cracking its encryption.
Mainframes
IBM announced April 25, 2007 that it will begin integrating its Cell Broadband Engine Architecture microprocessors into the company's line of mainframes.Password cracking
The architecture of the processor makes it better suited to hardware-assisted cryptographic brute force attackBrute force attack
In cryptography, a brute-force attack, or exhaustive key search, is a strategy that can, in theory, be used against any encrypted data. Such an attack might be utilized when it is not possible to take advantage of other weaknesses in an encryption system that would make the task easier...
applications than conventional processors.
Software engineering
Due to the flexible nature of the Cell, there are several possibilities for the utilization of its resources, not limited to just different computing paradigms:Job queue
The PPE maintains a job queue, schedules jobs in SPEs, and monitors progress. Each SPE runs a "mini kernel" whose role is to fetch a job, execute it, and synchronize with the PPE.Self-multitasking of SPEs
The kernel and scheduling is distributed across the SPEs. Tasks are synchronized using mutexesMutual exclusion
Mutual exclusion algorithms are used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable, by pieces of computer code called critical sections. A critical section is a piece of code in which a process or thread accesses a common resource...
or semaphores
Semaphore (programming)
In computer science, a semaphore is a variable or abstract data type that provides a simple but useful abstraction for controlling access by multiple processes to a common resource in a parallel programming environment....
as in a conventional operating system
Operating system
An operating system is a set of programs that manage computer hardware resources and provide common services for application software. The operating system is the most important type of system software in a computer system...
. Ready-to-run tasks wait in a queue for an SPE to execute them. The SPEs use shared memory for all tasks in this configuration.
Stream processing
Each SPE runs a distinct program. Data comes from an input stream, and is sent to SPEs. When an SPE has terminated the processing, the output data is sent to an output stream.This provides a flexible and powerful architecture for stream processing
Stream processing
Stream processing is a computer programming paradigm, related to SIMD , that allows some applications to more easily exploit a limited form of parallel processing...
, and allows explicit scheduling for each SPE separately. Other processors are also able to perform streaming tasks, but are limited by the kernel loaded.
Open source software development
An open source software-based strategy was adopted to accelerate the development of a Cell BE ecosystem and to provide an environment to develop Cell applications. In 2005, patches enabling Cell support in the Linux kernel were submitted for inclusion by IBM developers. Arnd Bergmann (one of the developers of the aforementioned patches) also described the Linux-based Cell architecture at LinuxTagLinuxTag
LinuxTag is a Free Software expo with an emphasis on Linux , held every summer in Germany. It is relatively large, claiming that it is the largest expo of this kind in Europe, drawing visitors from many countries....
2005.
Both PPE and SPEs are programmable in C/C++ using a common API provided by libraries.
Fixstars Solutions
Fixstars Solutions
Fixstars Solutions, Inc is a software and services company specializing in Multi-core processor, particularly in the Nvidia's GPU and CUDA environment, IBM Power7 and Cell....
provides Yellow Dog Linux
Yellow Dog Linux
Yellow Dog Linux, also known as YDL, is a free and open source operating system for high performance computing on multicore architectures. It focuses on GPU systems and computers using the Power Architecture . YDL is currently developed by Fixstars...
for IBM, and Mercury Cell-based systems, as well as for the PlayStation 3. Terra Soft strategically partnered with Mercury to provide a Linux Board Support Package for Cell, and support and development of software applications on various other Cell platforms, including the IBM BladeCenter JS21 and Cell QS20, and Mercury Cell-based solutions. Terra Soft also maintains the Y-HPC (High Performance Computing) Cluster Construction and Management Suite and Y-Bio gene sequencing tools. Y-Bio is built upon the RPM Linux standard for package management, and offers tools which help bioinformatics researchers conduct their work with greater efficiency. IBM has developed a pseudo-filesystem for Linux coined "Spufs" that simplifies access to and use of the SPE resources. IBM is currently maintaining a Linux kernel and GDB ports, while Sony maintains the GNU toolchain
GNU toolchain
The GNU toolchain is a blanket term for a collection of programming tools produced by the GNU Project. These tools form a toolchain used for developing applications and operating systems....
(GCC
GNU Compiler Collection
The GNU Compiler Collection is a compiler system produced by the GNU Project supporting various programming languages. GCC is a key component of the GNU toolchain...
, binutils).
In November 2005, IBM released a "Cell Broadband Engine (CBE) Software Development Kit Version 1.0", consisting of a simulator and assorted tools, to its web site. Development versions of the latest kernel and tools for Fedora Core
Fedora (operating system)
Fedora is a RPM-based, general purpose collection of software, including an operating system based on the Linux kernel, developed by the community-supported Fedora Project and sponsored by Red Hat...
4 are maintained at the Barcelona Supercomputing Center
Barcelona Supercomputing Center
Barcelona Supercomputing Center , also known by the acronym BSC, is a public research center located in Barcelona, Spain. It hosts MareNostrum, Europe's 25th most powerful supercomputer as of November 2009....
website.
In August 2007, Mercury Computer Systems released a Software Development Kit for PLAYSTATION(R)3 for High-Performance Computing.
In November 2007, Fixstars Corporation released the new "CVCell" module aiming to accelerate several important OpenCV
OpenCV
OpenCV is a library of programming functions mainly aimed at real time computer vision, developed by Intel and now supported by Willow Garage. It is free for use under the open source BSD license. The library is cross-platform. It focuses mainly on real-time image processing...
APIs for Cell. In a series of software calculation tests, they recorded execution times on a 3.2 GHz Cell processor that were between 6x and 27x faster compared with the same software on a 2.4 GHz Intel Core 2 Duo.
With the release of kernel version 2.6.16 on March 20, 2006, the Linux kernel officially supports the Cell processor.
See also
- STI Center of Competence for the Cell ProcessorSony Toshiba IBM Center of Competence for the Cell ProcessorThe Sony Toshiba IBM Center of Competence for the Cell Processor is the first Center of Competence dedicated to the promotion and development of Sony Toshiba IBM's Cell microprocessor, an eight-core multiprocessor designed using principles of parallelism and memory latency. The center is part of...
External links
- Cell Broadband Engine resource center
- Standards and documentation
- Sony Computer Entertainment Incorporated's Cell resource page
- Cmpware Configurable Multiprocessor Development Kit for Cell BE
- ISSCC 2005: The CELL Microprocessor, a comprehensive overview of the CELL microarchitecture
- Holy Chip!
- The little broadband engine that could
- Introducing the IBM/Sony/Toshiba Cell Processor — Part I: the SIMD processing units
- Introducing the IBM/Sony/Toshiba Cell Processor -- Part II: The Cell Architecture
- The Soul of Cell: An interview with Dr. H. Peter Hofstee