AMD FireStream
The AMD FireStream is a stream processor produced by Advanced Micro Devices (AMD) that applies the stream processing/GPGPU (general-purpose computing on graphics processing units) concept to heavy floating-point computation, targeting industries such as high-performance computing (HPC), science, and finance. The product line was originally developed by ATI Technologies, which AMD acquired in 2006, and was previously branded as both ATI FireStream and AMD Stream Processor. The AMD FireStream can also be used as a floating-point co-processor for offloading CPU calculations, as part of the Torrenza initiative.
In August 2011, AMD released version 2.5 of the ATI APP Software Development Kit, which includes support for OpenCL 1.1, a parallel computing language developed by the Khronos Group. Compute shaders, officially called DirectCompute in Microsoft's DirectX 11 API, are already supported by graphics drivers with DirectX 11 support.
On a system with [...] processors and two Radeon R600 GPU cores running Microsoft Windows XP Professional, 1 teraflop (TFLOP) can be achieved by a universal multiply-add (MADD) calculation. By comparison, an Intel Core 2 Quad Q9650 3.0 GHz processor can achieve up to 48 GFLOPS.
In demonstrations of Kaspersky SafeStream anti-virus scanning optimized for AMD stream processors, a system with two AMD stream processors and dual Opteron processors reached 6.2 Gbit/s (775 MB/s) of throughput, 21 times faster than other dual-processor systems. The stream processor system also showed only 1-2% CPU utilization, which indicates significant offloading from the CPU to the stream processors.
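The throughput figure above can be checked with a short unit conversion. The following is a minimal sketch, not part of any AMD or Kaspersky tooling; the function names are illustrative, and it assumes that "Gbit" means 10^9 bits. It shows that 6.2 Gbit/s equals 775 decimal megabytes per second, which is a slightly smaller number when expressed in binary mebibytes.

```python
def gbit_per_s_to_bytes_per_s(gbit_per_s: float) -> float:
    """Convert decimal gigabits per second to bytes per second."""
    return gbit_per_s * 1e9 / 8


def bytes_to_mb(n_bytes: float) -> float:
    """Bytes to decimal megabytes (10**6 bytes)."""
    return n_bytes / 1e6


def bytes_to_mib(n_bytes: float) -> float:
    """Bytes to binary mebibytes (2**20 bytes)."""
    return n_bytes / 2**20


rate = gbit_per_s_to_bytes_per_s(6.2)
print(f"{bytes_to_mb(rate):.1f} MB/s")    # decimal megabytes per second
print(f"{bytes_to_mib(rate):.1f} MiB/s")  # binary mebibytes per second
```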
Overview
Since the release of the past-generation Radeon R520 and GeForce G70 GPU cores, programmable shader architectures with large floating-point (FP) throughput have drawn growing attention from academic and commercial groups, primarily for their ability to process data beyond their original purpose of rendering visual effects. In response, more resources were allocated to developing GPGPU products, which evaluate general-purpose mathematical formulas, to handle heavy calculations that previously ran on mainstream server and desktop central processing units (CPUs) and on specialized floating-point math co-processors. GPGPUs were projected to deliver performance gains of more than a factor of 10 over CPU-only computation.
Similar GPGPU efforts appeared in the early 2000s. BionicFX experimented with processing audio data on a GeForce 6800 video card and announced the Audio Video EXchange (AVEX) framework, and ATI performed similar trials at about the same time. Another example is Folding@Home, the distributed computing research program from Stanford University. It was the first piece of software to use the Radeon R580 GPU and other ATI GPU cores, together with a special beta version of the ATI Catalyst driver (version 6.5), to perform computations unrelated to graphics. Since May 2006, it has used GPU cores to accelerate the simulation of protein folding in order to investigate protein-related diseases. At that time, the ATI FireStream was in its planning stages.
With the acquisition of ATI complete, AMD restructured its branding and announced the AMD Stream Processor (originally the ATI FireStream) on November 15, 2006 as the industry's first commercially available hardware stream processing solution. Based on the ATI Radeon X1900 video card, the AMD Stream Processor is a specialized add-on card built around the R580 graphics processing unit (GPU). Unlike the video card, it was targeted at complex floating-point calculations used in scientific and financial fields rather than 3D graphics acceleration. AMD claimed that the processor delivered 8 times the floating-point performance of traditional graphics data processing.
ATI had in fact put considerable effort into research and development (R&D) of a GPGPU product before its purchase by AMD, and in 2006 announced the adoption of the stream processing/GPGPU concept in its line of GPU cores codenamed Radeon R580.
The brand was renamed again, to AMD FireStream, with the second generation of stream processors (based on a 55 nm process), released on November 8, 2007. Plans at the time included a stream processor on an MXM module, intended for embedded applications, and next-generation products in the fourth quarter of 2008.
Hardware
The Radeon series graphics processors are 32-bit single-precision floating-point vector processors. Because vector processors are highly parallel, they have had a large impact on specific data processing applications. The mass client project Folding@Home has reported speed improvements of 20 to 40 times using an R580-based graphics card.
The Radeon R580 core includes a total of 48 pixel and vertex shaders, which act as parallel processors in floating-point calculations. The ATI FireStream add-on card uses the PCI Express x16 interface to provide 8 GiB/s of bandwidth. The card is equipped with 1 GB of GDDR3 local memory, and the GPU runs at a 600 MHz core frequency and a 1.3 GHz memory frequency. The core can execute 512 threads simultaneously (simultaneous multithreading, SMT) at a rated thermal design power (TDP) of 165 W. The main difference between the AMD FireStream and ordinary Radeon series video cards is that the FireStream lacks video output connectors.
The stream processing hardware comes with a hardware interface called THIN (Thin Hardware INterface), or Close to Metal (CTM, previously named Data Parallel Virtual Machine), which exposes the GPU architecture and its native instruction set to program developers. This allows direct control of the stream processors/ALUs and the memory controllers, and permits bypassing the 3D API layer.
The AMD stream processing lineup was updated to the Radeon R600 GPU architecture with the release of a new generation of FireGL video cards on August 7, 2007, which are also capable of stream processing. The architecture was manufactured on the same 80 nm fabrication process node as R580, with more parallel processors and stream processing units. In addition, the maximum GDDR4 memory was increased to 2 GB, providing a maximum of 128 GiB/s of memory bandwidth. The R600 XTX core-based FireGL products released (FireGL V8600 and FireGL V8650) consume more power than the first-generation ATI FireStream, with rated TDPs of under 225 W and over 255 W respectively.
The second generation, the AMD FireStream 9170, is based on the RV670 core and is built on a 55 nm fabrication process. It features the industry's first hardware-based support for double-precision floating-point numbers, asynchronous DMA (which lets the stream processors and onboard memory exchange data without CPU intervention), memory export functionality, and reduced power consumption: less than 150 W with 2 GB of onboard GDDR3 memory on a PCI-E 2.0 interface, which provides 16 GiB/s of device I/O bandwidth.
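The 8 and 16 figures quoted for the PCI Express x16 interfaces can be reproduced from the published per-lane signalling rates. The sketch below is illustrative only: the function name is hypothetical, and it assumes 2.5 GT/s per lane for first-generation PCI Express, 5 GT/s per lane for PCI-E 2.0, 8b/10b line encoding, and that the quoted figures count both directions of the link. Results are in decimal gigabytes per second.

```python
def pcie_bandwidth_gb_s(transfer_rate_gt_s: float,
                        lanes: int,
                        encoding_efficiency: float = 8 / 10,
                        bidirectional: bool = True) -> float:
    """Aggregate PCI Express bandwidth in decimal GB/s.

    transfer_rate_gt_s: raw signalling rate per lane, in GT/s
    encoding_efficiency: payload fraction after line encoding (8b/10b -> 0.8)
    """
    usable_gbit_per_lane = transfer_rate_gt_s * encoding_efficiency
    per_direction_gb_s = usable_gbit_per_lane * lanes / 8  # bits -> bytes
    return per_direction_gb_s * (2 if bidirectional else 1)


print(f"PCIe 1.x x16: {pcie_bandwidth_gb_s(2.5, 16):.1f} GB/s")
print(f"PCIe 2.0 x16: {pcie_bandwidth_gb_s(5.0, 16):.1f} GB/s")
```

Passing `bidirectional=False` gives the per-direction figure, which is half of each value.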
The third generation of products in the AMD FireStream line comprises the FireStream 9250 and 9270. The AMD FireStream 9250, announced on June 16, 2008, is based on the RV770 core and is manufactured on a 55 nm fabrication process. It features 1 TFLOPS of raw single-precision floating-point performance, 1 GiB of GDDR3 memory and a single-slot cooler. The other variant, the AMD FireStream 9270, announced on November 13, 2008, also uses the RV770 core but offers a higher peak floating-point performance of 1.2 TFLOPS, 2 GB of GDDR5 memory and a dual-slot cooler.
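The memory bandwidth figures listed for these cards in the specification table follow from the bus width and memory clock. The sketch below is a rough check rather than a vendor formula: the function name is hypothetical, and it assumes that GDDR3 moves two transfers per listed clock and GDDR5 four. Under those assumptions it reproduces the table values for the 9170, 9250 and 9270 in decimal GB/s. The first-generation row is not reproduced by this formula.

```python
def memory_bandwidth_gb_s(bus_width_bits: int,
                          memory_clock_mhz: float,
                          transfers_per_clock: int) -> float:
    """Peak memory bandwidth in decimal GB/s."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


# (model, bus width in bits, memory clock in MHz, transfers per clock)
cards = [
    ("FireStream 9170 (GDDR3)", 256, 800, 2),
    ("FireStream 9250 (GDDR3)", 256, 993, 2),
    ("FireStream 9270 (GDDR5)", 256, 850, 4),
]
for model, width, clock, transfers in cards:
    print(f"{model}: {memory_bandwidth_gb_s(width, clock, transfers):.1f} GB/s")
```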
AMD stream processing lineup
The hardware specifications of stream processors released by AMD (and previously ATI) are summarized as follows:

Generation | Model | Video card equivalent | GPU core | Threads max. | Core SPUs (NB1) | Core clock (MHz) | Memory bandwidth (GiB/s) | Memory type | Memory bus width (bit) | Memory amount (MiB) | Memory clock (MHz) | FP32 (GFLOPS) | FP64 (GFLOPS) | Peak TDP (W) | Others |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1st (NB2) | 580/2U | Radeon X1900 XTX | R580 | 512 | 48 | 600 | 83.2 | GDDR3 | 256 | 1024 | 650 | 375 | N/A | ≤165 | |
2nd (NB2) | 9170 | Radeon HD 3870 | RV670 | | 64 (320) | 800 | 51.2 | GDDR3 | 256 | 2048 | 800 | 512 | 102.4 (NB3) | ≤105 | |
3rd (NB2) | 9250 | Radeon HD 4850 | RV770 | 16,384 | 160 (800) | 625 | 63.5 | GDDR3 | 256 | 1024 | 993 | 1000 | 200 (NB3) | ≤150 | |
3rd (NB2) | 9270 | Radeon HD 4870 | RV770 | 16,384 | 160 (800) | 750 | 108.8 | GDDR5 | 256 | 2048 | 850 | 1200 | 240 (NB3) | <160 | |
4th (NB2) | 9350 | Radeon HD 5850 | Cypress (RV870) | 31,744 | 288 (1440) | 700 | 128 | GDDR5 | 256 | 2048 | 1000 | 2016 | 403.2 | 150 | codenamed Kestrel |
4th (NB2) | 9370 | Radeon HD 5870 | Cypress (RV870) | 31,744 | 320 (1600) | 825 | 147.2 | GDDR5 | 256 | 4096 | 1150 | 2640 | 528 | 225 | codenamed Osprey |
Notes:
- NB1: The number of Stream Processing Units (SPUs) applies only to DirectX 10-compatible hardware and above, which contains unified shaders. Note also that the Stream Processing Units in ATI hardware are architecturally different from NVIDIA's implementation of Stream Processors (SPs) in Tesla products. The SPs in NVIDIA's implementation have a hot clock domain that runs at a higher frequency than the rest of the core, while the SPUs in ATI's implementation run at the same clock frequency as the core and have no hot clock domain.
- NB2: The first generation of products originally used the ATI FireStream brand and was re-branded as AMD Stream Processor in the brand restructuring that followed AMD's acquisition of ATI. AMD refers to the RV670-based AMD FireStream 9170 as the second generation because no R600-based AMD Stream Processors were released under the stream processing lineup (although prototype cards were publicly demonstrated with configurations similar to the FireGL V8650 but without video output capabilities). Since the FireGL 2007 series, the high-end and ultra high-end FireGL products have implemented stream processing support. This feature is also available on all ATI FirePro cards.
- NB3: Estimated to be one-fifth of the theoretical figure for single-precision operations.
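The FP32 figures in the table for the second generation onward are consistent with counting one multiply-add (two floating-point operations) per stream processor per clock, and the FP64 figures match the one-fifth estimate of note NB3. The following is a minimal sketch under those assumptions; the function names are hypothetical, the stream processor counts are the parenthesised values in the SPU column, and the first-generation R580 row does not follow this simple formula.

```python
def peak_fp32_gflops(stream_processors: int,
                     core_clock_mhz: float,
                     flops_per_cycle: int = 2) -> float:
    """Theoretical single-precision peak, assuming one MADD per processor per clock."""
    return stream_processors * core_clock_mhz * flops_per_cycle / 1000


def estimated_fp64_gflops(fp32_gflops: float) -> float:
    """Double-precision estimate as one-fifth of the FP32 figure (note NB3)."""
    return fp32_gflops / 5


# (model, stream processors, core clock in MHz), taken from the table above
cards = [
    ("FireStream 9170", 320, 800),
    ("FireStream 9250", 800, 625),
    ("FireStream 9270", 800, 750),
    ("FireStream 9350", 1440, 700),
    ("FireStream 9370", 1600, 825),
]
for model, sps, clock in cards:
    fp32 = peak_fp32_gflops(sps, clock)
    print(f"{model}: FP32 {fp32:.1f} GFLOPS, FP64 ~{estimated_fp64_gflops(fp32):.1f} GFLOPS")
```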
Software
The AMD FireStream was launched with a wide range of software platform support. One of the supporting firms was PeakStream (acquired by Google in June 2007), which was the first to provide an open beta version of software supporting CTM and AMD FireStream as well as x86 and Cell (Cell Broadband Engine) processors. The FireStream was claimed to be 20 times faster than regular CPUs in typical applications when running PeakStream's software. RapidMind also provided stream processing software that worked with ATI and NVIDIA hardware, as well as Cell processors.
Software Development Kit
AMD first released its Stream Computing SDK (v1.0) in December 2007 under the AMD EULA, to be run on Windows XP. The SDK includes "Brook+", an AMD hardware-optimized version of the Brook language developed by Stanford University, itself a variant of ANSI C (the C language), open-sourced and optimized for stream computing. The AMD Core Math Library (ACML) and AMD Performance Library (APL), with optimizations for the AMD FireStream, and the COBRA video library (later renamed "Accelerated Video Transcoding" or AVT) for video transcoding acceleration will also be included. Another important part of the SDK, the Compute Abstraction Layer (CAL), is a software development layer aimed at low-level access, through the CTM hardware interface, to the GPU architecture, for performance-tuning software written in various high-level programming languages.
In August 2011, AMD released version 2.5 of the ATI APP Software Development Kit, which includes support for OpenCL 1.1, a parallel computing language developed by the Khronos Group. The concept of compute shaders, officially called DirectCompute, in Microsoft's next-generation API, DirectX 11, is already included in graphics drivers with DirectX 11 support.
Advantages
In an AMD demonstration, a system with two dual-core AMD Opteron processors and two Radeon R600 GPU cores running Microsoft Windows XP Professional achieved 1 teraflop (TFLOP) on a universal multiply-add (MADD) calculation. By comparison, an Intel Core 2 Quad Q9650 3.0 GHz processor can achieve up to 48 GFLOPS.
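The 48 GFLOPS figure is consistent with the usual back-of-the-envelope formula for theoretical peak throughput: cores times clock rate times floating-point operations per cycle per core. The sketch below works through that arithmetic. The value of 4 double-precision operations per cycle per core is an assumption chosen because it reproduces the quoted number, and the function name is illustrative only. The resulting ratio compares peak figures that may not be measured at the same precision, so it should be read as a rough indication rather than a benchmark.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle


# Assumption: 4 double-precision FLOPs per cycle per core (not stated in the text).
CPU_PEAK_GFLOPS = peak_gflops(cores=4, clock_ghz=3.0, flops_per_cycle=4)

# The demonstrated system is quoted at 1 TFLOP, i.e. 1000 GFLOPS.
DEMO_SYSTEM_GFLOPS = 1000.0

# Ratio of the two peak figures (roughly 21x).
PEAK_RATIO = DEMO_SYSTEM_GFLOPS / CPU_PEAK_GFLOPS

if __name__ == "__main__":
    print(f"CPU peak: {CPU_PEAK_GFLOPS:.0f} GFLOPS")
    print(f"Peak ratio: {PEAK_RATIO:.1f}x")
```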
In demonstrations of Kaspersky SafeStream anti-virus scanning optimized for AMD stream processors, a system with two AMD stream processors and dual Opteron processors achieved a bandwidth of 6.2 Gbit/s (775 MB/s), 21 times faster than other dual-processor systems. The stream processor system also showed only 1-2% CPU utilization, which indicates significant offloading from the CPU to the stream processors.
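The two bandwidth figures can be cross-checked with a short unit conversion. The sketch below, with illustrative function names, shows that 6.2 Gbit/s corresponds to 775 million bytes per second in decimal units, which is about 739 MiB/s in binary units.

```python
def gbit_per_s_to_bytes_per_s(gbit_per_s: float) -> float:
    """Convert decimal gigabits per second to bytes per second."""
    return gbit_per_s * 1e9 / 8


def bytes_to_mb(n_bytes: float) -> float:
    """Bytes to decimal megabytes (1 MB = 10**6 bytes)."""
    return n_bytes / 1e6


def bytes_to_mib(n_bytes: float) -> float:
    """Bytes to binary mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20


SCAN_BYTES_PER_S = gbit_per_s_to_bytes_per_s(6.2)

if __name__ == "__main__":
    print(f"{bytes_to_mb(SCAN_BYTES_PER_S):.0f} MB/s")
    print(f"{bytes_to_mib(SCAN_BYTES_PER_S):.0f} MiB/s")
```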
Limitations
- Recursive functions are not supported in Brook+ because all function calls are inlined at compile time. Using CAL, functions (recursive or otherwise) are supported to 32 levels.
- Only bilinear texture filtering is supported; mipmapped textures and anisotropic filtering are not supported at this time.
- There are various deviations from the IEEE 754 standard. Denormal numbers and signaling NaNs are not supported; the rounding mode cannot be changed, and the precision of division/square root is slightly lower than single-precision.
- Functions cannot have a variable number of arguments. The same problem occurs for recursive functions.
- Conversion of floating-point numbers to integers on GPUs is done differently than on x86 CPUs; it is not fully IEEE 754 compliant.
- Doing "global synchronization" on the GPU is not very efficient, which forces the kernel to be divided and the synchronization to be done on the CPU. Given the variable number of multiprocessors and other factors, there may not be a perfect solution to this problem.
- The bus bandwidth and latency between the CPU and the GPU may become a bottleneck, which may be alleviated in the future by introducing interconnects with higher bandwidth.
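The recursion restriction listed above is commonly worked around by rewriting a recursive routine as a loop with an explicit, fixed-capacity stack. The sketch below illustrates that transformation in Python for readability only. Brook+ kernels are written in a C-like language, and every name here (the functions, the flat-array tree layout, the `capacity` parameter) is hypothetical rather than part of any AMD API.

```python
def tree_sum_recursive(values, left, right, node):
    """Reference version: sums a binary tree stored in flat arrays (-1 = no child)."""
    if node == -1:
        return 0.0
    return (values[node]
            + tree_sum_recursive(values, left, right, left[node])
            + tree_sum_recursive(values, left, right, right[node]))


def tree_sum_iterative(values, left, right, root, capacity=64):
    """Same result without recursion, using a preallocated explicit stack."""
    stack = [0] * capacity  # fixed size, as code without dynamic allocation would need
    top = 0                 # number of entries currently on the stack
    if root != -1:
        stack[top] = root
        top += 1
    total = 0.0
    while top > 0:
        top -= 1
        node = stack[top]
        total += values[node]
        for child in (left[node], right[node]):
            if child != -1:
                if top == capacity:
                    raise OverflowError("explicit stack capacity exceeded")
                stack[top] = child
                top += 1
    return total


# A five-node tree: node 0 has children 1 and 2; node 1 has children 3 and 4.
VALUES = [1.0, 2.0, 3.0, 4.0, 5.0]
LEFT = [1, 3, -1, -1, -1]
RIGHT = [2, 4, -1, -1, -1]

if __name__ == "__main__":
    print(tree_sum_iterative(VALUES, LEFT, RIGHT, 0))
```

Both functions visit every node once; the iterative form trades the call stack for an array whose size must be chosen in advance, which is the kind of bound an inlining compiler can handle.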
See also
- ATI Technologies
- Stream processing
- NVIDIA Tesla
- Open Computing Language (OpenCL) - The cross-platform standard supported by both NVIDIA and AMD/ATI
- Compute Unified Device Architecture (CUDA) - A parallel computing architecture developed by NVIDIA
External links
- ATI Stream Technology FAQ
- ATI Stream published papers and presentations
- ATI Stream SDK
- AnandTech article on distributed computing
- AMD Intermediate Language Reference Guide (CAL) v2.0 Feb '09