Asynchronous Array of Simple Processors

The asynchronous array of simple processors (AsAP) architecture comprises a 2-D array of reduced-complexity programmable processors with small memories, interconnected by a reconfigurable mesh network. AsAP was developed by researchers in the VLSI Computation Laboratory (VCL) at the University of California, Davis, and achieves high performance and energy efficiency while using a relatively small circuit area.

AsAP processors are well suited for implementation in future fabrication technologies, and are clocked in a globally asynchronous locally synchronous (GALS) fashion. Individual oscillators fully halt (leakage only) in 9 cycles when there is no work to do, and restart at full speed in less than one cycle after work is available. The chip requires no crystal oscillators, PLLs, DLLs, or any global frequency or phase-related signals whatsoever.
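The halt and restart behavior described above can be illustrated with a toy behavioral model. This is only a sketch: the class `LocalOscillator`, its fields, and the helper `simulate` are hypothetical names introduced here, and the model assumes that the clock keeps toggling for exactly 9 idle periods before halting and that a restart completes within the same period in which work arrives. It does not describe the actual AsAP circuit.

```python
from dataclasses import dataclass
from typing import Iterable

HALT_AFTER_IDLE_CYCLES = 9  # from the text: oscillators halt in 9 cycles when idle


@dataclass
class LocalOscillator:
    """Toy model of one processor's local clock (hypothetical, not the real circuit)."""
    running: bool = True
    idle_cycles: int = 0
    active_cycles: int = 0  # number of periods in which the clock actually toggled

    def step(self, work_available: bool) -> None:
        """Advance the model by one nominal clock period."""
        if work_available:
            # Assumption: restart takes less than one cycle, so the clock
            # is treated as running again within this same period.
            self.running = True
            self.idle_cycles = 0
        if self.running:
            self.active_cycles += 1
            if not work_available:
                self.idle_cycles += 1
                if self.idle_cycles >= HALT_AFTER_IDLE_CYCLES:
                    self.running = False  # halted: leakage power only


def simulate(trace: Iterable[bool]) -> LocalOscillator:
    """Run the oscillator over a sequence of work-available flags."""
    osc = LocalOscillator()
    for work_available in trace:
        osc.step(work_available)
    return osc
```

For a trace with 5 busy periods followed by 20 idle periods, this model toggles the clock for 14 periods and then stays halted, so the remaining 11 idle periods cost no switching activity in the model.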

The multi-processor architecture efficiently makes use of task-level parallelism in many complex DSP applications, and also efficiently computes many large tasks using fine-grain parallelism.

Key features

AsAP uses several novel key features, of which four are:

Chip multi-processor (CMP) architecture designed to achieve high performance and low power for many DSP applications.
Small memories and a simple architecture in each processor to achieve high energy efficiency.
Globally asynchronous locally synchronous (GALS) clocking simplifies the clock design, greatly increases ease of scalability, and can be used to further reduce power dissipation.
Inter-processor communication is performed by a nearest neighbor network to avoid long global wires and increase scalability to large arrays and in advanced fabrication technologies. Each processor can receive data from any two neighbors and send data to any combination of its four neighbors.
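The connectivity rule in the last feature (inputs from any two neighbors, outputs to any combination of the four) can be expressed as a small configuration check. The sketch below is illustrative only: the function names, the direction labels N/S/W/E, and the treatment of edge processors as simply having fewer neighbors are assumptions made here, not details taken from the AsAP design.

```python
from typing import Dict, Iterable, Tuple

Coord = Tuple[int, int]  # (row, col) position in the processor array

# Offsets to the four nearest neighbors in a 2-D mesh.
DIRECTIONS: Dict[str, Coord] = {"N": (-1, 0), "S": (1, 0), "W": (0, -1), "E": (0, 1)}
MAX_INPUTS = 2  # from the text: a processor receives data from any two neighbors


def neighbors(pos: Coord, rows: int, cols: int) -> Dict[str, Coord]:
    """Return the in-bounds nearest neighbors of pos, keyed by direction."""
    r, c = pos
    found = {}
    for name, (dr, dc) in DIRECTIONS.items():
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            found[name] = (nr, nc)
    return found


def check_links(pos: Coord, rows: int, cols: int,
                inputs: Iterable[str], outputs: Iterable[str]) -> bool:
    """True if the requested input and output directions respect the stated limits."""
    available = set(neighbors(pos, rows, cols))
    ins, outs = set(inputs), set(outputs)
    # At most two inputs, and every link must point at an existing neighbor.
    return len(ins) <= MAX_INPUTS and ins <= available and outs <= available
```

On a 6x6 array, `check_links((2, 2), 6, 6, ["N", "W"], ["N", "S", "E", "W"])` is accepted, while a request for three inputs, such as `["N", "S", "E"]`, is rejected.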

AsAP 1 chip: 36 processors

A chip containing 36 (6x6) programmable processors was taped out in May 2005 in 0.18 μm CMOS using a synthesized standard cell technology and is fully functional. Processors on the chip operate at clock rates from 520 MHz to 540 MHz at 1.8 V, and each processor dissipates 32 mW on average while executing applications at 475 MHz.

Most processors run at clock rates over 600 MHz at 2.0 V, which places AsAP among the highest-clock-rate fabricated processors (programmable or non-programmable) known to have been designed in a university; it is the second highest known in published research papers.

At 0.9 V, the average application power per processor is 2.4 mW at 116 MHz. Each processor occupies only 0.66 mm².

AsAP 2 chip: 167 processors

A second-generation 65 nm CMOS design contains 167 processors with dedicated fast Fourier transform (FFT), Viterbi decoder, and video motion estimation processors; 16 KB shared memories; and long-distance inter-processor interconnect. The programmable processors can individually and dynamically change their supply voltage and clock frequency. The chip is fully functional. Processors operate up to 1.2 GHz at 1.3 V, which is believed to be the highest clock rate of any fabricated processor designed in a university. At 1.2 V, they operate at 1.07 GHz and 47 mW when 100% active. At 0.675 V, they operate at 66 MHz and 608 μW when 100% active. This operating point enables 1 trillion multiply-accumulate (MAC) or arithmetic logic unit (ALU) ops/sec with a power dissipation of only 9.2 watts. Due to its MIMD architecture and fine-grain clock oscillator stalling, this energy efficiency per operation is almost perfectly constant across widely varying workloads, which is not the case for many architectures.
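The 9.2 W figure can be reproduced from the per-processor numbers quoted above. The sketch below assumes each processor completes one MAC or ALU operation per clock cycle when 100% active; that assumption and the function names are introduced here for illustration. Under it, the low-voltage operating point costs about 9.2 pJ per operation, so 1 trillion operations per second corresponds to about 9.2 W. Reaching that rate at 66 MHz would take roughly 15,000 processors, far more than the 167 on one chip, so the figure appears to express energy efficiency rather than single-chip throughput.

```python
def energy_per_op_pj(power_w: float, freq_hz: float) -> float:
    """Energy per operation in picojoules, assuming one operation per cycle."""
    return power_w / freq_hz * 1e12


def power_for_rate_w(ops_per_sec: float, power_w: float, freq_hz: float) -> float:
    """Total power for a target operation rate, using identical processors."""
    processors_needed = ops_per_sec / freq_hz  # one op per cycle per processor
    return processors_needed * power_w


# Operating points quoted in the text for a 100% active processor.
HIGH = {"volts": 1.2, "freq_hz": 1.07e9, "power_w": 47e-3}
LOW = {"volts": 0.675, "freq_hz": 66e6, "power_w": 608e-6}

high_pj = energy_per_op_pj(HIGH["power_w"], HIGH["freq_hz"])  # about 43.9 pJ
low_pj = energy_per_op_pj(LOW["power_w"], LOW["freq_hz"])     # about 9.2 pJ

processors_for_1e12 = 1e12 / LOW["freq_hz"]                   # about 15,152
total_power_w = power_for_rate_w(1e12, LOW["power_w"], LOW["freq_hz"])  # about 9.2 W
```

Under the same assumption, the 1.2 V operating point costs about 43.9 pJ per operation, so lowering the supply to 0.675 V cuts the energy per operation by a factor of nearly five at the cost of a roughly 16 times lower clock rate.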

Applications

The coding of many DSP and general tasks for AsAP has been completed. Mapped tasks include filters, convolutional coders, interleavers, sorting, square root, CORDIC sin/cos/arcsin/arccos, matrix multiplication, pseudo-random number generators, fast Fourier transforms (FFTs) of lengths 32-1024, a complete k=7 Viterbi decoder, a JPEG encoder, a complete fully compliant baseband processor for an IEEE 802.11a/g wireless LAN transmitter and receiver, and a complete CAVLC compression block for an H.264 encoder.
Blocks plug directly together with no required modifications. Power, throughput, and area results are typically many times better than those of existing programmable DSP processors.

The architecture enables a clean separation between programming and inter-processor timing, which is handled entirely by hardware. A recently finished C compiler and automatic mapping tool further simplify programming.
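The separation between program logic and inter-processor timing can be illustrated in software with blocking FIFO channels between independently paced stages. This is an analogy only: the text does not describe the hardware mechanism, and the names `stage`, `run_pipeline`, and `FIFO_DEPTH`, as well as the use of threads and bounded queues, are assumptions chosen for the sketch. The point it shows is that each stage is written without any reference to the speed of its neighbors, and the result is the same however the stages are paced.

```python
import queue
import threading
import time
from typing import Callable, Iterable, List, Tuple

FIFO_DEPTH = 4   # hypothetical channel depth, chosen only for the example
DONE = object()  # sentinel marking the end of the data stream


def stage(fn: Callable[[int], int], inbox: queue.Queue,
          outbox: queue.Queue, delay_s: float) -> None:
    """One 'processor': read a word, compute, write a word.

    Blocking get/put calls stand in for hardware flow control, so the
    stage code contains no timing logic of its own.
    """
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        time.sleep(delay_s)  # models this stage's own, independent speed
        outbox.put(fn(item))


def run_pipeline(data: Iterable[int], delays: Tuple[float, float]) -> List[int]:
    """Push data through two chained stages running at independent speeds."""
    q0, q1, q2 = (queue.Queue(maxsize=FIFO_DEPTH) for _ in range(3))

    def feed() -> None:
        for x in data:
            q0.put(x)
        q0.put(DONE)

    threads = [
        threading.Thread(target=feed),
        threading.Thread(target=stage, args=(lambda x: x * 2, q0, q1, delays[0])),
        threading.Thread(target=stage, args=(lambda x: x + 1, q1, q2, delays[1])),
    ]
    for t in threads:
        t.start()

    results = []
    while True:
        item = q2.get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

Calling `run_pipeline(range(10), (0.001, 0.0))` and `run_pipeline(range(10), (0.0, 0.002))` yields the same list, because only throughput, not correctness, depends on the relative stage speeds.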

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.