CUDA - AbsoluteAstronomy.com

CUDA, or Compute Unified Device Architecture, is a parallel computing architecture developed by Nvidia. CUDA is the computing engine in Nvidia graphics processing units (GPUs) that is accessible to software developers through variants of industry-standard programming languages. Programmers use 'C for CUDA' (C with Nvidia extensions and certain restrictions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU. The CUDA architecture shares a range of computational interfaces with two competitors: the Khronos Group's OpenCL and Microsoft's DirectCompute. Third-party wrappers are also available for Python, Perl, Fortran, Java, Ruby, Lua, MATLAB, and IDL, and native support exists in Mathematica.

CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. Using CUDA, the latest Nvidia GPUs become accessible for computation like CPUs. Unlike CPUs, however, GPUs have a parallel throughput architecture that emphasizes executing many concurrent threads slowly, rather than executing a single thread very quickly. This approach of solving general-purpose problems on GPUs is known as GPGPU.

In the computer game industry, in addition to graphics rendering, GPUs are used in game physics calculations (physical effects like debris, smoke, fire, fluids); examples include PhysX and Bullet. CUDA has also been used to accelerate non-graphical applications in computational biology, cryptography and other fields by an order of magnitude or more. An example of this is the BOINC distributed computing client.

CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0, which supersedes the beta released February 14, 2008.

CUDA works with all Nvidia GPUs from the G8x series onwards, including the GeForce, Quadro and Tesla lines. CUDA is compatible with most standard operating systems. Nvidia states that programs developed for the G8x series will also work without modification on all future Nvidia video cards, due to binary compatibility.

Background

The GPU, as a specialized processor, addresses the demands of compute-intensive, real-time, high-resolution 3D graphics tasks. GPUs have evolved into highly parallel multi-core systems allowing very efficient manipulation of large blocks of data. This design is more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel, such as:

push-relabel maximum flow algorithm
fast sort algorithms of large lists
two-dimensional fast wavelet transform

For instance, the parallel nature of molecular dynamics simulations is suitable for CUDA implementation.

Advantages

CUDA has several advantages over traditional general-purpose computation on GPUs (GPGPU) using graphics APIs:

Scattered reads – code can read from arbitrary addresses in memory
Shared memory – CUDA exposes a fast shared memory region (up to 48KB per Multi-Processor) that can be shared amongst threads. This can be used as a user-managed cache, enabling higher bandwidth than is possible using texture lookups.
Faster downloads and readbacks to and from the GPU
Full support for integer and bitwise operations, including integer texture lookups
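
The shared-memory advantage can be pictured with a small CPU-side sketch. The code below is plain Python and does not run on a GPU; it only emulates how the threads of one block might stage data in a block-local scratch buffer and then combine it with a tree reduction. The names `BLOCK_SIZE`, `block_sum` and `reduce_all` are hypothetical and chosen purely for illustration.

```python
# CPU-side emulation of a per-block tree reduction; no GPU is involved.
BLOCK_SIZE = 8  # hypothetical block size (a power of two), kept small for illustration


def block_sum(global_data, block_idx):
    """Sum one block's slice of global_data using a block-local scratch buffer."""
    start = block_idx * BLOCK_SIZE
    # Each "thread" copies one element from global memory into the block's
    # scratch buffer, which stands in for CUDA shared memory. Out-of-range
    # threads contribute zero.
    shared = [
        global_data[start + t] if start + t < len(global_data) else 0.0
        for t in range(BLOCK_SIZE)
    ]
    # Tree reduction: the number of active threads is halved at each step.
    stride = BLOCK_SIZE // 2
    while stride > 0:
        for t in range(stride):  # on a GPU these iterations would run concurrently
            shared[t] += shared[t + stride]
        stride //= 2  # a real kernel would synchronize the block at this point
    return shared[0]


def reduce_all(global_data):
    """Combine the per-block partial sums into one total."""
    num_blocks = (len(global_data) + BLOCK_SIZE - 1) // BLOCK_SIZE
    return sum(block_sum(global_data, b) for b in range(num_blocks))
```

In this sketch every input element is read from the slow "global" list exactly once; all later accesses go to the small scratch buffer, which is the role the text assigns to shared memory as a user-managed cache.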

Limitations

Texture rendering is not supported (CUDA 3.2 and up addresses this by introducing "surface writes" to CUDA arrays, the underlying opaque data structure).
Copying between host and device memory may incur a performance hit due to system bus bandwidth and latency (this can be partly alleviated with asynchronous memory transfers, handled by the GPU's DMA engine)
Threads should be running in groups of at least 32 for best performance, with the total number of threads numbering in the thousands. Branches in the program code do not impact performance significantly, provided that each of 32 threads takes the same execution path; the SIMD execution model becomes a significant limitation for any inherently divergent task (e.g. traversing a space partitioning data structure during ray tracing).
Unlike OpenCL, CUDA-enabled GPUs are only available from Nvidia
Valid C/C++ may sometimes be flagged and prevent compilation due to optimization techniques the compiler is required to employ to use limited resources.
CUDA (with compute capability 1.x) uses a recursion-free, function-pointer-free subset of the C language, plus some simple extensions. However, a single process must run spread across multiple disjoint memory spaces, unlike other C language runtime environments.
CUDA (with compute capability 2.x) allows a subset of C++ class functionality, for example member functions may not be virtual (this restriction will be removed in some future release). [See CUDA C Programming Guide 3.1 - Appendix D.6]
Double-precision operations (CUDA compute capability 1.3 and above) deviate from the IEEE 754 standard: round-to-nearest-even is the only supported rounding mode for reciprocal, division, and square root. In single precision, denormals and signalling NaNs are not supported; only two IEEE rounding modes are supported (chop and round-to-nearest-even), and those are specified on a per-instruction basis rather than in a control word; and the precision of division/square root is slightly lower than single precision.
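
The branching limitation can be made concrete with a short sketch. The following plain-Python code (not GPU code) groups threads into warps of 32, as described above, and counts how many warps contain threads that disagree on a branch. The function name `count_divergent_warps` and the two sample branch patterns are hypothetical.

```python
WARP_SIZE = 32  # group size of 32 threads, as stated in the text


def count_divergent_warps(takes_branch):
    """Count warps whose threads do not all follow the same execution path.

    takes_branch[i] is True if thread i takes the branch, False otherwise.
    """
    divergent = 0
    for start in range(0, len(takes_branch), WARP_SIZE):
        warp = takes_branch[start:start + WARP_SIZE]
        # A warp diverges when it mixes True and False outcomes.
        if any(warp) and not all(warp):
            divergent += 1
    return divergent


# Branching on the warp index keeps every warp uniform ...
uniform = [(i // WARP_SIZE) % 2 == 0 for i in range(128)]
# ... while branching on the thread's parity splits every warp.
mixed = [i % 2 == 0 for i in range(128)]
```

With 128 threads, the `uniform` pattern yields no divergent warps, while the `mixed` pattern makes all four warps divergent, which is the case the text warns about.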

Supported GPUs

Compute capability table (version of CUDA supported) by GPU and card. Also available directly from Nvidia

Compute capability (version)	GPUs	Cards
1.0	G80, G92, G92b, G94, G94b	GeForce 8800GTX/Ultra, 9400GT, 9600GT, 9800GT, Tesla C/D/S870, FX4/5600, 360M, GT 420
1.1	G86, G84, G98, G96, G96b, G94, G94b, G92, G92b	GeForce 8400GS/GT, 8600GT/GTS, 8800GT/GTS, 9600 GSO, 9800GTX/GX2, GTS 250, GT 120/30/40, FX 4/570, 3/580, 17/18/3700, 4700x2, 1xxM, 32/370M, 3/5/770M, 16/17/27/28/36/37/3800M, NVS420/50
1.2	GT218, GT216, GT215	GeForce 210, GT 220/40, FX380 LP, 1800M, 370/380M, NVS 2/3100M
1.3	GT200, GT200b	GeForce GTX 260, GTX 275, GTX 280, GTX 285, GTX 295, Tesla C/M1060, S1070, Quadro CX, FX 3/4/5800
2.0	GF100, GF110	GeForce (GF100) GTX 465, GTX 470, GTX 480, Tesla C2050, C2070, S/M2050/70, Quadro Plex 7000, GeForce (GF110) GTX570, GTX580, GTX590
2.1	GF104, GF114, GF116, GF108, GF106	GeForce GT 430, GT 440, GTS 450, GTX 460, GTX 550 Ti, GTX 560, GTX 560 Ti, 500M, Quadro 600, 2000, 4000, 5000, 6000
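
As a rough illustration of how the table might be used, the sketch below maps a few chips to their compute capability and checks them against the 1.3 threshold for double precision mentioned under Limitations. It is plain Python; the dictionary holds only chips that appear in a single row of the table, and the names `COMPUTE_CAPABILITY` and `supports_double_precision` are hypothetical.

```python
# Chip-to-compute-capability entries copied from the table above
# (only chips that appear in a single row are included).
COMPUTE_CAPABILITY = {
    "G80": (1, 0),
    "G86": (1, 1),
    "GT218": (1, 2),
    "GT200": (1, 3),
    "GF100": (2, 0),
    "GF104": (2, 1),
}

# Double precision needs compute capability 1.3 or above (see Limitations).
DOUBLE_PRECISION_MIN = (1, 3)


def supports_double_precision(chip):
    """Return True if the named chip meets the double-precision threshold."""
    # Tuples compare element by element, so (2, 0) >= (1, 3) is True.
    return COMPUTE_CAPABILITY[chip] >= DOUBLE_PRECISION_MIN
```

Storing the version as a `(major, minor)` tuple avoids the pitfalls of comparing version strings or floating-point numbers.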

A table of devices officially supporting CUDA follows. (Note that many applications require at least 256 MB of dedicated VRAM, and some recommend at least 96 CUDA cores.)

See the full list at http://developer.nvidia.com/cuda-gpus

Nvidia GeForce

GeForce GTX 590
GeForce GTX 580
GeForce GTX 570
GeForce GTX 560 Ti
GeForce GTX 560
GeForce GTX 550 Ti
GeForce GTX 480
GeForce GTX 470
GeForce GTX 465
GeForce GTX 460
GeForce GTX 460 SE
GeForce GTS 450
GeForce GT 440
GeForce GT 430
GeForce GT 420
GeForce GTX 295
GeForce GTX 285
GeForce GTX 280
GeForce GTX 275
GeForce GTX 260
GeForce GTS 250
GeForce GTS 240
GeForce GT 240
GeForce GT 220
GeForce 210/G210
GeForce GT 140
GeForce 9800 GX2
GeForce 9800 GTX+
GeForce 9800 GTX
GeForce 9800 GT
GeForce 9600 GSO
GeForce 9600 GT
GeForce 9500 GT
GeForce 9400 GT
GeForce 9400 mGPU
GeForce 9300 mGPU
GeForce 9100 mGPU
GeForce 8800 Ultra
GeForce 8800 GTX
GeForce 8800 GTS
GeForce 8800 GT
GeForce 8800 GS
GeForce 8600 GTS
GeForce 8600 GT
GeForce 8600 mGT
GeForce 8500 GT
GeForce 8400 GS
GeForce 8300 mGPU
GeForce 8200 mGPU
GeForce 8100 mGPU

Nvidia GeForce Mobile

GeForce GTX 580M
GeForce GTX 570M
GeForce GTX 560M
GeForce GT 555M
GeForce GT 550M
GeForce GT 540M
GeForce GT 525M
GeForce GT 520M
GeForce GTX 480M
GeForce GTX 470M
GeForce GTX 460M
GeForce GT 445M
GeForce GT 435M
GeForce GT 425M
GeForce GT 420M
GeForce GT 415M
GeForce GTX 285M
GeForce GTX 280M
GeForce GTX 260M
GeForce GTS 360M
GeForce GTS 350M
GeForce GTS 260M
GeForce GTS 250M
GeForce GT 335M
GeForce GT 330M
GeForce GT 325M
GeForce GT 320M
GeForce 310M
GeForce GT 240M
GeForce GT 230M
GeForce GT 220M
GeForce G210M
GeForce GTS 160M
GeForce GTS 150M
GeForce GT 130M
GeForce GT 120M
GeForce G110M
GeForce G105M
GeForce G103M
GeForce G102M
GeForce G100
GeForce 9800M GTX
GeForce 9800M GTS
GeForce 9800M GT
GeForce 9800M GS
GeForce 9700M GTS
GeForce 9700M GT
GeForce 9650M GT
GeForce 9650M GS
GeForce 9600M GT
GeForce 9600M GS
GeForce 9500M GS
GeForce 9500M G
GeForce 9400M G
GeForce 9300M GS
GeForce 9300M G
GeForce 9200M GS
GeForce 9100M G
GeForce 8800M GTX
GeForce 8800M GTS
GeForce 8700M GT
GeForce 8600M GT
GeForce 8600M GS
GeForce 8400M GT
GeForce 8400M GS
GeForce 8400M G
GeForce 8200M G

Nvidia Quadro

Quadro 6000
Quadro 5000
Quadro 4000
Quadro 2000
Quadro 600
Quadro FX 5800
Quadro FX 5600
Quadro FX 4800
Quadro FX 4700 X2
Quadro FX 4600
Quadro FX 3800
Quadro FX 3700
Quadro FX 1800
Quadro FX 1700
Quadro FX 580
Quadro FX 570
Quadro FX 380
Quadro FX 370
Quadro NVS 450
Quadro NVS 420
Quadro NVS 295
Quadro NVS 290
Quadro Plex 1000 Model IV
Quadro Plex 1000 Model S4

Nvidia Quadro Mobile

Quadro 5010M
Quadro 5000M
Quadro 4000M
Quadro 3000M
Quadro 2000M
Quadro 1000M
Quadro FX 3800M
Quadro FX 3700M
Quadro FX 3600M
Quadro FX 2800M
Quadro FX 2700M
Quadro FX 1800M
Quadro FX 1700M
Quadro FX 1600M
Quadro FX 880M
Quadro FX 770M
Quadro FX 570M
Quadro FX 380M
Quadro FX 370M
Quadro FX 360M
Quadro NVS 320M
Quadro NVS 160M
Quadro NVS 150M
Quadro NVS 140M
Quadro NVS 135M
Quadro NVS 130M

Nvidia Tesla

Tesla C2050/2070
Tesla M2050/M2070
Tesla S2050
Tesla S1070
Tesla M1060
Tesla C1060
Tesla C870
Tesla D870
Tesla S870

Version features and specifications

Feature support by compute capability (unlisted features are supported for all compute capabilities)	1.0	1.1	1.2	1.3	2.x
Integer atomic functions operating on 32-bit words in global memory	No	Yes	Yes	Yes	Yes
atomicExch operating on 32-bit floating point values in global memory	No	Yes	Yes	Yes	Yes
Integer atomic functions operating on 32-bit words in shared memory	No	No	Yes	Yes	Yes
atomicExch operating on 32-bit floating point values in shared memory	No	No	Yes	Yes	Yes
Integer atomic functions operating on 64-bit words in global memory	No	No	Yes	Yes	Yes
Warp vote functions	No	No	Yes	Yes	Yes
Double-precision floating-point operations	No	No	No	Yes	Yes
Atomic functions operating on 64-bit integer values in shared memory	No	No	No	No	Yes
Floating-point atomic addition operating on 32-bit words in global and shared memory	No	No	No	No	Yes
__ballot()	No	No	No	No	Yes
__threadfence_system()	No	No	No	No	Yes
__syncthreads_count(), __syncthreads_and(), __syncthreads_or()	No	No	No	No	Yes
Surface functions	No	No	No	No	Yes
3D grid of thread blocks	No	No	No	No	Yes

Technical specifications	Groups of compute capabilities that share one value
Maximum dimensionality of grid of thread blocks	1.0-1.3; 2.x
Maximum x-, y-, or z-dimension of a grid of thread blocks	all (1.0-2.x)
Maximum dimensionality of thread block	all (1.0-2.x)
Maximum x- or y-dimension of a block	1.0-1.3; 2.x
Maximum z-dimension of a block	all (1.0-2.x)
Maximum number of threads per block	1.0-1.3; 2.x
Warp size	all (1.0-2.x)
Maximum number of resident blocks per multiprocessor	all (1.0-2.x)
Maximum number of resident warps per multiprocessor	1.0-1.1; 1.2-1.3; 2.x
Maximum number of resident threads per multiprocessor	1.0-1.1; 1.2-1.3; 2.x
Number of 32-bit registers per multiprocessor	1.0-1.1; 1.2-1.3; 2.x
Maximum amount of shared memory per multiprocessor	1.0-1.3; 2.x
Number of shared memory banks	1.0-1.3; 2.x
Amount of local memory per thread	1.0-1.3; 2.x
Constant memory size	all (1.0-2.x)
Cache working set per multiprocessor for constant memory	all (1.0-2.x)
Cache working set per multiprocessor for texture memory	all (1.0-2.x)
Maximum width for 1D texture reference bound to a CUDA array	1.0-1.3; 2.x
Maximum width for 1D texture reference bound to linear memory	all (1.0-2.x)
Maximum width and number of layers for a 1D layered texture reference	1.0-1.3; 2.x
Maximum width and height for 2D texture reference bound to linear memory or a CUDA array	1.0-1.3; 2.x
Maximum width, height, and number of layers for a 2D layered texture reference	1.0-1.3; 2.x
Maximum width, height and depth for a 3D texture reference bound to linear memory or a CUDA array	all (1.0-2.x)
Maximum number of textures that can be bound to a kernel	all (1.0-2.x)
Maximum width for a 1D surface reference bound to a CUDA array	1.0-1.3; 2.x
Maximum width and height for a 2D surface reference bound to a CUDA array	1.0-1.3; 2.x
Maximum number of surfaces that can be bound to a kernel	1.0-1.3; 2.x
Maximum number of instructions per kernel	all (1.0-2.x)

Architecture specifications	Groups of compute capabilities that share one value
Number of cores for integer and floating-point arithmetic operations	1.0-1.3; 2.0; 2.1
Number of special function units for single-precision floating-point transcendental functions	1.0-1.3; 2.0; 2.1
Number of texture filtering units for every texture address unit or Render Output Unit (ROP)	1.0-1.3; 2.0; 2.1
Number of warp schedulers	1.0-1.3; 2.0; 2.1
Number of instructions issued at once by scheduler	1.0-1.3; 2.0; 2.1

For more information, see http://www.geeks3d.com/20100606/gpu-computing-nvidia-cuda-compute-capability-comparative-table/ and the Nvidia CUDA programming guide.

Example

This example code in C++ loads a texture from an image into an array on the GPU:

texture<float, 2, cudaReadModeElementType> tex;

// Forward declaration so that foo() can launch the kernel defined below
__global__ void kernel(float* odata, int height, int width);

// image, width, height and d_data are assumed to be defined elsewhere
void foo()
{
    cudaArray* cu_array;

    // Allocate array
    cudaChannelFormatDesc description = cudaCreateChannelDesc<float>();
    cudaMallocArray(&cu_array, &description, width, height);

    // Copy image data to array, starting at offset (0, 0)
    cudaMemcpyToArray(cu_array, 0, 0, image, width*height*sizeof(float), cudaMemcpyHostToDevice);

    // Set texture parameters (default)
    tex.addressMode[0] = cudaAddressModeClamp;
    tex.addressMode[1] = cudaAddressModeClamp;
    tex.filterMode = cudaFilterModePoint;
    tex.normalized = false; // do not normalize coordinates

    // Bind the array to the texture
    cudaBindTextureToArray(tex, cu_array);

    // Run kernel with enough 16x16 blocks to cover the whole image
    dim3 blockDim(16, 16, 1);
    dim3 gridDim((width + blockDim.x - 1) / blockDim.x, (height + blockDim.y - 1) / blockDim.y, 1);
    kernel<<< gridDim, blockDim, 0 >>>(d_data, height, width);

    // Unbind the array from the texture
    cudaUnbindTexture(tex);
} //end foo

__global__ void kernel(float* odata, int height, int width)
{
    // Each thread handles one pixel
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
    // Skip threads that fall outside the image
    if (x < width && y < height) {
        float c = tex2D(tex, x, y);
        odata[y*width+x] = c;
    }
}
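
The grid-size arithmetic in the launch code above can be checked on the CPU. The sketch below is plain Python, not CUDA; it reproduces the rounding-up division used for `gridDim` and emulates the launch to count how many threads pass the kernel's bounds check. The helper names `grid_size` and `covered_pixels` are hypothetical.

```python
def grid_size(width, height, block=(16, 16)):
    """Blocks needed per axis so that the grid covers a width x height image."""
    bx, by = block
    # Integer ceiling division, matching (width + blockDim.x - 1) / blockDim.x.
    return ((width + bx - 1) // bx, (height + by - 1) // by)


def covered_pixels(width, height, block=(16, 16)):
    """Emulate the launch and return how many threads pass the bounds check."""
    gx, gy = grid_size(width, height, block)
    bx, by = block
    count = 0
    for block_y in range(gy):
        for block_x in range(gx):
            for ty in range(by):
                for tx in range(bx):
                    x = block_x * bx + tx
                    y = block_y * by + ty
                    if x < width and y < height:  # same guard as the kernel
                        count += 1
    return count
```

For a 100 x 50 image, `grid_size(100, 50)` gives `(7, 4)`: the grid launches 112 x 64 threads, and the bounds check leaves exactly 100 x 50 = 5000 of them active, one per pixel.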

Below is an example given in Python that computes the product of two arrays on the GPU. The unofficial Python language bindings can be obtained from PyCUDA.

import pycuda.compiler as comp
import pycuda.driver as drv
import numpy
import pycuda.autoinit

# Compile the CUDA kernel; each thread multiplies one pair of elements
mod = comp.SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
    const int i = threadIdx.x;
    dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

# Launch a single block of 400 threads, one per array element
dest = numpy.zeros_like(a)
multiply_them(
    drv.Out(dest), drv.In(a), drv.In(b),
    block=(400,1,1))

# Difference from the CPU result; should be all zeros
print(dest - a*b)

Additional Python bindings to simplify matrix multiplication operations can be found in the program pycublas.

import numpy
from pycublas import CUBLASMatrix
A = CUBLASMatrix( numpy.mat([[1,2,3],[4,5,6]],numpy.float32) )
B = CUBLASMatrix( numpy.mat([[2,3],[4,5],[6,7]],numpy.float32) )
C = A*B
print(C.np_mat)
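
For reference, the product that the pycublas example computes can be reproduced without a GPU. The sketch below is a plain-Python matrix multiplication using the values that appear in the example, read as a 2x3 and a 3x2 matrix; the function `matmul` is a hypothetical helper and is not part of pycublas.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    # Entry (i, j) is the dot product of row i of a and column j of b.
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(len(a))
    ]


A = [[1, 2, 3], [4, 5, 6]]
B = [[2, 3], [4, 5], [6, 7]]
C = matmul(A, B)
```

Here `C` evaluates to `[[28, 34], [64, 79]]`, which gives a value to compare against when checking a GPU result.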

Language bindings

Fortran - FORTRAN CUDA, PGI CUDA Fortran Compiler
Lua - KappaCUDA
IDL - GPULib
Mathematica - CUDALink
MATLAB - Parallel Computing Toolbox, Distributed Computing Server, and third-party packages like Jacket
.NET - CUDA.NET; CUDAfy.NET (.NET kernel and host code, CURAND, CUBLAS, CUFFT)
Perl - KappaCUDA, CUDA::Minimal
Python - PyCUDA, KappaCUDA
Ruby - KappaCUDA
Java - jCUDA, JCuda, JCublas, JCufft
Haskell - Data.Array.Accelerate

Current CUDA architectures

The current-generation CUDA architecture (codename "Fermi"), which is standard on Nvidia's GeForce 400 Series (GF100 GPU, released 2010-03-27), is designed from the ground up to natively support more programming languages, such as C++. It has eight times the peak double-precision floating-point performance of Nvidia's previous-generation Tesla GPU. It also introduced several new features, including:

up to 512 CUDA cores and 3.0 billion transistors per GPU (1024 CUDA cores in total on the dual-GPU GTX 590)
Nvidia Parallel DataCache technology
Nvidia GigaThread engine
ECC memory support
Native support for Visual Studio

Current and future usages of CUDA architecture

Accelerated rendering of 3D graphics
Accelerated interconversion of video file formats
Accelerated encryption, decryption and compression
Distributed calculations, such as predicting the native conformation of proteins
Medical analysis simulations, for example virtual reality based on CT and MRI scan images
Physical simulations, in particular in fluid dynamics
Real-time cloth simulation (OptiTex.com)
The Search for Extra-Terrestrial Intelligence (SETI@Home) program

External links

Now code with Cuda@TechRefined, Jul 02, 2011
Nvidia CUDA Official site
Nvidia Parallel Nsight
Nvidia CUDA developer registration for professional developers and researchers
Nvidia CUDA GPU Computing developer forums
CUDALink package for Mathematica
Programming Massively Parallel Processors: A Hands-on Approach
Intro to GPGPU computing featuring CUDA and OpenCL examples
Integrating CUDA with GNU Autotools
A conversation with Jen-Hsun Huang, CEO of Nvidia, Charlie Rose, February 5, 2009
Scientific Publications, Videos and Software using CUDA
How-to Guide: Running CUDA on Visual Studio 2008
Beyond3D – Introducing CUDA Nvidia's Vision for GPU Computing March 10, 2007
University of Illinois Nvidia CUDA Course taught by Wen-mei Hwu and David Kirk, Spring 2009
University of Wisconsin-Madison Course, Spring 2011
CUDA: Breaking the Intel & AMD Dominance
NVidia CUDA Tutorial Slides (from DoD HPCMP2009)
Ascalaph Liquid GPU molecular dynamics
CUDA implementation for multi-core processors
Integrate CUDA with Visual C++, September 26, 2008
CUDA.NET - .NET library for CUDA, Linux/Windows compliant
CUDA.CS.MSU.SU Russian CUDA developer community
Enable Intellisense for CUDA in Visual Studio 2008, April 29, 2009
CUDA Tutorials for high performance computing
An introduction to CUDA (French)
NVidia CUDA Tutorial & Examples (from ISC2009)
GPUBrasil.com, First website on GPGPU in Portuguese
3D cloth Simulation OptiTex.com, Implementation of CUDA in Cloth simulation
DualSPHysics, Implementation of CUDA in Smoothed Particle Hydrodynamics
DDJ CUDA series, First in the Doctor Dobb's Journal series teaching CUDA (over 22 articles by Rob Farber)
Technical Report on implementing a real-time fluid simulator with CUDA
CUDA - A demonstration w.r.t Exact String Matching Algorithms
Exploring CUDA via the Euclidean Distance
CUDA Examples
Parallel Computing Center: parallel computing using GPUs; creating and porting various applications (jCUDA, CUDA C++). Khmelnitskiy National University, Ukraine.

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
