Automatically Tuned Linear Algebra Software
Encyclopedia
Automatically Tuned Linear Algebra Software (ATLAS) is a software library
Library (computer science)
In computer science, a library is a collection of resources used to develop software. These may include pre-written code and subroutines, classes, values or type specifications....

 for linear algebra
Linear algebra
Linear algebra is a branch of mathematics that studies vector spaces, also called linear spaces, along with linear functions that input one vector and output another. Such functions are called linear maps and can be represented by matrices if a basis is given. Thus matrix theory is often...

. It provides a mature open source
Open source
The term open source describes practices in production and development that promote access to the end product's source materials. Some consider open source a philosophy, others consider it a pragmatic methodology...

 implementation of BLAS
Basic Linear Algebra Subprograms
Basic Linear Algebra Subprograms is a de facto application programming interface standard for publishing libraries to perform basic linear algebra operations such as vector and matrix multiplication. They were first published in 1979, and are used to build larger packages such as LAPACK...

 APIs
Application programming interface
An application programming interface is a source code based specification intended to be used as an interface by software components to communicate with each other...

 for C and Fortran77
Fortran
Fortran is a general-purpose, procedural, imperative programming language that is especially suited to numeric computation and scientific computing...

.

ATLAS is often recommended as a way to automatically generate an optimized
Optimization (computer science)
In computer science, program optimization or software optimization is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources...

 BLAS library. While its performance often trails that of specialized libraries written for one specific hardware platform
Platform (computing)
A computing platform includes some sort of hardware architecture and a software framework , where the combination allows software, particularly application software, to run...

, it is often the first or even only optimized BLAS implementation available on new systems and is a large improvement over the generic BLAS available at Netlib
Netlib
Netlib is a repository of software for scientific computing maintained by AT&T, Bell Laboratories, the University of Tennessee and Oak Ridge National Laboratory. Netlib comprises a large number of separate programs and libraries...

. For this reason, ATLAS is sometimes used as a performance baseline for comparison with other products.

ATLAS runs on most Unix
Unix
Unix is a multitasking, multi-user computer operating system originally developed in 1969 by a group of AT&T employees at Bell Labs, including Ken Thompson, Dennis Ritchie, Brian Kernighan, Douglas McIlroy, and Joe Ossanna...

-like operating systems and on Microsoft Windows
Microsoft Windows
Microsoft Windows is a series of operating systems produced by Microsoft.Microsoft introduced an operating environment named Windows on November 20, 1985 as an add-on to MS-DOS in response to the growing interest in graphical user interfaces . Microsoft Windows came to dominate the world's personal...

 (using Cygwin
Cygwin
Cygwin is a Unix-like environment and command-line interface for Microsoft Windows. Cygwin provides native integration of Windows-based applications, data, and other system resources with applications, software tools, and data of the Unix-like environment...

). It is released under a BSD-style license without advertising clause, and many well-known mathematics applications including Sage, MATLAB
MATLAB
MATLAB is a numerical computing environment and fourth-generation programming language. Developed by MathWorks, MATLAB allows matrix manipulations, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in other languages,...

, Scilab
Scilab
Scilab is an open source, cross-platform numerical computational package and a high-level, numerically oriented programming language. Itcan be used for signal processing, statistical analysis, image enhancement, fluid dynamics simulations, numerical optimization, and modeling and simulation of...

, Mathematica
Mathematica
Mathematica is a computational software program used in scientific, engineering, and mathematical fields and other areas of technical computing...

, and GNU Octave
GNU Octave
GNU Octave is a high-level language, primarily intended for numerical computations. It provides a convenient command-line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with MATLAB...

 use it.

Functionality

ATLAS provides a full implementation of the BLAS APIs as well as some additional functions from LAPACK
LAPACK
-External links:* : a modern replacement for PLAPACK and ScaLAPACK* on Netlib.org* * * : a modern replacement for LAPACK that is MultiGPU ready* on Sourceforge.net* * optimized LAPACK for Solaris OS on SPARC/x86/x64 and Linux* * *...

, a higher-level library built on top of BLAS. In BLAS, functionality is divided into three groups called levels 1, 2 and 3.
  • Level 1 contains vector operations of the form


as well as scalar dot product
Dot product
In mathematics, the dot product or scalar product is an algebraic operation that takes two equal-length sequences of numbers and returns a single number obtained by multiplying corresponding entries and then summing those products...

s and vector norm
Norm (mathematics)
In linear algebra, functional analysis and related areas of mathematics, a norm is a function that assigns a strictly positive length or size to all vectors in a vector space, other than the zero vector...

s, among other things.

  • Level 2 contains matrix-vector operations of the form


as well as solving for x with being triangular, among other things.

  • Level 3 contains matrix-matrix operations such as the widely used General Matrix Multiply
    General Matrix Multiply
    The General Matrix Multiply is a subroutine in the Basic Linear Algebra Subprograms which performs matrix multiplication, that is the multiplication of two matrices...

     (GEMM) operation


as well as solving for triangular matrices , among other things.

Optimization approach

The optimization
Optimization (computer science)
In computer science, program optimization or software optimization is the process of modifying a software system to make some aspect of it work more efficiently or use fewer resources...

 approach is called Automated Empirical Optimization of Software (AEOS), which identifies four fundamental approaches to computer assisted optimization of which ATLAS employs three:
  1. Parameter
    Parameter (computer science)
    In computer programming, a parameter is a special kind of variable, used in a subroutine to refer to one of the pieces of data provided as input to the subroutine. These pieces of data are called arguments...

    ization -- searching over the parameter space of a function, used for blocking factor, cache edge, ...
  2. Multiple implementation -- searching through various approaches to implementing the same function, e.g., for SSE
    Streaming SIMD Extensions
    In computing, Streaming SIMD Extensions is a SIMD instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series processors as a reply to AMD's 3DNow! . SSE contains 70 new instructions, most of which work on single precision floating point...

     support before intrinsics made them available in C code
  3. Code generation
    Automatic programming
    In computer science, the term automatic programming identifies a type of computer programming in which some mechanism generates a computer program to allow human programmers to write the code at a higher abstraction level....

     -- programs that write programs incorporating what knowledge they can about what will produce the best performance for the system

  • Optimization of the level 1 BLAS uses parameterization and multiple implementation
Every ATLAS level 1 BLAS function has its own kernel. Since it would be difficult to maintain thousands of cases in ATLAS there is little architecture specific optimization for Level 1 BLAS. Instead multiple implementation is relied upon to allow for compiler optimization
Compiler optimization
Compiler optimization is the process of tuning the output of a compiler to minimize or maximize some attributes of an executable computer program. The most common requirement is to minimize the time taken to execute a program; a less common one is to minimize the amount of memory occupied...

 to produce high performance implementation for the system.

  • Optimization of the level 2 BLAS uses parameterization and multiple implementation
With data and operations to perform the function is usually limited by bandwidth to memory, and thus there is not much opportunity for optimization
All routines in the ATLAS level 2 BLAS are built from two Level 2 BLAS kernels:
  • GEMV -- matrix by vector multiply update:
  • GER -- general rank 1 update from an outer product:

  • Optimization of the level 3 BLAS uses code generation and the other two techniques
Since we have ops with only data, many opportunities for optimization

Level 3 BLAS

Most of the Level 3 BLAS is derived from GEMM
General Matrix Multiply
The General Matrix Multiply is a subroutine in the Basic Linear Algebra Subprograms which performs matrix multiplication, that is the multiplication of two matrices...

, so that is the primary focus of the optimization.
operations vs. data

The intuition that the operations will dominate over the data accesses only works for roughly square matrices.
The real measure should be some kind of surface area to volume.
The difference becomes important for very non-square matrices.

Can it afford to copy?

Copying the inputs allows the data to be arranged in a way that provides optimal access for the kernel functions,
but this comes at the cost of allocating temporary space, and an extra read and write of the inputs.

So the first question GEMM faces is, can it afford to copy the inputs?

If so,
  • Put into block major format with good alignment
  • Take advantage of user contributed kernels and cleanup
  • Handle the transpose cases with the copy: make everything into TN (transpose - no-transpose)
  • Deal with α in the copy


If not,
  • Use the nocopy version
  • Make no assumptions on the stride of matrix A and B in memory
  • Handle all transpose cases explicitly
  • No guarantee about alignment of data
  • Support α specific code
  • Run the risk of TLB
    Translation Lookaside Buffer
    A translation lookaside buffer is a CPU cache that memory management hardware uses to improve virtual address translation speed. All current desktop and server processors use a TLB to map virtual and physical address spaces, and it is ubiquitous in any hardware which utilizes virtual memory.The...

     issues, bad strides, ...


The actual decision is made through a simple heuristic which checks for "skinny cases".

Cache edge

For 2nd Level Cache blocking a single cache edge parameter is used.
The high level choose an order to traverse the blocks: ijk, jik, ikj, jki, kij, kji.
These need not be the same order as the product is done within a block.

Typically chosen orders are ijk or jik.
For jik the ideal situation would be to copy A and the NB wide panel of B.
For ijk swap the role of A and B.

Choosing the bigger of M or N for the outer loop reduces the footprint of the copy.
But for large K ATLAS does not even allocate such a large amount of memory.
Instead it defines a parameter, Kp, to give best use of the L2 cache.
Panels are limited to Kp in length.
It first tries to allocate (in the jik case) .
If that fails it tries .
(If that fails it uses the no-copy version of GEMM, but this case is unlikely for reasonable choices of cache edge.)
Kp is a function of cache edge and NB.

LAPACK

When integrating the ATLAS BLAS with LAPACK
LAPACK
-External links:* : a modern replacement for PLAPACK and ScaLAPACK* on Netlib.org* * * : a modern replacement for LAPACK that is MultiGPU ready* on Sourceforge.net* * optimized LAPACK for Solaris OS on SPARC/x86/x64 and Linux* * *...

 an important consideration is the choice of blocking factor for LAPACK. If the ATLAS blocking factor is small enough the blocking factor of LAPACK could be set to match that of ATLAS.

To take advantage of recursive factorization, ATLAS provides replacement routines for some LAPACK routines. These simply overwrite the corresponding LAPACK routines from Netlib
Netlib
Netlib is a repository of software for scientific computing maintained by AT&T, Bell Laboratories, the University of Tennessee and Oak Ridge National Laboratory. Netlib comprises a large number of separate programs and libraries...

.

Need for installation

Installing ATLAS on a particular platform is a challenging process which is typically done by a system vendor or a local expert and made available to a wider audience.

For many systems, architectural default parameters are available; these are essentially saved searches plus the results of hand tuning.
If the arch defaults work they will likely get 10-15% better performance than the install search. On such systems the installation process is greatly simplified.

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK