Benchmark (computing)
This article is about the use of benchmarks in computing. For other uses, see Benchmark (disambiguation).
In computing, a benchmark is the act of running a computer program, a set of programs, or other operations in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. The term 'benchmark' is also commonly applied to the elaborately designed benchmarking programs themselves.

Benchmarking is usually associated with assessing performance characteristics of computer hardware, for example, the floating point operation performance of a CPU, but there are circumstances when the technique is also applicable to software. Software benchmarks are, for example, run against compilers or database management systems.

Other types of test programs, namely test suites or validation suites, are intended to assess the correctness of software.

Benchmarks provide a method of comparing the performance of various subsystems across different chip/system architectures.

Purpose

As computer architecture advanced, it became more difficult to compare the performance of various computer systems simply by looking at their specifications. Therefore, tests were developed that allowed comparison of different architectures. For example, Pentium 4 processors generally operate at a higher clock frequency than Athlon XP processors, which does not necessarily translate to more computational power. A slower processor, with regard to clock frequency, can perform as well as a processor operating at a higher frequency. See BogoMips and the megahertz myth.

Benchmarks are designed to mimic a particular type of workload on a component or system. Synthetic benchmarks do this by specially created programs that impose the workload on the component. Application benchmarks run real-world programs on the system. Whilst application benchmarks usually give a much better measure of real-world performance on a given system, synthetic benchmarks are useful for testing individual components, like a hard disk or networking device.
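As an illustration, a minimal synthetic benchmark for one component, sequential disk write throughput, could be sketched as follows. This is a simplified sketch; the function name and parameters are invented for illustration, and a real disk benchmark would also need to account for filesystem caching and varying block sizes.

```python
import os
import tempfile
import time


def disk_write_benchmark(total_mb=64, block_kb=256):
    """Synthetic benchmark sketch: sequential write throughput in MB/s.

    Writes total_mb megabytes in block_kb-sized blocks to a temporary
    file and times the writes, including a final fsync so the data
    actually reaches the device rather than only the page cache.
    """
    block = b"\0" * (block_kb * 1024)
    blocks = (total_mb * 1024) // block_kb
    with tempfile.NamedTemporaryFile(delete=False) as f:
        path = f.name
        start = time.perf_counter()
        for _ in range(blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())          # force data to stable storage
        elapsed = time.perf_counter() - start
    os.unlink(path)                   # clean up the temporary file
    return total_mb / elapsed         # throughput in MB/s
```

Because such a program exercises only one component, its result says little about whole-system performance, which is exactly the trade-off between synthetic and application benchmarks described above.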

Benchmarks are particularly important in CPU design, giving processor architects the ability to measure and make tradeoffs in microarchitectural decisions. For example, if a benchmark extracts the key algorithms of an application, it will contain the performance-sensitive aspects of that application. Running this much smaller snippet on a cycle-accurate simulator can give clues on how to improve performance.

Prior to 2000, computer and microprocessor architects used SPEC benchmarks to do this, although SPEC's Unix-based benchmarks were quite lengthy and thus unwieldy to use intact.

Computer manufacturers are known to configure their systems to give unrealistically high performance on benchmark tests that are not replicated in real usage. For instance, during the 1980s some compilers could detect a specific mathematical operation used in a well-known floating-point benchmark and replace the operation with a faster mathematically equivalent operation. However, such a transformation was rarely useful outside the benchmark until the mid-1990s, when RISC and VLIW architectures emphasized the importance of compiler technology as it related to performance. Benchmarks are now regularly used by compiler companies to improve not only their own benchmark scores, but real application performance.

CPUs that have many execution units — such as a superscalar CPU, a VLIW CPU, or a reconfigurable computing CPU — typically have slower clock rates than a sequential CPU with one or two execution units when built from transistors that are just as fast. Nevertheless, CPUs with many execution units often complete real-world and benchmark tasks in less time than the supposedly faster high-clock-rate CPU.

Given the large number of benchmarks available, a manufacturer can usually find at least one benchmark that shows its system will outperform another system; the other systems can be shown to excel with a different benchmark.

Manufacturers commonly report only those benchmarks (or aspects of benchmarks) that show their products in the best light. They have also been known to misrepresent the significance of benchmarks, again to show their products in the best possible light. Taken together, these practices are called bench-marketing.

Ideally, benchmarks should substitute for real applications only if the application is unavailable, or too difficult or costly to port to a specific processor or computer system. If performance is critical, the only benchmark that matters is the target environment's application suite.

Challenges

Benchmarking is not easy and often involves several iterative rounds in order to arrive at predictable, useful conclusions. Interpretation of benchmarking data is also extraordinarily difficult. Here is a partial list of common challenges:
  • Vendors tend to tune their products specifically for industry-standard benchmarks. Norton SysInfo (SI) is particularly easy to tune for, since it is mainly biased toward the speed of multiple operations. Use extreme caution in interpreting such results.
  • Some vendors have been accused of "cheating" at benchmarks — doing things that give much higher benchmark numbers, but make things worse on the actual likely workload.
  • Many benchmarks focus entirely on the speed of computational performance, neglecting other important features of a computer system, such as:
    • Qualities of service, aside from raw performance. Examples of unmeasured qualities of service include security, availability, reliability, execution integrity, serviceability, scalability (especially the ability to quickly and nondisruptively add or reallocate capacity), etc. There are often real trade-offs between and among these qualities of service, and all are important in business computing. Transaction Processing Performance Council benchmark specifications partially address these concerns by specifying ACID property tests, database scalability rules, and service level requirements.
    • In general, benchmarks do not measure total cost of ownership. Transaction Processing Performance Council benchmark specifications partially address this concern by specifying that a price/performance metric must be reported in addition to a raw performance metric, using a simplified TCO formula. However, the costs are necessarily only partial, and vendors have been known to price specifically (and only) for the benchmark, designing a highly specific "benchmark special" configuration with an artificially low price. Even a tiny deviation from the benchmark package results in a much higher price in real-world experience.
    • Facilities burden (space, power, and cooling). When more power is used, a portable system will have a shorter battery life and require recharging more often. A server that consumes more power and/or space may not be able to fit within existing data center resource constraints, including cooling limitations. There are real trade-offs, as most semiconductors require more power to switch faster. See also performance per watt.
    • In some embedded systems, where memory is a significant cost, better code density can significantly reduce costs.

  • Vendor benchmarks tend to ignore requirements for development, test, and disaster recovery computing capacity. Vendors tend to report only what might be narrowly required for production capacity in order to make their initial acquisition price seem as low as possible.
  • Benchmarks are having trouble adapting to widely distributed servers, particularly those with extra sensitivity to network topologies. The emergence of grid computing, in particular, complicates benchmarking since some workloads are "grid friendly", while others are not.
  • Users can have very different perceptions of performance than benchmarks may suggest. In particular, users appreciate predictability — servers that always meet or exceed service level agreements. Benchmarks tend to emphasize mean scores (IT perspective), rather than maximum worst-case response times (real-time computing perspective), or low standard deviations (user perspective).
  • Many server architectures degrade dramatically at high (near 100%) levels of usage — "fall off a cliff" — and benchmarks should (but often do not) take that factor into account. Vendors, in particular, tend to publish server benchmarks at continuous usage of about 80% — an unrealistic situation — and do not document what happens to the overall system when demand spikes beyond that level.
  • Many benchmarks focus on one application, or even one application tier, to the exclusion of other applications. Most data centers are now implementing virtualization extensively for a variety of reasons, and benchmarking is still catching up to that reality, where multiple applications and application tiers are concurrently running on consolidated servers.
  • There are few (if any) high quality benchmarks that help measure the performance of batch computing, especially high volume concurrent batch and online computing. Batch computing tends to be much more focused on the predictability of completing long-running tasks correctly before deadlines, such as end of month or end of fiscal year. Many important core business processes are batch-oriented and probably always will be, such as billing.
  • Benchmarking institutions often disregard or do not follow the basic scientific method. This includes, but is not limited to: small sample size, lack of variable control, and the limited repeatability of results.
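The distinction between mean scores, worst-case response times, and predictability noted above can be made concrete with a small sketch. The function name and sample data below are invented for illustration; the point is that a handful of latency spikes can leave the mean looking healthy while violating a worst-case service level agreement.

```python
import statistics


def latency_report(samples_ms):
    """Summarize response times from three perspectives:
    mean (IT perspective), maximum worst case (real-time perspective),
    and standard deviation (user predictability perspective)."""
    s = sorted(samples_ms)
    return {
        "mean": statistics.mean(s),
        "max": s[-1],
        "stdev": statistics.stdev(s),
    }


# Hypothetical response times in milliseconds: mostly fast, one spike.
samples = [12, 11, 13, 12, 240, 11, 12, 13, 12, 11]
report = latency_report(samples)
# The mean (34.7 ms) looks acceptable, but the worst case (240 ms)
# would violate a hypothetical 100 ms service level agreement, and the
# large standard deviation signals poor predictability to users.
```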

Types of benchmarks

  1. Real program
    • word processing software
    • CAD tool software
    • user's application software (e.g., MIS)
  2. Microbenchmark
    • Designed to measure the performance of a very small and specific piece of code.
  3. Kernel
    • contains key code
    • normally abstracted from an actual program
    • popular kernel: Livermore loops
    • LINPACK benchmark (contains basic linear algebra subroutines written in FORTRAN)
    • results are represented in MFLOPS
  4. Component Benchmark/ micro-benchmark
    • programs designed to measure performance of a computer's basic components
    • automatic detection of computer's hardware parameters like number of registers, cache size, memory latency
  5. Synthetic Benchmark
    • Procedure for programming a synthetic benchmark:
      • take statistics of all types of operations from many application programs
      • get the proportion of each operation
      • write a program based on the proportions above
    • Types of synthetic benchmarks are:
      • Whetstone
      • Dhrystone
    • These were the first general purpose industry standard computer benchmarks. They do not necessarily obtain high scores on modern pipelined computers.
  6. I/O benchmarks
  7. Database benchmarks: measure the throughput and response times of database management systems (DBMSs)
  8. Parallel benchmarks: used on machines with multiple processors or systems consisting of multiple machines
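A microbenchmark, as described in item 2 above, can be sketched with the standard library's `timeit` module. The helper name below is invented for illustration; the key idea is to repeat a small code fragment many times, take the best of several repeats to reduce timer and scheduling noise, and report per-call time.

```python
import timeit


def microbenchmark(stmt, setup="pass", repeat=5, number=20_000):
    """Return the best per-call time in seconds for a small code snippet.

    Runs stmt number times per trial, repeats the trial several times,
    and takes the minimum, which is the least noisy estimate.
    """
    times = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(times) / number


# Compare two ways of building a list of squares.
t_comp = microbenchmark("[i * i for i in range(100)]")
t_map = microbenchmark("list(map(lambda i: i * i, range(100)))")
```

Note the caveat from the Challenges section above: a result like this says something about one tiny piece of code on one machine, not about whole-system performance.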

Industry standard (audited and verifiable)

  • Business Applications Performance Corporation (BAPCo)
  • Embedded Microprocessor Benchmark Consortium (EEMBC)
  • Standard Performance Evaluation Corporation (SPEC), in particular their SPECint and SPECfp
  • Transaction Processing Performance Council (TPC)
  • CoreMark: embedded computing standard benchmark

Open source benchmarks

  • DEISA Benchmark Suite: scientific HPC applications benchmark
  • Dhrystone: integer arithmetic performance
  • Fhourstones: an integer benchmark
  • HINT: ranks a computer system as a whole
  • Iometer: I/O subsystem measurement and characterization tool for single and clustered systems
  • LINPACK, traditionally used to measure FLOPS
  • LAPACK
  • Livermore loops
  • NAS parallel benchmarks
  • NBench: synthetic benchmark suite measuring performance of integer arithmetic, memory operations, and floating-point arithmetic
  • PAL: a benchmark for real-time physics engines
  • Phoronix Test Suite: open-source cross-platform benchmarking suite for Linux, OpenSolaris, FreeBSD, OS X, and Windows; it includes a number of the other benchmarks listed on this page to simplify execution
  • POV-Ray: 3D rendering
  • Tak (function): a simple benchmark used to test recursion performance
  • TATP Benchmark: Telecommunication Application Transaction Processing benchmark
  • TPoX: an XML transaction processing benchmark for XML databases
  • Whetstone: floating-point arithmetic performance

Microsoft Windows benchmarks

  • BAPCo: MobileMark, SYSmark, WebMark
  • Futuremark: 3DMark, PCMark
  • Whetstone
  • WorldBench
  • PiFast
  • SuperPrime
  • Super PI
  • Windows System Assessment Tool, included with Microsoft Windows Vista and later operating systems, providing an index for consumers to rate their systems easily

Others

  • BRL-CAD
  • Khornerstone
  • iCOMP, the Intel comparative microprocessor performance index, published by Intel
  • Performance Rating, a modeling scheme used by AMD and Cyrix to reflect relative performance, usually compared to competing products
  • VMmark, a virtualization benchmark suite
  • SunSpider, a browser speed test
  • BreakingPoint Systems, modeling and simulation of network application traffic for benchmarking servers and network equipment, a benchmark for testing massively parallel computer systems under simultaneously heavy network, memory, and CPU loads

See also

  • Benchmarking (business perspective)
  • Test suite, a collection of test cases intended to show that a software program has some specified set of behaviors
  • Figure of merit

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.