Hardware performance counter
Encyclopedia
In computer
s, hardware performance counters, or hardware counters are a set of special-purpose register
s built into modern microprocessor
s to store the counts of hardware-related activities within computer systems. Advanced users often rely on those counters to conduct low-level performance analysis
or tuning
.
The number of available hardware counters in a processor is limited while each cpu model might have a lot of different events that a developer might like to measure. Each counter can be programmed with the index of an event type to be monitored, like a L1 cache miss or a branch mispredict.
One of the first processors to implement such counters was the Intel Pentium
, but they were not documented until Terje Mathisen wrote an article about reverse engineering them in BYTE
July 1994:
The following table shows some examples of cpus and the number of available hardware counters:
Compared to software profilers, hardware counters provide low-overhead access to a wealth of detailed performance information related to CPU's functional units, caches and main memory etc. Another benefit of using them is that no source code modifications are needed in general. However, the types and meanings of hardware counters vary from one kind of architecture to another due to the variation in hardware organizations.
There can be difficulties correlating the low level performance metrics back to source code. The limited number of registers to store the counters often force users to conduct multiple measurements to collect all desired performance metrics. Moreover, modern superscalar
processors schedule and execute multiple instructions at one time. These "in-flight" instructions can retire at any time, depending on memory access, hits in cache, stalls in the pipeline and many other factors. This can cause performance counter events to be attributed to the wrong instructions, making precise performance analysis difficult or impossible.
Newer processors have introduced methods to mitigate some of these drawbacks. For example, the AMD Opteron processors have implemented in 2007 a technique known as Instruction Based Sampling (or IBS). AMD's implementation of IBS provides hardware counters for both fetch sampling (the front of the superscalar pipeline) and op sampling (the back of the pipeline). This results in discrete performance data associating retired instructions with the "parent" AMD64 instruction.
Computer
A computer is a programmable machine designed to sequentially and automatically carry out a sequence of arithmetic or logical operations. The particular sequence of operations can be changed readily, allowing the computer to solve more than one kind of problem...
s, hardware performance counters, or hardware counters are a set of special-purpose register
Processor register
In computer architecture, a processor register is a small amount of storage available as part of a CPU or other digital processor. Such registers are addressed by mechanisms other than main memory and can be accessed more quickly...
s built into modern microprocessor
Microprocessor
A microprocessor incorporates the functions of a computer's central processing unit on a single integrated circuit, or at most a few integrated circuits. It is a multipurpose, programmable device that accepts digital data as input, processes it according to instructions stored in its memory, and...
s to store the counts of hardware-related activities within computer systems. Advanced users often rely on those counters to conduct low-level performance analysis
Performance analysis
In software engineering, profiling is a form of dynamic program analysis that measures, for example, the usage of memory, the usage of particular instructions, or frequency and duration of function calls...
or tuning
Performance tuning
Performance tuning is the improvement of system performance. This is typically a computer application, but the same methods can be applied to economic markets, bureaucracies or other complex systems. The motivation for such activity is called a performance problem, which can be real or anticipated....
.
The number of available hardware counters in a processor is limited while each cpu model might have a lot of different events that a developer might like to measure. Each counter can be programmed with the index of an event type to be monitored, like a L1 cache miss or a branch mispredict.
One of the first processors to implement such counters was the Intel Pentium
Pentium
The original Pentium microprocessor was introduced on March 22, 1993. Its microarchitecture, deemed P5, was Intel's fifth-generation and first superscalar x86 microarchitecture. As a direct extension of the 80486 architecture, it included dual integer pipelines, a faster FPU, wider data bus,...
, but they were not documented until Terje Mathisen wrote an article about reverse engineering them in BYTE
Byte
The byte is a unit of digital information in computing and telecommunications that most commonly consists of eight bits. Historically, a byte was the number of bits used to encode a single character of text in a computer and for this reason it is the basic addressable element in many computer...
July 1994:
The following table shows some examples of cpus and the number of available hardware counters:
Processor | available HW counters |
---|---|
UltraSparc SPARC SPARC is a RISC instruction set architecture developed by Sun Microsystems and introduced in mid-1987.... II |
2 |
Pentium III Pentium III The Pentium III brand refers to Intel's 32-bit x86 desktop and mobile microprocessors based on the sixth-generation P6 microarchitecture introduced on February 26, 1999. The brand's initial processors were very similar to the earlier Pentium II-branded microprocessors... |
2 |
AMD Athlon Athlon Athlon is the brand name applied to a series of x86-compatible microprocessors designed and manufactured by Advanced Micro Devices . The original Athlon was the first seventh-generation x86 processor and, in a first, retained the initial performance lead it had over Intel's competing processors... |
4 |
IA-64 | 4 |
POWER4 POWER4 The POWER4 is a microprocessor developed by International Business Machines that implemented the 64-bit PowerPC and PowerPC AS instruction set architectures. Released in 2001, the POWER4 succeeded the POWER3 and RS64 microprocessors, and was used in RS/6000 and AS/400 computers, ending a separate... |
8 |
Pentium 4 Pentium 4 Pentium 4 was a line of single-core desktop and laptop central processing units , introduced by Intel on November 20, 2000 and shipped through August 8, 2008. They had a 7th-generation x86 microarchitecture, called NetBurst, which was the company's first all-new design since the introduction of the... |
18 |
Compared to software profilers, hardware counters provide low-overhead access to a wealth of detailed performance information related to CPU's functional units, caches and main memory etc. Another benefit of using them is that no source code modifications are needed in general. However, the types and meanings of hardware counters vary from one kind of architecture to another due to the variation in hardware organizations.
There can be difficulties correlating the low level performance metrics back to source code. The limited number of registers to store the counters often force users to conduct multiple measurements to collect all desired performance metrics. Moreover, modern superscalar
Superscalar
A superscalar CPU architecture implements a form of parallelism called instruction level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate...
processors schedule and execute multiple instructions at one time. These "in-flight" instructions can retire at any time, depending on memory access, hits in cache, stalls in the pipeline and many other factors. This can cause performance counter events to be attributed to the wrong instructions, making precise performance analysis difficult or impossible.
Newer processors have introduced methods to mitigate some of these drawbacks. For example, the AMD Opteron processors have implemented in 2007 a technique known as Instruction Based Sampling (or IBS). AMD's implementation of IBS provides hardware counters for both fetch sampling (the front of the superscalar pipeline) and op sampling (the back of the pipeline). This results in discrete performance data associating retired instructions with the "parent" AMD64 instruction.