Acumem SlowSpotter - AbsoluteAstronomy.com

Acumem SlowSpotter and Acumem ThreadSpotter are tools from Acumem, which diagnose performance problems related to data locality

Locality of reference

In computer science, locality of reference, also known as the principle of locality, is the phenomenon of the same value or related storage locations being frequently accessed. There are two basic types of reference locality. Temporal locality refers to the reuse of specific data and/or resources...

, cache utilization

CPU cache

A CPU cache is a cache used by the central processing unit of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations...

and thread

Thread (computer science)

In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process...

interactions.

Generally speaking, memory bus

Memory bus

The memory bus is the computer bus which connects the main memory to the memory controller in computer systems. Originally, general-purpose buses like VMEbus and the S-100 bus were used, but to reduce latency, modern memory buses are designed to connect directly to DRAM chips, and thus are...

bandwidth

Bandwidth (computing)

In computer networking and computer science, bandwidth, network bandwidth, data bandwidth, or digital bandwidth is a measure of available or consumed data communication resources expressed in bits/second or multiples of it .Note that in textbooks on wireless communications, modem data transmission,...

has not seen the same improvement as CPU performance (an observation sometimes referred to as the memory wall), and with multi-core and many-core systems, the available bandwidth is shared between all cores. This makes preservation of memory bandwidth one of the most important tasks in achieving top performance.

The Acumem tools finds suboptimal memory access patterns and applies heuristics to categorize these problems to be able to explain the root cause and to suggest ideas for improving the code.

The tools have predictive capabilities, as one captured performance fingerprint can be used to predict performance characteristics on both existing and hypothetical architectures. The what-if analysis also allows exploring the optimal ways to bind threads to cores, all from a single sampling run.

Spatial locality

Spatial locality refers to the desirable property of accessing close memory locations. Poor spatial locality is penalized in a number of ways:

Accessing data very sparsely will have the negative side effect of transferring unused data over the memory bus (since data travels in chunks). This raises the memory bandwidth requirements, and in practice imposes a limit on the application performance and scalability.
Unused data will occupy the caches, reducing the effective cache size. This causes more frequent evictions and more round trips to memory.
Unused data will reduce the likelihood of encountering more than one useful piece of data in a mapped cache line.

Acumem SlowSpotter and Acumem ThreadSpotter identifies the following opportunities for improving Spatial Locality:

Poor Fetch Utilization
Poor Write-back Utilization
Incorrectly nested loops
Spatial blocking (also known as Loop tiling
Loop tiling
Loop tiling, also known as loop blocking, or strip mine and interchange, is a loop optimization used by compilers to make the execution of certain types of loops more efficient.- Overview :...

)
Spatial Loop fusion
Loop fusion
Loop fusion, also called loop jamming, is a compiler optimization, a loop transformation, which replaces multiple loops with a single one.- Example in C : int i, a[100], b[100]; for Loop fusion, also called loop jamming, is a compiler optimization, a loop transformation, which replaces multiple...

Temporal locality

Temporal locality relates to reuse of data. Reusing data while it is still in the cache avoids sustaining memory fetch stalls and generally reduces the memory bus load. Acumem SlowSpotter and Acumem ThreadSpotter classifies these optimization opportunities:

Temporal blocking
Temporal loop fusion

Memory Latency

The times it takes to initiate a memory fetch is referred to as memory latency. During this time, the CPU is stalled. The latency penalty is on the order of 100 clock cycles

Cycles Per Instruction

In computer architecture, cycles per instruction is a term used to describe one aspect of a processor's performance: the number of clock cycles that happen when an instruction is being executed...

.

Caches

CPU cache

were invented to hide such problems, by serving a limited amount of data from a small but fast memory. This is effective if the data set can be coerced to fit in the cache.

A different technique is called prefetching

Instruction prefetch

In computer architecture, instruction prefetch is a technique used in microprocessors to speed up the execution of a program by reducing wait states....

, where data transfer is initiated explicitly or automatically ahead of when the data is needed. Hopefully, the data will have reached the cache by the time it is needed.

Acumem SlowSpotter and Acumem ThreadSpotter identifies problems in this area:

Prefetch instruction too close to the subsequent data use
Prefetch instruction too far from the use (data may be evicted before use)
Unnecessary prefetch instruction (since data is already in the cache)

Furthermore, the tools judge how well the hardware prefetcher is working, by modelling its behavior and by recording cases of:

Irregular memory access pattern

Avoiding cache pollution

If a dataset has a footprint

Memory footprint

Memory footprint refers to the amount of main memory that a program uses or references while running.This includes all sorts of active memory regions like code, static data sections , heap, as well as all the stacks, plus memory required to hold any additional data structures, such as symbol...

larger than the available cache, and there is no practical way to reorganize the access patterns to improve reuse

Code reuse

Code reuse, also called software reuse, is the use of existing software, or software knowledge, to build new software.-Overview:Ad hoc code reuse has been practiced from the earliest days of programming. Programmers have always reused sections of code, templates, functions, and procedures...

, there is no benefit from storing that data in the cache in the first place. Some CPUs have special instructions for bypassing caches, exactly for this purpose.

Acumem SlowSpotter and Acumem ThreadSpotter finds opportunities to:

Replace write instructions with non-temporal write instructions.
Insert non-temporal prefetch hints to avoid mapping the data in higher level caches.

Coherency traffic effectiveness

When there are several caches in a system, they will be kept consistent

Data consistency

Data consistency summarizes the validity, accuracy, usability and integrity of related data between applications and across an IT enterprise. This ensures that each user observes a consistent view of the data, including visible changes made by the user's own transactions and transactions of other...

with each other. The activities to manage the data consistency take some time to carry out. Just like it is important to observe locality properties in how you access memory locations, it is important to pay attention to, and limit the implicit coherence

Cache coherency

In computing, cache coherence refers to the consistency of data stored in local caches of a shared resource.When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data. This is particularly true of CPUs in a multiprocessing system...

traffic.

For instance, in a producer/consumer scenario where the threads use a piece of shared memory to transfer data between themselves, the ownership of this memory will repeatedly shift between the producer's and the consumer's caches. If not all data in a cache line is fully consumed, but the producer revisits the cache line again, then this is a case of a poor communication pattern. It would be better to fill up the entire cache line before handing over it to the consumer.

Acumem ThreadSpotter classifies this as:

Poor communication utilization

Thread interaction

If two threads use their own variables

Variable (programming)

In computer programming, a variable is a symbolic name given to some known or unknown quantity or information, for the purpose of allowing the name to be used independently of the information it represents...

, but if these variables are laid out on the same cache line, the cache line ownership will also move back and forth between the two threads. This condition can usually be fixed by separating the variables to reside in different cache lines.

Acumem ThreadSpotter identifies this as a case of:

False sharing
False sharing
In computer science, false sharing is a performance degrading usage pattern that can arise in systems with distributed, coherent caches at the size of the smallest resource block managed by the caching mechanism...

Cache sharing effects

In CPUs with multiple levels of cache, the cache topology is sometimes asymmetric, in the sense that the communication cost between two caches on the same level is non-uniform.

Acumem ThreadSpotter allows experimenting with different thread/core bindings to compare the combined effect from sharing and coherence. If threads are sharing a cache, then the effective cache size per thread is smaller, but if the two threads share data it may be faster than if they didn't share a cache, as then the coherence traffic would start to be noticeable.

The source of this article is wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.