Remote Direct Memory Access
In computing, remote direct memory access (RDMA) is a direct memory access from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters.
RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work by CPUs, caches, or context switches, and they continue in parallel with other system operations. When an application performs an RDMA read or write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer.
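On Linux this data path is typically driven through a verbs-style API such as libibverbs. The sketch below is pseudocode, not a complete program: connection setup, queue-pair state transitions, and error handling are omitted, and the remote address and rkey are assumed to have been exchanged out of band beforehand.

```
/* Pseudocode sketch of a one-sided RDMA Write, using libibverbs names */
pd = ibv_alloc_pd(context)                       /* protection domain       */
mr = ibv_reg_mr(pd, buf, len,                    /* register (pin) buffer   */
                IBV_ACCESS_LOCAL_WRITE)
cq = ibv_create_cq(context, depth)               /* completion queue        */
qp = ibv_create_qp(pd, {send_cq: cq, ...})       /* queue pair; connect it  */

wr.opcode              = IBV_WR_RDMA_WRITE       /* one-sided write         */
wr.sg_list             = {addr: buf, length: len, lkey: mr->lkey}
wr.wr.rdma.remote_addr = remote_addr             /* learned out of band     */
wr.wr.rdma.rkey        = remote_rkey

ibv_post_send(qp, &wr, &bad_wr)                  /* hand off to the NIC     */
while (ibv_poll_cq(cq, 1, &wc) == 0) { }         /* local completion only;  */
                                                 /* the target is NOT told  */
```

Note that the completion reported here is on the initiator's side only, which leads directly to the notification problem discussed next.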
This strategy presents several problems related to the fact that the target node is not notified of the completion of the request (one-sided communication). The common way to notify it is to change a memory byte when the data has been delivered, but this requires the target to poll on that byte. Not only does this polling consume CPU cycles, but the memory footprint and the latency increase linearly with the number of possible peer nodes, which limits the use of RDMA in high-performance computing (HPC).
RDMA reduces network protocol overhead, leading to improvements in communication latency. Reductions in protocol overhead can increase a network's ability to move data quickly, allowing applications to get the data they need faster and in turn leading to more scalable clusters. However, there is a tradeoff between this reduction in network protocol overhead and the additional overhead that pinning virtual memory pages incurs on each node. In particular, zero-copy RDMA protocols require that the memory pages involved in a transaction be pinned, at least for the duration of the transfer. Otherwise, RDMA pages might be paged out to disk and replaced with other data by the operating system, causing the DMA engine (which knows nothing of the virtual memory system maintained by the operating system) to send the wrong data. The net result of not pinning the pages in a zero-copy RDMA system can be corruption of the contents of memory in the distributed system. Pinning memory takes time and additional memory to set up, reduces the quantity of memory the operating system can allocate to processes, limits the overall flexibility of the memory system to adapt over time, and can even lead to underutilization of memory if processes unnecessarily pin pages. The net result is the introduction of latency, sometimes in linear proportion to the number of pages of data pinned in memory. To mitigate these problems, several techniques for interfacing with RDMA devices were developed:
- using caching techniques to keep data pinned as long as possible, producing overhead reductions for applications that repeatedly communicate with the same memory area
- pipelining memory pinning operations and data transfer (as done on InfiniBand or Myrinet)
- deferring memory pinning out of the critical path, thus hiding the latency increase
- entirely removing the need for pinning (as Quadrics does)
In contrast, the Send/Recv model used by other zero-copy HPC interconnects, such as Myrinet or Quadrics, does not have the one-sided communication problem or the memory paging problem described above, while providing comparable reductions in latency when used in conjunction with HPC communication frameworks that expose the Send/Recv model to the programmer (such as MPI).
Much like other HPC interconnects, RDMA's acceptance is currently limited by the need to install a different networking infrastructure. However, new standards enable Ethernet RDMA implementation at the physical layer, using TCP/IP as the transport, combining the performance and latency advantages of RDMA with a low-cost, standards-based solution. The RDMA Consortium and the DAT Collaborative have played key roles in the development of RDMA protocols and APIs for consideration by standards groups such as the Internet Engineering Task Force and the Interconnect Software Consortium. Software vendors such as Red Hat and Oracle Corporation support these APIs in their latest products, and network adapters that implement RDMA over Ethernet are being developed. Both Red Hat Enterprise Linux and Red Hat Enterprise MRG support RDMA.
Common RDMA implementations include the Virtual Interface Architecture, InfiniBand, and iWARP.
External links
- RDMA Consortium
- InfiniBand Trade Association
- A Tutorial of the RDMA Model
- RDMA usage
- A Critique of RDMA for High-Performance Computing