Introduction to Remote Direct Memory Access (RDMA)
What is RDMA?
Direct memory access (DMA) is the ability of a device to access host memory directly, without the intervention of the CPU(s).
RDMA (Remote DMA) is the ability to access (i.e. read from or write to) memory on a remote machine without interrupting the processing of the CPU(s) on that system.
So, why is this so good?
Using RDMA has the following major advantages:
- Zero-copy - applications can perform data transfers without involving the network software stack, and data is sent and received directly to and from the application buffers without being copied between the network layers.
- Kernel bypass - applications can perform data transfer directly from userspace without the need to perform context switches.
- No CPU involvement - applications can access remote memory without consuming any CPU in the remote machine. The remote machine's memory will be read without any intervention of the remote process (or processor). The caches in the remote CPU(s) won't be filled with the accessed memory content.
- Message based transactions - the data is handled as discrete messages and not as a stream, which eliminates the need for the application to separate the stream into different messages/transactions.
- Scatter/gather entries support - RDMA natively supports working with multiple scatter/gather entries, i.e. reading multiple memory buffers and sending them as one stream, or getting one stream and writing it to multiple memory buffers.
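The scatter/gather idea from the last bullet can be illustrated with a small toy model. This is only a sketch in plain Python, not the verbs API: the function names `gather` and `scatter` are hypothetical, and on a real RDMA device this work is done by the hardware from a list of scatter/gather entries rather than by application code.

```python
def gather(buffers):
    """Concatenate several source buffers into one outgoing message."""
    return b"".join(bytes(buf) for buf in buffers)


def scatter(message, buffers):
    """Write one incoming message into several destination buffers, in order.

    Returns the number of bytes written. Raises ValueError if the message
    does not fit into the destination buffers (toy behaviour, chosen only
    for this illustration).
    """
    offset = 0
    for buf in buffers:
        if offset >= len(message):
            break
        chunk = message[offset:offset + len(buf)]
        buf[:len(chunk)] = chunk  # fill this buffer from its start
        offset += len(chunk)
    if offset < len(message):
        raise ValueError("message is larger than the receive buffers")
    return offset


# Sender side: two separate buffers leave as one message.
message = gather([bytearray(b"HDR:"), bytearray(b"hello world")])

# Receiver side: the single message lands in two separate buffers.
recv_header = bytearray(4)
recv_payload = bytearray(16)
written = scatter(message, [recv_header, recv_payload])
```

In this sketch the receive buffers may be larger than the message; the unused tail of the last buffer is simply left untouched.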
Where can I find RDMA?
You can find RDMA in industries that need at least one of the following:
- Low latency - For example: HPC, financial services, web 2.0
- High Bandwidth - For example: HPC, medical appliances, storage and backup systems, cloud computing
- Small CPU footprint - For example: HPC, cloud computing
And in many other industries...
Which network protocols support RDMA?
Today, there are several network protocols which support RDMA:
- InfiniBand (IB) - a new generation network protocol which has supported RDMA natively from the beginning. Since this is a new network technology, it requires NICs and switches which support this technology.
- RDMA Over Converged Ethernet (RoCE) - a network protocol which allows performing RDMA over an Ethernet network. Its lower network headers are Ethernet headers and its upper network headers (including the data) are InfiniBand headers. This allows using RDMA over standard Ethernet infrastructure (switches). Only the NICs should be special and support RoCE.
- Internet Wide Area RDMA Protocol (iWARP) - a network protocol which allows performing RDMA over TCP. There are features that exist in IB and RoCE and aren't supported in iWARP. This allows using RDMA over standard Ethernet infrastructure (switches). Only the NICs should be special and support iWARP (if CPU offloads are used); otherwise, the whole iWARP stack can be implemented in SW, losing most of the RDMA performance advantages.
Does it mean that I need to learn several programming APIs?
No. Luckily, the same API (i.e. verbs) can be used for all the above-mentioned RDMA enabled network protocols. In *nix it is libibverbs and kernel verbs and in Windows it is Network Direct (ND).
Are those network protocols interoperable?
Since those are different network protocols, their packets are completely different and they cannot send/receive messages to each other directly, without a router/gateway between them. However, the same code can support all of them. Since all those network protocols support libibverbs, the same binary can be used without even the need to recompile the source code.
Do I need to download special packages to use RDMA or is it part of the Operating System?
For several Operating Systems, RDMA support is embedded within the kernel. For example, Linux supports RDMA natively, and all major Linux distributions support it. On other Operating Systems, you may need to download a package (such as OFED) to add RDMA support.
Comments
Hello Dotan,
My question has to do with consistency in the face of concurrent accesses.
Specifically, I’d like to find the answers for the following questions:
1- Machine A submits an atomic operation to machine B’s memory.
Meanwhile, machine B’s CPU modifies the same memory region of its local main
memory (the modification is either in CPU cache or directly to the main memory).
Can these two operations interfere with each other? For example, can they cause
that region in the memory to end up corrupted?
2- Machine A submits an RDMA READ operation to machine B’s memory.
Meanwhile, machine B’s CPU modifies the same memory region (and hence the
changes are still in the cache).
Is the remote RDMA READ coming from machine A able to see the changes made
by B’s local CPU?
In other words, does RDMA READ flush out the remote CPU cache before
reading the remote memory region? If not, this might result in some dirty reads.
3- Do multiple operations from multiple HCA cards on a single machine interfere
with each other?
For example, machine A has 2 HCA cards, and it receives two concurrent atomic
operations, one from each of its HCA cards. Can they interfere with each other?
Thank you for your time in advance.
Cheers,
Erfan
Hi Erfan.
1) The answer depends on the supported atomicity level of the device: unless global atomicity is supported - yes, they may interfere and you may get corrupted data.
2) Yes. The RDMA Read should return the actual memory content; however, beware of races if you don't sync between the two sides.
3) In general: the answer depends on the supported atomicity level of the device, unless global atomicity is supported - they shouldn't interfere with each other. However, it is implementation specific and it is hard for me to give a concrete answer on this.
:)
Thanks
Dotan
Once you do Memory Registration, the memory is pinned and it cannot be updated by any process/cache.
Hi.
Yes. Once a memory block is pinned, its pages are assigned to the process that it was pinned to.
Thanks
Dotan
Hi Dotan,
From my understanding, RDMA is not strictly "no CPU involvement", as the direct memory access is achieved via DMA by the remote NIC, and DMA needs to interrupt the CPU once it is done. Also, in general, DMA is cache coherent, so the CPU caches will be invalidated/flushed due to some RDMA operations. Is that correct?
Thanks.
Skyler
Hi Skyler.
I don't claim to be an expert,
but all the data transfer doesn't involve the CPU caches
(since a relatively small message may overwrite all the data in the caches).
The data (for example, in an RDMA write) is written directly to the memory.
But since DMA should be cache coherent, the memory controller invalidates/flushes the memory which exists in the cache
(otherwise, old data may be read from the memory).
So, I think that your description is very accurate.
Thanks
Dotan
Thanks a lot for your reply. It's a really excellent blog to learn about RDMA.
:)
Thanks for providing all the RDMA related information. I have a question for you: can I use RoCE to transfer data between two machines connected over WAN? I am planning to use SoftRoCE for this.
Hi.
I think that it will work.
However, I don't know what the performance will be.
It would be great if you could update me on this experience
:)
Thanks
Dotan
Thanks for the RDMA related information. I need some clarification: does RDMA need special interfaces, or can I do it over an Ethernet connection between two systems?
Hi.
RDMA requires special HW (i.e. RDMA NIC) or SW support.
If your NIC supports RoCE or iWARP, you can use RDMA.
Another option is to use SW based RDMA, for example: Soft RoCE,
which will allow you to use RDMA over ANY NIC.
Thanks
Dotan
Hi, Dotan.
I need to implement gather-scatter with a small tweak.
On the sender I'm using RDMA write with imm and supply sge_list to the wr.
On the receiver I want to take the received contiguous memory and scatter it according to some sge_list.
From my understanding I need to issue ibv_post_recv() and specify the localhost as the remote_peer.
Do I need to allocate different MRs for the contiguous buffers and the buffers listed in the sge_list?
Hi.
You can't do what you want with RDMA Write with immediate;
however, if you change the opcode from RDMA Write to Send - it will work...
And yes, all the buffers that you'll use in the S/G list should be registered
(unless you use the "INLINE" option for small buffers).
Thanks
Dotan
Hi, when I receive an address and key from the remote side and try to read data using that address and key, why does it fail? The status is 9.
Hi.
The Work Completion status 9 means: IBV_WC_REM_INV_REQ_ERR.
Please check that remote write/read (depending on what you were trying to do)
is enabled in both the responder's Queue Pair and Memory Region.
Thanks
Dotan
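For readers decoding a numeric Work Completion status like the one above, a small lookup table can help. This is a sketch, not part of libibverbs: the dictionary `WC_STATUS_NAMES` and the helper `wc_status_name` are hypothetical names, and the table covers only a subset of the values, listed in the order in which they appear in `enum ibv_wc_status` in `<infiniband/verbs.h>`. Please verify them against the header installed on your system; in C code, `ibv_wc_status_str()` does this conversion for you.

```python
# Subset of `enum ibv_wc_status`, numbered by position in the enum.
# Assumption: the order matches the installed <infiniband/verbs.h>.
WC_STATUS_NAMES = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}


def wc_status_name(status):
    """Return the symbolic name for a numeric status, if it is in the table."""
    return WC_STATUS_NAMES.get(status, f"unknown status {status}")


# The status from the question above.
name = wc_status_name(9)
```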
Hi, Dotan.
I have an 82599ES 10-Gigabit SFI/SFP+. Can it be used for RDMA?
Hi.
I'm not an expert on Intel products;
but from the card specifications I have a strong feeling that this device doesn't support RDMA.
Thanks
Dotan
Hi Dotan,
I am trying to understand how the flow of packets actually happens from an HCA (of Machine A) to the memory of Machine B without the intervention of the CPU and the Operating System. Can you please explain this part?
Thanks,
Manjusha
Hi.
In a nutshell:
The control operations configure the RDMA device with all the needed attributes for the connection (and the used resources);
in the data path, there is a descriptor that describes which data to send, how to send it and (in Unreliable Datagram) the destination.
The RDMA device has a DMA engine that fetches the descriptors and the data,
and sends the data directly to the wire, without the need for the CPU to process anything
(the network stack isn't involved at all).
The same happens at the responder side as well.
I hope that this helped
:)
Thanks
Dotan
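The descriptor-driven data path described in the reply above can be pictured with a toy model. This is only a sketch in plain Python under loud assumptions: `WorkRequest`, `ToyDevice` and their fields are hypothetical names invented for this illustration, the two bytearrays stand in for the memory of two machines, and `process()` plays the role of the device's DMA engine. It is not how any real device or the verbs API is implemented.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WorkRequest:
    """Hypothetical descriptor: which local bytes go to which remote offset."""
    wr_id: int
    local_offset: int
    length: int
    remote_offset: int


class ToyDevice:
    """Toy stand-in for an RDMA device performing RDMA Write-like copies."""

    def __init__(self, local_memory, remote_memory):
        self.local = local_memory
        self.remote = remote_memory
        self.send_queue = deque()
        self.completion_queue = deque()

    def post_send(self, wr):
        # Data path: the application only queues a descriptor, no data is copied.
        self.send_queue.append(wr)

    def process(self):
        # "DMA engine": fetch each descriptor, then move the data it describes.
        while self.send_queue:
            wr = self.send_queue.popleft()
            end = wr.remote_offset + wr.length
            if wr.local_offset + wr.length > len(self.local) or end > len(self.remote):
                self.completion_queue.append((wr.wr_id, "error"))
                continue
            data = self.local[wr.local_offset:wr.local_offset + wr.length]
            self.remote[wr.remote_offset:end] = data
            self.completion_queue.append((wr.wr_id, "success"))

    def poll_cq(self):
        # The application learns about finished work by polling completions.
        return self.completion_queue.popleft() if self.completion_queue else None


machine_a = bytearray(b"payload from A..")
machine_b = bytearray(16)
device = ToyDevice(machine_a, machine_b)
device.post_send(WorkRequest(wr_id=1, local_offset=0, length=7, remote_offset=4))
device.process()
completion = device.poll_cq()
```

The point of the sketch is the split of roles: the application posts a descriptor and later polls for a completion, while the copy itself happens elsewhere.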
When you say "without the need for the CPU", are you talking about the kernel's need for the CPU?
Yes.
No need to copy the data or handle reliability (checksum and retransmission);
all handled by the RDMA device.
Thanks
Dotan
Hi Dotan,
Since Keys (R-Key) are transmitted over the wire during memory registration, anybody can see them, right? And anybody can sniff the line and get the key info. Then how are the memory accesses protected from man-in-the-middle attacks?
Hi.
Some words of ethics: I'm not a security expert, and security is a big deal nowadays.
If RDMA (and the R-key) weren't relevant, and you used only Send operations,
would you still have a man-in-the-middle attack?
Thanks
Dotan