Comparison of the verbs implementation vs. the specifications
The InfiniBand spec defines several features and verbs that the verbs implementation (i.e. the RDMA stack in the Linux kernel and libibverbs) either didn't implement or implemented in a different way.
In this post I will cover verbs and functionality that were defined in the specifications but are missing from the implementation, or were implemented differently.
Missing functionality
Reliable Datagram (RD)
The RDMA stack in the kernel and libibverbs don't support RD at all: RD isn't a valid transport type when creating a QP, and none of the following verbs, which manage RD-related resources, were implemented:
- Allocate Reliable Datagram Domain
- Deallocate Reliable Datagram Domain
- Create EE Context
- Modify EE Context Attributes
- Query EE Context
- Destroy EE Context
Address Handle (AH)
- Modify Address Handle
- Query Address Handle
The RDMA stack in the kernel supports those verbs, but most low-level drivers don't support them.
libibverbs doesn't support those verbs at all.
Memory Region (MR)
- Reregister Memory Region
Libibverbs has preparations to support this verb. However, there isn't any implementation of it (yet?).
- Register Shared Memory Region
Libibverbs doesn't support this verb at all.
Memory Window (MW)
- Allocate Memory Window
- Query Memory Window
- Bind Memory Window
- Deallocate Memory Window
The RDMA stack in the kernel supports those verbs and some of the low-level drivers support them as well.
Libibverbs has preparations to support these verbs. However, there isn't any implementation of them (yet?).
Changed functionality
Memory Region (MR)
- Query Memory Region
The RDMA stack in the kernel supports this verb, but most low-level drivers don't support it.
libibverbs doesn't support this verb at all. However, the attributes addr and length (that were provided when registering the MR),
and lkey and rkey (that were filled by ibv_reg_mr()), are part of struct ibv_mr, and this replaces the need for this verb. The only attribute that cannot be retrieved from the MR after its creation is its access permissions.
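A minimal sketch of this approach (pd, buf and len are assumed to exist already):

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    if (mr) {
            /* These fields replace the Query Memory Region verb */
            printf("addr=%p length=%zu lkey=0x%x rkey=0x%x\n",
                   mr->addr, mr->length, mr->lkey, mr->rkey);
            /* Note: the access flags passed above cannot be read back from the MR */
    }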
Completion Queue (CQ)
- Query Completion Queue
The RDMA stack in the kernel and libibverbs don't support this verb at all. However, the attribute cqe is part of struct ibv_cq and this replaces the need for this verb.
- Set Completion Event Handler
The RDMA stack in the kernel and libibverbs don't support this verb at all. However, when calling ib_create_cq() in the RDMA stack in the kernel, the client code can specify a CQ event handler.
In libibverbs the client code can create a thread that will call ibv_req_notify_cq(), ibv_get_cq_event() and ibv_ack_cq_events(), and this actually behaves as a Completion Event Handler.
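Here is a minimal sketch of such a thread, assuming the CQ was created with a completion channel (ibv_create_comp_channel()):

    #include <infiniband/verbs.h>

    static void *cq_event_handler(void *arg)
    {
            struct ibv_comp_channel *channel = arg;
            struct ibv_cq *cq;
            void *cq_ctx;

            while (1) {
                    if (ibv_get_cq_event(channel, &cq, &cq_ctx)) /* blocks */
                            break;
                    ibv_ack_cq_events(cq, 1);
                    if (ibv_req_notify_cq(cq, 0)) /* re-arm the CQ */
                            break;
                    /* poll the CQ with ibv_poll_cq() and handle the completions */
            }
            return NULL;
    }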
Asynchronous Event
- Set Asynchronous Event Handler
The RDMA stack in the kernel and libibverbs don't fully support this verb. However, when calling ib_create_cq(), ib_create_srq() or ib_create_qp() in the RDMA stack in the kernel, the client code can specify an Asynchronous Event Handler for those resources. Furthermore, the user can call ib_register_event_handler() to register an event handler for the RDMA device's events.
In libibverbs the client code can create a thread that will call ibv_get_async_event() and ibv_ack_async_event(), and this actually behaves as an Asynchronous Event Handler.
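A minimal sketch of such a thread, assuming ctx is the struct ibv_context * of the opened device:

    #include <infiniband/verbs.h>

    static void *async_event_handler(void *arg)
    {
            struct ibv_context *ctx = arg;
            struct ibv_async_event event;

            while (1) {
                    if (ibv_get_async_event(ctx, &event)) /* blocks */
                            break;
                    /* dispatch according to event.event_type (CQ/QP/SRQ/port/device) */
                    ibv_ack_async_event(&event);
            }
            return NULL;
    }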
eXtended Reliable Connected (XRC)
Annex A14 adds XRC to the IB spec. The following verbs were added:
- Allocate XRC Domain
- Deallocate XRC Domain
- Create XRC Shared Receive Queue
- Query XRC Shared Receive Queue
- Modify XRC Shared Receive Queue
- Destroy XRC Shared Receive Queue
- Create XRC Target Queue Pair
- Query XRC Target Queue Pair
- Modify XRC Target Queue Pair
- Destroy XRC Target Queue Pair
Most of this functionality was added to the RDMA stack in the kernel, either by adding new verbs or by extending the functionality of existing ones (for example: instead of adding a new verb for creating an XRC Shared Receive Queue, ib_create_srq() was extended to support the creation of XRC SRQs as well).
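For example, a rough sketch of creating an XRC SRQ with the extended kernel verb (the field names follow my reading of the kernel API of that era; pd, cq and xrcd are assumed to be already allocated):

    #include <rdma/ib_verbs.h>

    struct ib_srq_init_attr attr = {
            .attr = {
                    .max_wr  = 128,
                    .max_sge = 1,
            },
            .srq_type = IB_SRQT_XRC,
            .ext.xrc  = {
                    .xrcd = xrcd,
                    .cq   = cq,
            },
    };
    struct ib_srq *srq = ib_create_srq(pd, &attr);
    if (IS_ERR(srq))
            pr_err("ib_create_srq() failed: %ld\n", PTR_ERR(srq));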
However, libibverbs doesn't support XRC at all.
Some notes:
1) There are some OFED distributions (such as MLNX-OFED) that have XRC support.
2) Patches that extend libibverbs to support XRC were sent to the mailing list, but they weren't (yet?) accepted into the libibverbs upstream.
Fast Memory Region (FMR)
The IB spec defines registering an FMR in a Send Request. However, the RDMA stack in the kernel also provides verbs that allow creating FMR pools, and not only registration using Send Requests.
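A rough sketch of creating such a pool with the kernel API (the parameter values are arbitrary and pd is assumed to be an already-allocated struct ib_pd *):

    #include <rdma/ib_fmr_pool.h>

    struct ib_fmr_pool_param params = {
            .max_pages_per_fmr = 64,
            .page_shift        = PAGE_SHIFT,
            .access            = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_READ,
            .pool_size         = 1024,
            .dirty_watermark   = 32,
            .cache             = 1,
    };
    struct ib_fmr_pool *pool = ib_create_fmr_pool(pd, &params);
    if (IS_ERR(pool))
            pr_err("ib_create_fmr_pool() failed: %ld\n", PTR_ERR(pool));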
General
The InfiniBand spec defines special return values for errors that may happen when calling the verbs (for example: Invalid HCA handle, Invalid protection domain, Insufficient resources to complete request and more). The RDMA stack in the kernel and libibverbs use errno values instead.
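For example, a minimal sketch of detecting a registration failure in libibverbs (pd, buf and len are assumed to exist):

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
            /* the failure reason is reported through errno, not a spec return code */
            fprintf(stderr, "ibv_reg_mr() failed: %s\n", strerror(errno));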
Comments
Tell us what you think.
hi Dotan,
1) Why wasn't RD considered for implementation? Is it because it doesn't have any use cases?
2) XRC is more relevant in user space (libibverbs), as MPI may benefit from it if it's in user space. Why is it restricted to the kernel stack?
Also, it was there in OFED-1.5.4's libibverbs but was removed from OFED-3.5. Any reason?
Hi Mahesh.
1) IMHO, RD has a lot of use cases. However, (AFAIK) there isn't a single HW device that supports it. For this reason, the RDMA stack didn't add any support for it.
2) The answer is a little bit complicated:
XRC is mostly relevant for user space. However, the RDMA stack (kernel part) added support only in the kernel space.
There are some suggestions (and patches) to extend libibverbs in order to support XRC, but they weren't (yet) accepted.
For your question about the XRC removal, I *think* that the methodology for deciding the content of the OFED distribution was changed (to take content only from the upstream).
Thanks
Dotan
Hi Dotan,
I have a large amount of data and want to send it in a block-wise manner. You say there isn't any implementation of Memory Windows. So how can I handle the problem?
Hi Baturay.
What is the reason that you think that Memory Windows will help you? How did you plan to use them?
Thanks
Dotan
Actually, I have a large integer vector and want to send it block by block. So I've planned that my program automatically registers the blocks in my vector. I mean, I don't want to deregister and register the MR with the block's address every time. I thought some windowing operation may help to solve the problem.
Hi Baturay.
Yes, I agree that Memory Windows could be handy for your task.
I wonder, what is the reason that you can't (or don't want to) (re)use the same Memory Region every time?
Thanks
Dotan
Hi Dotan.
I will use RDMA Write. So when I reuse the same MR every time, I have to send the virtual address and remote key to the other side every time, and it will take time. The communication cost is important in my study. That's the reason.
Hi Baturay.
But this issue won't be eliminated with Memory Windows;
you'll still need to send the virtual address and the remote key (of the Memory Window).
Thanks
Dotan
Hi Dotan,
Oh, thanks. But I wonder, if this is the case, what is the advantage of MW compared to MR?
The advantage of Memory Windows over Memory Regions is:
Lightweight generation of r_keys (with changing permissions).
If you register and deregister memory, it will take a lot of time.
However, binding a Memory Window to a Memory Region will generate a new r_key
in a short time. If you want to invalidate this r_key, it takes a short time as well
(since the Region is already registered).
I hope that I was clear on this.
Thanks
Dotan
Hi Dotan,
I understand. Actually, what I want to implement is this: I want to register memory for the whole vector at once, and use some blocks of it without deregistering and registering again. Also, I don't want to do a memcpy and, of course, don't want to send the virtual address and r_key every time. I hope I explained my problem clearly. How can I handle this issue, in your opinion?
I would suggest registering the memory buffers several times, with different permissions (if needed),
and providing the remote side with the appropriate remote key + address of the block that it needs to access.
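A hypothetical sketch of this idea (vec, vec_size, block_size and pd are illustrative names, not taken from your code):

    size_t nblocks = vec_size / block_size;
    struct ibv_mr **mrs = calloc(nblocks, sizeof(*mrs));

    for (size_t i = 0; i < nblocks; i++) {
            mrs[i] = ibv_reg_mr(pd, (char *)vec + i * block_size, block_size,
                                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
            /* send mrs[i]->rkey and the block's address to the peer once,
             * e.g. during connection establishment */
    }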
Thanks
Dotan
Hi Dotan,
Thanks for the posts! There is lots of information here that is difficult to find elsewhere.
I have a question about the FMR section. As you said, "the RDMA stack in the kernel also provides verbs that allow creating FMR pools, and not only registration using Send Requests." Assuming my RDMA cards support both methods (i.e., the FMR pool method and the Send Request method), which one will have better performance in general?
BTW, my cards are the Mellanox ConnectX-3 Pro EN 40 Gigabit.
Thanks,
Jack
Hi Jack.
You are welcome
:)
It is hard for me to answer this question, and I would suggest that you write a benchmark for your typical scenario and check which approach provides the better performance.
However, if you would ask me to guess:
I would say that registration using a Work Request will provide the better performance.
But again, this needs to be tested ...
Thanks
Dotan