ibv_post_send()
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr);
Description
ibv_post_send() posts a linked list of Work Requests (WRs) to the Send Queue of a Queue Pair (QP). ibv_post_send() goes over all of the entries in the linked list, one by one, checks that each one is valid, generates an HW-specific Send Request out of it and adds it to the tail of the QP's Send Queue, without performing any context switch. The RDMA device will handle it (later) in an asynchronous way. If there is a failure in one of the WRs, because the Send Queue is full or one of the attributes in the WR is bad, processing stops immediately and a pointer to that WR is returned (a sketch of posting a chained list of WRs follows the list below). The QP will handle Work Requests in the Send Queue according to the following rules:
- If the QP is in the RESET, INIT or RTR state, an immediate error should be returned. However, some low-level drivers may not follow this rule (to eliminate an extra check in the data path, thus providing better performance), and posting Send Requests in one or all of those states may be silently ignored.
- If the QP is in RTS state, Send Requests can be posted, and they will be processed.
- If the QP is in SQE or ERROR state, Send Requests can be posted, and they will be completed with error.
- If the QP is in SQD state, Send Requests can be posted, but won't be processed.
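As a complement to the examples later in this post, here is a minimal sketch (an illustration, not taken from the original text) of posting a chain of two Send Requests in a single ibv_post_send() call; qp, mr, buf1, buf2, len1 and len2 are placeholder names for resources that are assumed to already exist:
struct ibv_sge sge[2];
struct ibv_send_wr wr[2];
struct ibv_send_wr *bad_wr = NULL;

memset(sge, 0, sizeof(sge));
sge[0].addr   = (uintptr_t)buf1;
sge[0].length = len1;
sge[0].lkey   = mr->lkey;
sge[1].addr   = (uintptr_t)buf2;
sge[1].length = len2;
sge[1].lkey   = mr->lkey;

memset(wr, 0, sizeof(wr));
wr[0].wr_id      = 1;
wr[0].sg_list    = &sge[0];
wr[0].num_sge    = 1;
wr[0].opcode     = IBV_WR_SEND;
wr[0].send_flags = IBV_SEND_SIGNALED;
wr[0].next       = &wr[1];   /* link to the second WR */

wr[1].wr_id      = 2;
wr[1].sg_list    = &sge[1];
wr[1].num_sge    = 1;
wr[1].opcode     = IBV_WR_SEND;
wr[1].send_flags = IBV_SEND_SIGNALED;
wr[1].next       = NULL;     /* last WR in the chain */

if (ibv_post_send(qp, &wr[0], &bad_wr)) {
    /* bad_wr points to the first WR that failed; the WRs before it were posted */
    fprintf(stderr, "Error, ibv_post_send() failed on wr_id %d\n", (int)bad_wr->wr_id);
    return -1;
}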
The struct ibv_send_wr describes the Work Request to the Send Queue of the QP, i.e., Send Request (SR).
struct ibv_send_wr {
    uint64_t            wr_id;
    struct ibv_send_wr *next;
    struct ibv_sge     *sg_list;
    int                 num_sge;
    enum ibv_wr_opcode  opcode;
    int                 send_flags;
    uint32_t            imm_data;
    union {
        struct {
            uint64_t remote_addr;
            uint32_t rkey;
        } rdma;
        struct {
            uint64_t remote_addr;
            uint64_t compare_add;
            uint64_t swap;
            uint32_t rkey;
        } atomic;
        struct {
            struct ibv_ah *ah;
            uint32_t       remote_qpn;
            uint32_t       remote_qkey;
        } ud;
    } wr;
};
Here is the full description of struct ibv_send_wr:
wr_id | A 64-bit value associated with this WR. If a Work Completion is generated when this Work Request ends, it will contain this value |
next | Pointer to the next WR in the linked list. NULL indicates that this is the last WR |
sg_list | Scatter/Gather array, as described in the table below. It specifies the buffers that will be read from or written to, depending on the opcode. The entries in the list can specify memory blocks that were registered by different Memory Regions. The message size is the sum of the lengths of all of the memory buffers in the scatter/gather list |
num_sge | Size of the sg_list array. This number can be less than or equal to the number of scatter/gather entries that the Queue Pair was created to support in the Send Queue (qp_init_attr.cap.max_send_sge). If this size is 0, this indicates that the message size is 0 |
opcode | The operation that this WR will perform. This value controls the way that data is sent, the direction of the data flow and the attributes used in the WR. The value can be one of the IBV_WR_* opcodes listed in the transport-type table below |
send_flags | Describes the properties of the WR. It is either 0 or the bitwise OR of one or more of the following flags: IBV_SEND_FENCE (don't start processing this WR until all previous RDMA Read and Atomic operations posted to this Send Queue have been completed; relevant only for QPs that support those operations), IBV_SEND_SIGNALED (generate a Work Completion for this WR; relevant only when the QP was created with sq_sig_all=0), IBV_SEND_SOLICITED (set the solicited event indicator in the message; relevant for the Send and RDMA Write with immediate opcodes), IBV_SEND_INLINE (send the data in the gather list as inline data; the memory doesn't have to be registered and the lkey won't be checked) |
imm_data | (optional) A 32-bit value, in network byte order, that is sent in the Send With Immediate and RDMA Write With Immediate opcodes along with the payload to the remote side, and is placed in a Receive Work Completion rather than in a remote memory buffer |
wr.rdma.remote_addr | Start address of remote memory block to access (read or write, depends on the opcode). Relevant only for RDMA WRITE (with immediate) and RDMA READ opcodes |
wr.rdma.rkey | r_key of the Memory Region that is being accessed at the remote side. Relevant only for RDMA WRITE (with immediate) and RDMA READ opcodes |
wr.atomic.remote_addr | Start address of remote memory block to access |
wr.atomic.compare_add | For Fetch and Add: the value that will be added to the content of the remote address. For compare and swap: the value to be compared with the content of the remote address. Relevant only for atomic operations |
wr.atomic.swap | Relevant only for compare and swap: the value to be written in the remote address if the value that was read is equal to the value in wr.atomic.compare_add. Relevant only for atomic operations |
wr.atomic.rkey | r_key of the Memory Region that is being accessed at the remote side. Relevant only for atomic operations |
wr.ud.ah | Address Handle (AH) that describes how to send the packet. This AH must remain valid until every posted Work Request that uses it is no longer considered outstanding. Relevant only for UD QPs |
wr.ud.remote_qpn | QP number of the destination QP. The value 0xFFFFFF indicates that this is a message to a multicast group. Relevant only for UD QPs |
wr.ud.remote_qkey | Q_Key value of remote QP. Relevant only for UD QP |
The following table describes the supported opcodes for each QP Transport Service Type:
Opcode | UD | UC | RC |
---|---|---|---|
IBV_WR_SEND | X | X | X |
IBV_WR_SEND_WITH_IMM | X | X | X |
IBV_WR_RDMA_WRITE | | X | X |
IBV_WR_RDMA_WRITE_WITH_IMM | | X | X |
IBV_WR_RDMA_READ | | | X |
IBV_WR_ATOMIC_CMP_AND_SWP | | | X |
IBV_WR_ATOMIC_FETCH_AND_ADD | | | X |
struct ibv_sge describes a scatter/gather entry. The memory buffer that this entry describes must remain registered until every posted Work Request that uses it is no longer considered outstanding. The order in which the RDMA device accesses the memory in a scatter/gather list isn't defined; this means that if some of the entries overlap the same memory address, the content of that address is undefined (a sketch of a gather list with two entries follows the table below).
struct ibv_sge {
    uint64_t addr;
    uint32_t length;
    uint32_t lkey;
};
Here is the full description of struct ibv_sge:
addr | The address of the buffer to read from or write to |
length | The length of the buffer in bytes. The value 0 is a special value and is equal to 2^31 bytes (and not zero bytes, as one might imagine) |
lkey | The Local key of the Memory Region that this memory buffer was registered with |
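The following is a minimal sketch (an illustration, not part of the original examples) of a Send Request whose message is gathered from two registered buffers; hdr_buf/hdr_mr and data_buf/data_mr are placeholder names, and the QP is assumed to have been created with max_send_sge >= 2:
struct ibv_sge sg[2];
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(sg, 0, sizeof(sg));
sg[0].addr   = (uintptr_t)hdr_buf;
sg[0].length = hdr_len;
sg[0].lkey   = hdr_mr->lkey;
sg[1].addr   = (uintptr_t)data_buf;
sg[1].length = data_len;
sg[1].lkey   = data_mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = sg;
wr.num_sge    = 2;   /* message size on the wire = hdr_len + data_len */
wr.opcode     = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED;

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}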
Sending inline data is an implementation extension that isn't defined in any RDMA specification: it allows the data itself to be placed in the Work Request that is posted to the RDMA device (instead of only the scatter/gather entries that point to it). The memory that holds this message doesn't have to be registered. There isn't any verb that specifies the maximum message size that can be sent inline in a QP, and only some of the RDMA devices support this feature. In some RDMA devices, creating a QP will set the value of max_inline_data to the size of messages that can be sent inline using the requested number of scatter/gather elements of the Send Queue. In others, one should explicitly specify the inline message size before the creation of the QP; for those devices, it is advised to try to create the QP with the required message size and keep decreasing it if the QP creation fails. A sketch of posting an inline Send follows the list below. While a WR is considered outstanding:
- If the WR sends data, the local memory buffers' content shouldn't be changed since one doesn't know when the RDMA device will stop reading from it (one exception is inline data)
- If the WR reads data, the local memory buffers' content shouldn't be read since one doesn't know when the RDMA device will stop writing new content to it
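Here is a minimal sketch (hedged, since inline support and its maximum size are device-specific) of posting a small Send inline; buf_addr and buf_size are placeholders, and buf_size must not exceed the QP's max_inline_data value:
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr   = (uintptr_t)buf_addr; /* doesn't have to be registered memory */
sg.length = buf_size;            /* must be <= the QP's max_inline_data  */
sg.lkey   = 0;                   /* the lkey isn't checked for inline data */

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED | IBV_SEND_INLINE;

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}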
Parameters
Name | Direction | Description |
---|---|---|
qp | in | Queue Pair that was returned from ibv_create_qp() |
wr | in | Linked list of Work Requests to be posted to the Send Queue of the Queue Pair |
bad_wr | out | A pointer that will be filled with the address of the first Work Request whose processing failed |
Return Values
Value | Description |
---|---|
0 | On success |
errno | On failure; no change is done to the QP, and bad_wr points to the SR that failed to be posted |
EINVAL | Invalid value provided in wr |
ENOMEM | Send Queue is full or not enough resources to complete this operation |
EFAULT | Invalid value provided in qp |
Examples
1) Posting a WR with the Send operation to a UC or RC QP:
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr   = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey   = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED;

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}
2) Posting a WR with the Send with immediate operation to a UD QP:
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr   = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey   = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_SEND_WITH_IMM;
wr.send_flags = IBV_SEND_SIGNALED;
wr.imm_data   = htonl(0x1234);
wr.wr.ud.ah          = ah;
wr.wr.ud.remote_qpn  = remote_qpn;
wr.wr.ud.remote_qkey = 0x11111111;

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}
3) Posting a WR with an RDMA Write operation to a UC or RC QP:
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr   = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey   = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey        = remote_key;

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}
4) Posting a WR with an RDMA Write with immediate operation to a UC or RC QP:
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr   = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey   = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM;
wr.send_flags = IBV_SEND_SIGNALED;
wr.imm_data   = htonl(0x1234);
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey        = remote_key;

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}
5) Posting a WR with an RDMA Read operation to an RC QP:
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr   = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey   = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_RDMA_READ;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey        = remote_key;

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}
6) Posting a WR with a Compare and Swap operation to an RC QP:
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr   = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey   = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_ATOMIC_CMP_AND_SWP;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = remote_address;
wr.wr.atomic.rkey        = remote_key;
wr.wr.atomic.compare_add = 0ULL; /* expected value in remote address */
wr.wr.atomic.swap        = 1ULL; /* the value that remote address will be assigned to */

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}
7) Posting a WR with a Fetch and Add operation to an RC QP:
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr   = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey   = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = remote_address;
wr.wr.atomic.rkey        = remote_key;
wr.wr.atomic.compare_add = 1ULL; /* value to be added to the remote address content */

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}
8) Posting a WR with the Send operation, with zero bytes, to a UC or RC QP:
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = NULL;
wr.num_sge    = 0;
wr.opcode     = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED;

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}
FAQs
Does ibv_post_send() cause a context switch?
No. Posting a SR doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).
How many WRs can I post?
There is a limit to the maximum number of outstanding WRs for a QP. This value was specified when the QP was created.
Can I know how many WRs are outstanding in a Work Queue?
No, you can't. You should keep track of the number of outstanding WRs according to the number of posted WRs and the number of Work Completions that you polled.
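A minimal bookkeeping sketch (an illustration, not from the original text) that follows this advice; it assumes every posted Send Request is signaled, so that each polled Work Completion retires exactly one WR, and max_send_wr is the value that was used when the QP was created:
static int outstanding; /* Send Requests posted but not yet completed */

int post_tracked_send(struct ibv_qp *qp, struct ibv_send_wr *wr, int max_send_wr)
{
    struct ibv_send_wr *bad_wr;

    if (outstanding == max_send_wr) {
        /* Send Queue is full; the caller should reap completions first */
        return -1;
    }

    if (ibv_post_send(qp, wr, &bad_wr))
        return -1;

    outstanding++;
    return 0;
}

void reap_send_completions(struct ibv_cq *cq)
{
    struct ibv_wc wc;

    /* every polled Work Completion retires one signaled Send Request */
    while (ibv_poll_cq(cq, 1, &wc) > 0)
        outstanding--;
}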
Is the remote side aware of the fact that RDMA operations are being performed in its memory?
No, this is the idea of RDMA.
If the remote side isn't aware that RDMA operations are being performed in its memory, isn't this a security hole?
Actually, no. For several reasons:
- In order to allow incoming RDMA operations to a QP, the QP should be configured to enable remote operations
- In order to allow incoming RDMA access to a MR, the MR should be registered with those remote permissions enabled
- The remote side must know the r_key and the memory addresses in order to be able to access remote memory
What will happen if I will deregister an MR that is used by an outstanding WR?
When processing a WR, if one of the MRs that are specified in the WR isn't valid, a Work Completion with error will be generated. The only exception for this is posting inline data.
What is the benefit from using IBV_SEND_INLINE?
Using inline data usually provides better performance (i.e. latency).
What is the difference between inline data and immediate data?
Using immediate data means that out-of-band data will be sent from the local QP to the remote QP: if this is a Send opcode, this data will exist in the Work Completion; if this is an RDMA Write with immediate opcode, a Receive Request will be consumed from the remote QP's Receive Queue and the immediate data will be placed in its Work Completion. Inline data influences only the way that the RDMA device gets the data to send; the remote side isn't aware of the fact that this WR was sent inline.
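For illustration (not part of the original answer), this is how the receive side typically sees immediate data, in the Work Completion of the consumed Receive Request:
struct ibv_wc wc;

if (ibv_poll_cq(cq, 1, &wc) > 0 && wc.status == IBV_WC_SUCCESS) {
    if (wc.wc_flags & IBV_WC_WITH_IMM)
        /* imm_data is delivered in network byte order */
        printf("Got immediate data 0x%x\n", ntohl(wc.imm_data));
}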
I called ibv_post_send() and I got segmentation fault, what happened?
There may be several reasons for this to happen:
1) At least one of the sg_list entries points to an invalid address
2) In one of the posted SRs, IBV_SEND_INLINE is set in send_flags, but one of the buffers in sg_list is pointing to an illegal address
3) The value of next points to an invalid address
4) Error occurred in one of the posted SRs (bad value in the SR or full Work Queue) and the variable bad_wr is NULL
5) A UD QP is used and wr.ud.ah points to an invalid address
Help, I've posted a Send Request and it wasn't completed with a corresponding Work Completion. What happened?
In order to debug this kind of problem, one should do the following:
- Verify that a Send Request was actually posted
- Wait enough time, maybe a Work Completion will eventually be generated
- Verify that the logical port state of the RDMA device is IBV_PORT_ACTIVE
- Verify that the QP state is RTS
- If this is an RC QP, verify that the timeout value that was configured in ibv_modify_qp() isn't 0, since if a packet is dropped this may lead to an infinite timeout
- If this is an RC QP, verify that the combination of the timeout and retry_cnt values that were configured in ibv_modify_qp() doesn't mean that a long time will pass before a Work Completion with IBV_WC_RETRY_EXC_ERR is generated
- If this is an RC QP, verify that the rnr_retry value that was configured in ibv_modify_qp() isn't 7, since this may lead to infinite retries in case of an RNR flow
- If this is an RC QP, verify that the combination of the min_rnr_timer and rnr_retry values that were configured in ibv_modify_qp() doesn't mean that a long time will pass before a Work Completion with IBV_WC_RNR_RETRY_EXC_ERR is generated
How can I send a zero bytes message?
In order to send a zero-byte message, no matter what the opcode is, num_sge must be set to zero.
Can I (re)use the Send Request after ibv_post_send() returned?
Yes. This verb translates the Send Request from the libibverbs abstraction to a HW-specific Send Request and you can (re)use both the Send Request and the s/g list within it.
Comments
I have a question about whether a context switch occurs or not during an RDMA operation. Here (page 15) it is shown that a user space verbs call results in a call of the hardware specific driver (e.g. mlx4). That "lives" in kernel space. So, does ibv_post_send() (RDMA mode) cause a context switch, or not? Can you clarify this for me please.
Also, if ibv_post_send() never causes a context switch, then why there is an implementation of ibv_post_send() in the linux kernel. When is this function (inside the kernel) called?
Thanks!
This is a great question!
Every control operation (i.e. create/destroy/modify/query to any resource) will cause a context switch.
However, the data operations won't create a context switch and from the same context,
one can post new Work Request (either to the Send or Receive Queues).
In the example, you mentioned "mlx4"; the create Queue Pair will perform a context switch and the following libraries/modules will be called in order:
libibverbs -> libmlx4 -> libibverbs -> ib core -> mlx4
In order to post a Send Request, the following libraries/modules will be called in order:
libibverbs -> libmlx4
i.e. no context switch will happen at all.
However, if there are devices (or low-level drivers) that don't support posting Send Requests without a context switch, libibverbs has prepared the infrastructure to allow posting the Work Requests at the kernel level.
Personally, I don't know about any device that uses those functions.
I hope that I answered all of your questions.
Thanks
Dotan
Yes, you did! Thanks!
ps: I forgot to paste the link I was referring to in my first post. Here it is (from OpenFabrics) --> https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CC8QFjAA&url=https%3A%2F%2Fwww.openfabrics.org%2Fofa-documents%2Fpresentations%2Fdoc_download%2F522-openfabrics-training-programs.html&ei=fW83UffUDo3Osgam0oDYCg&usg=AFQjCNEyvglOCK0V-6jnoGsqyiYEH3kQDw&bvm=bv.43287494,d.Yms&cad=rja
So in page 15 the Hardware Specific Driver (yellow box) might be the libmlx4 depending on the implementation (or it might be mlx4 linux kernel module in ./drivers/infiniband/hw/mlx4 otherwise). Am I right?
Thanks for this link, now I fully understand your question...
The Hardware Specific Driver (yellow box) is the mlx4 kernel part (since this section describes the kernel space modules). The User level APIs (white box) is the libibverbs and libmlx4.
(Do you see the "kernel bypass" line? this means direct access to the HW without need for performing context switch).
Yes, I see the "kernel bypass" line. But that makes a contradiction. Kernel bypass from the one hand, but libmlx4 calls something (Hardware Specific Driver (mlx4 kernel module)) that "lives" inside the kernel (kernel context)). Except if the author of the diagram is meaning that the line is going to the Infiniband HCA directly (firmware code). :P
Sorry for being persistent!
It is o.k.
:)
The "kernel bypass" means that in the data path, your user level code will be able
to work directly with the HW (without performing a context switch).
Please remember that the kernel level must be involved in the control part in order to
sync the resources (between different processes/modules) and configure the HW since user
level application can't write directly in the device memory space (since this is a privileged operation).
In this slide, I can see that there are two lines:
1) First line that specify kernel bypass (for the data path)
2) Second line that specify that the user level will call to the open fabrics kernel level verbs
I hope that I answered all of your questions.
If you enjoy this blog, please publish it to other people as well.
Thanks
Dotan
Hi Dotan,
I have few questions about ibv_post_send():
1. If I issue one large send request,
will (or can it be) served by multiple
smaller receive buffers? or does one
send request can never use multiple recv
buffers?
2. when would I need to use IBV_SEND_SIGNALED and IBV_SEND_SOLICITED?
3. Can Receive buffer be a gather list and
HCA will dma the received data to appropriate gather elements?
Hi Jay.
I'll try to answer:
1) I assume that you mean that you send a big message over the wire,
At the receive side you can split this message into as many scatter elements
as you wish (this is a local attribute).
To summarize it:
When using RDMA operation(s): only one contiguous buffer can be used
When using the Send operation: the receive side can post a Receive Request with one
or more scatter elements (as long as the sum of the buffer sizes is able to hold
all of the message).
2) IBV_SEND_SIGNALED should be used if the QP was created with sq_sig_all=0
(which means that not all Send Requests will generate Work Completion when completed).
IBV_SEND_SOLICITED should be used when the remote side is reading the Work Completions
using events (and not in polling mode). Please check my post about ibv_req_notify_cq()
for more details.
3) Yes, this is exactly what the RDMA device will do in the Receive side,
when using the Send operation. Please keep in mind that those memory buffers should
be registered first.
I hope that I answered all of your questions.
If you enjoy this blog, please publish it to other people as well.
Thanks
Dotan
Hi Dotan,
Please let me rephrase the question #1 -
Receive side has posted two receive work
request with n-bytes worth of buffer each.
So receiver has total of 2n byte buffer
available.
Now sender issues one send work request
with total of 2n+m byte data.
Can receiver use two receive work requests
to satisfy one send work request?
When using RDMA operations you said one single contig. buffer can be used.
Do you mean RDMA write OR RDMA read?
Thank you so much for your reply.
Jay
Hi Jay.
The Receive Request is working in resolution of messages and not in resolution of bytes.
Every Receive Request will handle only one incoming message:
for each incoming message one Receive Request will be fetched from the head
of the Receive Queue. The messages will be handled by the order of their arrival.
In your example there are 2 Receive Requests that each has n bytes:
* Receiving a message of n bytes or less, is fine
* Receiving a message with more than n bytes will cause an error (since there isn't enough room to hold the message)
When working with RDMA operations:
* RDMA Write can read one or more local gather entries and write them to one remote contiguous block
* RDMA Read can read from one remote contiguous block and write it locally to one or more scatter entries
If you have more questions, you are more than welcome to ask..
:)
Dotan
when I send a 1024 byte block in IBV_WR_RDMA_WRITE mode, everything is ok, but if the block size is set larger (e.g. 4096 bytes), I get an IBV_WC_LOC_PROT_ERR error and then many IBV_WC_WR_FLUSH_ERR errors for the send CQ, can u help me
Hi.
Please check the memory buffers in the gather list of the Send Request, I suspect that you try to access memory that wasn't registered.
Thanks
Dotan
ibv_post_send returns -1,what is the problem ? thanks for your help
Hi.
There can be several reasons:
* The Send Request has invalid value(s)
* The Send Queue is full
Not all of the low-level drivers return errno to indicate errors
(some of them returned -1 in the past and now return errno).
It depends on the library that you use and its version.
Thanks
Dotan
Hi Dotan, I'm running into a problem with ibv_post_send and hoping you can provide some guidance. I've adapted the rc_ping_pong program to exchange 312 byte messages among nodes in a 32-machine IB cluster, except that I use an epoll() based mechanism to call ibv_poll_cq(). Several messages later (around 58900 to be exact), ibv_post_send() fails returning ENOMEM and errno set to 2. Both sides of the connection are in good states: IBV_PORT_ACTIVE & IBV_QPS_RTS. When I keep track of sends posted vs sends completed I find that during the failure (posted-completed) = 31, always. However I have only max_send_wr=1 when I created the qp. So I'm not sure what's going on. On the receive side I guarantee posts (rx_depth=800 and whenever it drops to 400 I post 400 more). Any help is much appreciated, and if you need further clarifications please let me know.
Thanks much
Sara
Hi Sara.
I will try to help you
:)
If ibv_post_send() itself fails that means that either:
The Send Queue is full (i.e. all of the Work Requests in the Send Queue are outstanding)
or
The posted Send Request is illegal:
* too many scatter/gather elements
* too much inline data (if inline data is used)
* wrong opcode
Please check if this helps you:
if you are sure that the Send Queue isn't full, dump the Send Request and check what I suggested above.
Thanks
Dotan
Thanks for the quick response Dotan.
I'm leaning towards full queue rather than illegal request because:
1. They've been going through fine for all the previous posts, and
2. I simply reuse circular buffers for subsequent sends
3. I inspected the wr (bad_wr points to it) during failure and it looks okay:
(gdb) p wr
$1 = {wr_id = 1, next = 0x0, sg_list = 0x7fcaca7fbcb8, num_sge = 1, opcode = IBV_WR_SEND, send_flags = 2, imm_data = 0, wr = {rdma = {remote_addr = 0, rkey = 0}, atomic = {remote_addr = 0,
compare_add = 0, swap = 0, rkey = 0}, ud = {ah = 0x0, remote_qpn = 0, remote_qkey = 0}}}
(gdb) p *wr->sg_list
$11 = {addr = 49981952, length = 312, lkey = 175104}
I'm confused about two things though (if send queue full is the problem):
1. ibv_post_send() returns ENOMEM (and not -ENOMEM which is what the drivers seem to return when kmalloc fails or something similar)
2. errno=2 which is also weird, I'm unable to find out exactly who sets it & why
I've also tried running it through valgrind to check invalid memory and it looks clean.
Any pointers?
Thanks
Sara
Hi Sara.
I'll try to help here:
1) User level libraries return positive errno values and not negative ones
(kernel level drivers return negative errno values)
2) I don't know where the errno=2 came from. libmlx4 almost doesn't set the errno value
at all..
Did you poll all of the completions from the CQ?
Once you have the failure in the ibv_post_send(), did you try to empty the CQ and try to post the Send Request again?
(since the QP should still be in a good shape)
Thanks
Dotan
Thanks, Dotan! Once I reach this point, all polls keep returning 0, and if I attempt to post more sends I run into the same issue. The other side is sitting idle doing an epoll_wait() with plenty of recvs posted. So it doesn't look like an easy problem to solve. I'll try a few more experiments & update (in case someone runs into similar issues later).
Sara
This will be great, thanks!
Dotan
Just wanted to update on this issue real quick. I restructured the code quite a bit to make it extensible and now I don't hit upon the issue anymore. So most likely some bad coding on my part - if I had more time to spare I'll explore in detail but unfortunately I'm on a deadline so don't have a clear answer :(
Thanks for your help Dotan!
Hi Sara.
I'm happy that you overcome the bug
:)
You are most welcome!
Dotan
Hi Dotan,
I'm receiving 'remote invalid request error' (IBV_WC_REM_INV_REQ_ERR) with RDMA_READ requests. I checked buffer sizes, access rights, and QP type and all seems fine to me. RDMA_WRITE works, and since the only difference is the opcode (as far as I know), I don't understand the issue.
BTW: I'm new to RDMA programming and your side really helps a lot!
Thanks so far.
Hi Stefan.
Sharing the code will be great (it will allow me to review it and give feedback..)
Nevertheless, I will try to help you
:)
Assuming that you have both RDMA Read and RDMA Write code,
the delta between the RDMA Write and the RDMA Read support should be:
1) The QP type is IBV_QPT_RC
2) The mask IBV_ACCESS_REMOTE_READ is enabled in the responder's MR
3) The mask IBV_ACCESS_REMOTE_READ is enabled in the responder's QP (qp_access_flags)
4) The values of max_rd_atomic/max_dest_rd_atomic aren't zero
(setting the value to one in both sides isn't efficient but will do the trick)
5) verify that the r_key is correct (although if it worked with RDMA Write, it should be valid)
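For reference, a minimal sketch of points 2 and 3 when setting the QP up manually with ibv_modify_qp() (an illustration with placeholder names, not code from the original reply); point 4 is set with IBV_QP_MAX_DEST_RD_ATOMIC in the INIT->RTR transition and IBV_QP_MAX_QP_RD_ATOMIC in the RTR->RTS transition:
/* point 2: register the responder's MR with remote read enabled */
mr = ibv_reg_mr(pd, buf, size,
                IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);

/* point 3: enable remote read in the responder's QP (RESET->INIT transition) */
struct ibv_qp_attr attr;

memset(&attr, 0, sizeof(attr));
attr.qp_state        = IBV_QPS_INIT;
attr.pkey_index      = 0;
attr.port_num        = 1;
attr.qp_access_flags = IBV_ACCESS_REMOTE_READ;

if (ibv_modify_qp(qp, &attr,
                  IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT |
                  IBV_QP_ACCESS_FLAGS)) {
    fprintf(stderr, "Error, ibv_modify_qp() to INIT failed\n");
    return -1;
}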
I hope that I helped you.
If you enjoy this blog, please publish it to other people as well.
Thanks
Dotan
Hi Dotan,
Thanks for the fast reply. I re-checked it all again and found:
1) .qp_type of ibv_qp_init_attr is IBV_QPT_RC (OK)
2) access mask was set by
if (!(remote_mr = ibv_reg_mr(remote_pd, pmydata->recv_buffer, pmydata->max,
IBV_ACCESS_REMOTE_WRITE |
IBV_ACCESS_LOCAL_WRITE |
IBV_ACCESS_REMOTE_READ))) {
perror("ibv_reg_mr");
return NULL;
}
Which left the flags of the QP unchanged. I set them now by calling ibv_modify_qp. The flags seem to be alright now, but the error remains.
3) Both communication partners have the same flags, for their QPs and MRs so this should be ok.
4) Both, max_rd_atomic and max_dest_rd_atomic are set to 1 by default here. I checked it and it should also be ok.
5) As you mention, since RDMA_WRITE works r_key,l_key, and remote_addr are ok. (I also re-checked that)
What seems strange is that ibv_modify_qp raised an invalid argument error when I called it with IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MAX_QP_RD_ATOMIC to modify the values, but modifying access flags works fine.
Code actually is a mess but basically consist of this parts:
* rdma_create_event_channel() to create event channels
* rdma_create_id() to create rdma_cm_id's
* rdma_bind_addr() and rdma_listen() on the server side
* rdma_resolve_addr() and rdma_resolve_route() on the client side
* ibv_create_cq(), ibv_alloc_pd(), rdma_create_qp() and ibv_reg_mr() to setup CQ,PD and register MR
* Exchange Key and memory Address
* Message setup:
// current message size
sge.length = imyproblemsize;
// Buffer address == MR address and is large enough
sge.addr = (uint64_t)pmydata->recv_buffer;
sge.lkey = client_mr->lkey;
snd_wr.sg_list = &sge;
snd_wr.num_sge = 1;
snd_wr.opcode = IBV_WR_RDMA_READ;
snd_wr.send_flags = IBV_SEND_SIGNALED;
snd_wr.next = NULL;
snd_wr.wr.rdma.remote_addr = rAddr;
snd_wr.wr.rdma.rkey = rKey;
* Start Work:
if (ibv_post_send(client_id->qp, &snd_wr, NULL)) {
perror("21 ibv_post_send");
return -21;
}
while (!ibv_poll_cq(client_cq, 1, &wc));
if (wc.status != IBV_WC_SUCCESS) {
printf("r0: wc.status: %s\n",ibv_wc_status_str(wc.status));
perror("22 ibv_poll_cq");
return -22;
}
The code is some kind of skeleton I wrote which originally covers send/receive, and that works fine. Also, modifying it to work with RDMA_WRITE caused no problem, but RDMA_READ does.
Thanks a lot.
Hi Stefan.
Can you call ibv_query_qp() when the QP should be in RTS state and verify that:
1) The QP state is RTS
2) The value of max_rd_atomic isn't zero
3) The value of max_dest_rd_atomic isn't zero
I suspect that the fact that ibv_modify_qp() failed is your problem.
(please check my post about ibv_modify_qp() and make sure that you
use the right flags for each QP state transition)
Thanks
Dotan
Hello Dotan,
I'm measuring latency between two RDMA NICs with IBV_WR_SEND
If I send a work request with IBV_SEND_SIGNALED flag, so when I get
IBV_WC_SEND event, does it mean that the message was delivered and the remote machine sent an ack back? Should I consider this time as a roundtrip?
Thanks.
Hi.
It depends on the used transport type:
* If this is reliable transport type (RC), when you get Work Completion in the sender side - this means that the message was written at the remote side (and an ACK was sent back)
* If this is unreliable transport type (UC/UD), when you get Work Completion in the sender side - this means that the message was sent through the local port (no ACK/NACK will be sent)
I hope that I answered your question.
Thanks
Dotan
Thanks a lot.
That what I've assumed.
I'm using RC, just to make clear, the following flow
1. post-receive
2. start timer
3. send message (IBV_WC_SEND)
4. wait for receive to complete (send on the other is posted only when message arrived)
5. stop timer
it measures: 2 messages + ACK for the first send + (optional: ACK to other side of received message)
Thanks.
Boris.
Exactly.
One tip though: if you care about latency, you should send the message inline'd
(if the message is small).
Thanks
Dotan
Hello Dotan
in ibv_post_send:
1. Are the ibv_send_wr list, and its sg_list destroyed automatically when the operation completes.
2. Or can I destroy them after the method call returns.
3. They have to be kept alive till receiving work completion.
Boris.
Thanks.
Hi Boris.
The sg_list array can safely be (re)used after ibv_post_send() ends:
the Send Request is enqueued to the Send Queue space of the Queue Pair
once it is posted.
Thanks
Dotan
Hi Dotan.
Is there any way to know, what is the max length of INLINE data can be sent in SEND or RDMA_WRITE ?
Hi.
Unfortunately, struct ibv_device_attr doesn't contain any attribute that specifies the maximum amount of inline data that can be sent.
When creating a QP, qp_init_attr->cap.max_inline_data is returned with the amount of inline data that can be sent in this QP.
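For illustration (not part of the original answer), ibv_create_qp() updates the requested cap values with the actual values the created QP supports, so the usable inline size can be read back right after creation:
struct ibv_qp *qp = ibv_create_qp(pd, &qp_init_attr);

if (qp)
    printf("This QP can send up to %u bytes inline\n",
           qp_init_attr.cap.max_inline_data);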
Thanks
Dotan
Hi,
I'm new to RDMA and run into a weird behavior, which I was hoping you could clarify for me:
I'm using IBV_WR_SEND to send a struct-object which contains some information needed for an RDMA-read later on (rkeys, address and so on).
Now in principle this works fine, but the strange behavior is that only if the object-size is a multiple of 2, does it work correctly. So I tried these cases:
sizeof(message) -> 16. This works
sizeof(message) -> 24. The last object-attribute is always wrong, the rest is correct.
sizeof(message) -> 32. This works again.
Is this normal? I have only seen restrictions about the minimum/maximum message size, but nothing that would hint at an additional restriction of this kind. Or did I something wrong somewhere?
Thank you very much!
Martin
Hi Martin.
I have a feeling that the problem isn't related to RDMA.
In RDMA the minimum message size can be even 0 bytes!
I have a feeling that the problem happens because of the way the compiler prepares the structure in memory
(padding, etc..).
In RDMA and in any other networking protocol the application needs to take care of how to transfer data between two machines since maybe the machines are different:
* CPU arch (32/64) bits
* Big/little endian
I have two suggestion here:
1) You can send me the source code for review, and I'll give you feedback
2) You can give me more information on what went wrong (since you didn't provide this information)
Thanks
Dotan
Hi,
thank you for your reply.
Sorry for my late response, but I was busy the last week.
So, I have a struct containing: int rkey, int remote buffer size, long remote address
If I send this, everything is fine. But now suppose I add "int id" to the struct. No matter which attribute is specified last in the struct (let's say for example "int id" is now the last one), that attribute is not received correctly, but gives a wrong value. All other attributes of that struct are correct.
You are probably correct that this is due to some little/big endian problem.
Thank you very much!
Cheers,
Martin
Hi Martin.
Do you want to share the code with me? This way I'll find your bug ...
Another way for you to handle it is to write (using sprintf()) the data to an array of characters,
and send this data as a string and not as a struct (and parse it on the remote side).
I hope that this tip helped you
Dotan
Hello Dotan,
I have the same problem as Stefan (I get IBV_WC_REM_INV_REQ_ERR with RDMA_READ requests. I tried to follow the advice you already posted here as much as possible, but I cannot sort that out myself.
I can send you a simple program which reproduces my problem, but I would need your email (and your agreement).
Best regards,
Philippe
Hi Philippe.
If you want to share the code with me, and I'll give you a hint
on the reason of this problem, you can send it to:
support at rdmamojo dot com
Thanks
Dotan
Hi Dotan,
I have a question about P_KEYS in BTH header. Once a relation is established between two QP's, both ends can modify the qp attribute pkey_index. Can both ends use different pkey_index (and ultimately different pkeys) ? i.e A can say B is using Pkey=X and B can say A is using Pkey=Y.
Thanks,
Jay
Hi Jay.
It doesn't matter which P_Key index each QP is pointing to
(since what really matters is the P_Key value itself, and different tables
*may* have the same P_Key values but in a different order).
If at some point, the P_Key values of both QPs won't be consistent,
the packet will be dropped
(InfiniBand spec: Figure 81 Packet Header Validation Process)
In your example: if X.key != Y.key, there will be a P-Key mismatch and
the QPs won't be able to communicate (this is the whole idea of the P_Key..)
I hope that I helped you.
Thanks
Dotan
I am trying to use IBV_WR_ATOMIC_CMP_AND_SWP to check a remote value and proceed accordingly. I have registered a 64 bit integer using ibv_reg_mr. and sent this remote address to the sending host. But i am getting a remote access error. The sample code you have provided is not complete.
In the sample code you have used
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;
Is buf_addr a 64 bit integer or a char buffer of size 8. Is it possible that you may send a complete code of a working compare and swap function.
Hi Omar.
I'm sorry, but I don't have any source code that I can share with you...
(I plan to write it in the future though)
Please make sure that:
1) The remote QP supports incoming Atomic operations
2) The remote MR supports incoming Atomic operations
3) The remote address is 8 bytes aligned
Thanks
Dotan
Hi, I came across the same problem, and still cannot figure it out. I can successfully process send/recv operation(which means qpn, psn and lid of the remote side is correct), but I fail at RDMA write operation, receiving the IBV_WC_REM_ACCESS_ERR (10) - Remote Access Error when I call ibv_poll_cq(). Any other comments besides the above three hints?? thanks in advance.
Hi.
Did you read my post on ibv_poll_cq()?
Anyway, check that RDMA Write is enabled in both the remote QP.qp_access_flags and remote MR.access.
Thanks
Dotan
Hi Dotan,
I have a question about how WRs are finished. Suppose I have built a RC connection between two QPs. First on the receive side, I post two recv WRs, say recv_wr1 and recv_wr2. Then on the send side, I post two send WRs, say send_wr1, send_wr2. My question is, is there any possibility that send_wr2 finishes before send_wr1? What about the receive side? Is is possible that recv_wr2 is finished before recv_wr1?
Thanks,
Jiajun
Hi Jiajun.
In terms of the Completion Queue of the Work Queues, you should see their Work Completions according to the order of the corresponding posted Send Requests.
In terms of the wire, this isn't an area that I'm fully familiar with, BUT:
if you send a message, every packet increases the PSN (in the Send Queue and in the remote Receive Queue),
so send_wr2 cannot be sent before send_wr1 was sent. Otherwise, it wouldn't be possible to detect missing packets (using the PSNs).
Anyway, you should (re)use the memory only after the relevant Work Request isn't outstanding any more.
I hope that this helped you.
Thanks
Dotan
Hi
My question might seem out of context for this post but it's important.
I have to ask you how to set up an all to all communication between a number of processes, some on same machine and some on different. What I have done is open a listening rdma_cm_id wait for incoming connection requests for each process and bind it to a specific port and create new rdma_cm_id when I have completed a connection request. This works fine if all processes are on different host machines, but if I start multiple processes on the same machine, I get a very slow performance or none at all, the system hangs as if in a deadlock. I had hoped that once I have a rdma_cm_id for each process than the processes should communicate without any problem. One thing is that I have only set up one communication channel but it should suffice for many clients (the man pages say this).
Regards
Omar
Hi Omar.
I really sorry, but I can't help you with this...
I don't have a lot of experience with rdma_cm (yet?).
If you want a good answer, I suggest that you'll send this question to Sean Hefty,
the writer and maintainer of rdma_cm.
Sorry again..
Dotan
It seems that the descriptions of IBV_WR_ATOMIC_CMP_AND_SWP and IBV_WR_ATOMIC_FETCH_AND_ADD are swapped.
Fixed, thanks.
Dotan
Hi Dotan,
What circumstances can make a Send Queue overflow?
In my program I perform an RDMA Write in a loop (every time with the same source/destination addresses, just to test), and after a while I constantly get ENOMEM from ibv_post_send(). It doesn't seem to be a race, as it always happens after the same count of iterations, and even sleeping ~1sec between iterations doesn't affect anything; besides, the number of successful iterations is correlated with the QP's max_send_wr. None of the WRs is "signaled" (I tried to poll the CQ at every iteration - it's empty).
I might be missing something basic in the QP configuration. What initialization parameter can cause such a behavior?
Thanks.
When creating a QP, you specify how many WRs can be outstanding in either the Send or Receive Queue.
A WR is considered outstanding until there is a Work Completion for it or for a WR that was posted after it in that Work Queue.
You posted many WRs (in your case, to the Send Queue) and all of them are outstanding.
From time to time, you need to make them "signaled" and read the Work Completions.
Thanks
Dotan
Oh, I see. This looks like a design flaw, doesn't it? At least, it's quite counter-intuitive behavior, as one would expect that an unsignaled WR gets removed from SQ silently as soon as it's processed - after all, that's the whole point of unsignaled WRs...
But if you don't get any Work Completion, how can you prevent from posting more WRs than the Work Queue size?
You *assume* that all the posted WRs were processed, in most cases it is true,
but there isn't any guarantee about it...
Thanks
Dotan
Well, if one produces WRs faster than the HCA can consume, the SQ will eventually overflow, and in *such* situation ENOMEM would be quite logical (like in any producer-consumer scheme) - but still, implicitly treating obviously consumed WRs as outstanding doesn't seem to fit well in this logic. Sometimes the producer can know for sure that he can never overflow the queue (for instance due to retry count/timeout settings vs. timings of WRs), and such a behavior of the queue would surprise him.
You're starting to get into the synchronization mechanism between the low-level driver and the HW...
Anyway, this is the behavior which the protocol defined.
Thanks
Dotan
Hi Dotan, joining the question on this issue. Is there any way (or will be) to block on ibv_post_send (until there is place in the work queue)?
Otherwise, in a multithreaded application, some synchronization semaphore-like mechanism must be applied, and it could be very costly...
Hi Boris.
Currently, there isn't any way to block the post_send if the Work Queue is full.
This would require a low-level library and API change (to prevent breaking the current behavior).
This isn't anything that I can help you with.
Sorry
Dotan
Dotan,
You're writing regarding the inline data that "the low-level driver (i.e. CPU) will read the data and not the RDMA device". Is this correct for both sides? I.e., on the responding side, will the HCA perform DMA for the inlined data, or will the CPU handle it?
Thanks a lot for your assistance.
This is relevant only for the local side, i.e. the side that fetches the data.
There isn't any hint that this was done once the data is being sent over the wire.
Thanks
Dotan
Hi Dotan,
Is there a more straightforward and efficient way to write a value atomically to the remote side than performing an RDMA Read followed by an atomic CAS? (There are no stores to this location on the remote side, only loads, but the value must appear consistently/atomically.)
Thanks.
Hi Igor.
The only supported atomic operations in RDMA are:
* Fetch and Add
* Compare and Swap
I don't know what you are trying to achieve, but using them you can implement
a mutual exclusion primitives.
What about sending a message using "Send" and increment the value locally using a good old mutex/semaphore/spinlock?
Thanks
Dotan
Due to some constraints I can't use send/receive flow...
What is the level of atomicity of a regular RDMA Write? I.e., does the remote HCA store to its local memory in bytes or in words?
I'm sorry, but I can't provide a good answer here.
RDMA supports sending a stream of bytes and AFAIK there isn't any guarantee about atomic access of more than one byte.
Multiple testing may show you that atomicity of words (or more) is achieved, but there may be scenario that this won't be the case...
Dotan
Hi Dotan,
Great website. Thanks for all the work.
Question about posting WRs. If I post a WR to a WQ, does a copy of the WR get made so that after ibv_post_send() completes, I am free to overwrite that WR for my own purposes? Or is just a pointer to that WR posted to the WQ, so I have to keep it intact until the completion occurs? I tried to find the internal representation of the WQs to see if I could deduce the answer myself, but no luck.
Thanks
:)
Short answer: yes.
Long answer: the low-level driver translates the Work Request structure from the verbs API to the HW API
and posts this HW-specific WR to the relevant Work Queue.
After the verb of posting the WR returns, you are free to change this WR structure.
If you want to see how this is done, you need to check the code of the low-level drivers...
Thanks
Dotan
Hi Dotan,
Your site is a huge help!
Regarding reuse of WR, are the ibv_sge elements copied as well?
From my reading of the code they are copied but can i reuse them when ibv_post_send returns?
Also is there a restriction on multiple WR with the same wr_id?
For example can the same id be used to identify a chain of WR posted together?
Thanks!
Thanks!
Yes. The s/g list is copied to the QP's Send Queue, and it can be reused.
About the wr_id; it is a user defined private data and can contain any value that you wish..
(including multiple WRs with the same wr_id).
Sure
Dotan
Hi
thx for your previous answers.
I was wondering: Is there a performance difference between IBV_WR_RDMA_WRITE(_WITH_IMM) and IBV_WR_SEND(_WITH_IMM) ?
Also is there any advantage of having the remote post IBV_WR_RDMA_READ instead of posting IBV_WR_RDMA_WRITE(_WITH_IMM)/IBV_WR_SEND(_WITH_IMM) locally?
thx
Bernard
Hi Bernard.
In the following post you'll find most of your answers:
Tips and tricks to optimize your RDMA code
However, I'll answer your questions shortly:
Yes, there is a performance difference, so one should prefer using RDMA Write with immediate instead of Send with immediate.
RDMA Read is considered more "expensive" than RDMA Write or Send operations, so one should prefer the latter operations.
I hope that I helped
Dotan
Hi Dotan,
This is a fantastic website for RDMA learners! I have a question regarding the atomic operations. That is, how are the RDMA atomic operations (FetchAdd & CmpSwap) implemented? I guess there should be a locking mechanism that comes into play once an atomic operation is performed on some memory buffer. Is the lock implemented on the network (RNIC?), on the specific memory buffer, on the memory bus, or somewhere else?
Thanks in advance!
Henry
Hi Henry.
Thanks for the compliment.
:)
The atomic operations are atomic related to other atomic operations and not to any other operation or any other memory access.
I don't *know* the internal implementation but I can guess;
It depends on the supported atomicity level of the RDMA device:
* If it supports atomicity within the device - it may have an internal mechanism to prevent other atomic access to this memory
* If it supports atomicity between other devices - I guess that it will lock the bus or something like this.
AFAIK, until now atomicity is supported only within the device.
I hope that this answer helped you.
Thanks
Dotan
Hi Dotan,
> The atomic operations are atomic related to other atomic operations and not to any other operation or any other memory access.
Do you mean that if one modifies a remote value with eg. IBV_WR_ATOMIC_FETCH_AND_ADD, this modification will *not* appear as atomic for any other software (eg. running locally on that machine) that attempts to read this memory location?
Hi Igor.
Here is the exact quote from the InfiniBand specifications:
"o9-17: Atomicity of the read/modify/write on the responder’s node by the
ATOMIC Operation shall be assured in the presence of concurrent atomic
accesses by other QPs on the same CA."
It specifies how the RDMA device will handle the content of the memory and doesn't really mention other interfaces (such as software). For example: it *may* perform the following: read, modify, write, and perform the write 10 seconds after the read happened. During this time, the RDMA device will prevent any access to this memory by other Atomic operations. The (local) software isn't really aware of the operations that are done by the RDMA device...
Thanks
Dotan
Hi Dotan
I use ibv_post_send() doing an RDMA write, and I found that if num_sge is 4, it returns -1; if num_sge is 2 or 1, it works fine (the buffers are 4kB each).
How can I make it send 4(or more) num_sge buffers?
Thanks.
Zhang Yue
Hi Zhang Yue.
Can you send the output of:
ibv_devinfo | grep max_sge
Thanks
Dotan
hi Dotan,
The command output is these:
root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target# ibv_devinfo | grep max_sge
root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target# ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.32.5100
node_guid: f452:1403:0028:0820
sys_image_guid: f452:1403:0028:0823
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090120019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 3
port_lid: 4
port_lmc: 0x00
link_layer: IB
port: 2
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2
port_lmc: 0x00
link_layer: IB
root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target#
hi Dotan
I found that the queue pair config limits it:
qp_init_attr.cap.max_send_sge = 1; /* scatter/gather entries */
qp_init_attr.cap.max_recv_sge = 1;
I changed 1 to 16 and works.
Thanks, you are nice.
Zhang Yue
Hi Zhang Yue.
Thanks for the update.
I've updated the description of num_sge in the posts that describe the structures of Send Request and Receive Request to be more informative according to your problem.
Thanks
Dotan
In a UD QP, can you post an inline send with immediate data?
Yes, you can.
Thanks
Dotan
Hi Dotan,
I'd like to consult with you on the following subject: we perform IBV_WR_RDMA_WRITE to a remapped BAR of a remote PCI device and experience poor throughput. Using hardware monitoring tools we found out that the data was being written in 64-byte packets, and that's what caused the above issue.
My question is whether there's any configuration that could affect the way HCA writes the data?
I post a non-signalled rdma-write, with >1K of data as a single SGE, 4K MTU.
Hi Igor.
I'm sorry, but this is device specific and I don't know much about it.
However, I would check with the vendor of that PCI device to get more details.
Do you have performance problems when accessing the PCI device locally?
Maybe the way that this BAR is mapped to kernel can be improved?
I hope that this give you a hint...
Thanks
Dotan
It's "device-specific" in the sense that writing 64-byte packets causes the device to get the data slowly (which doesn't happen when HCA writes to RAM, or when we DMA'ing to this PCI device by other means) - the device vendor assured this assumption.
The BAR is remapped to a user-space virtual addresses with io_remap_pfn_range(), then registered as rdma memory-region using PeerMemory mechanism recently introduced in Mellanox OFED especially for this purpose.
I believe the remote (w.r.t to the PCI device) HCA sends the data over the fabric in MTU-sized chunks, so it's probably the local HCA that performs such a "slow", or PCI-unfriendly, DMA.
So, the question is whether we have any control over the way HCA performs the DMA?
Hi Igor.
AFAIK, there isn't any way to control the way the HCA performs the DMA.
I doubt it, but even if there are ways to do this; you'll need to get this info from the HW vendors..
Sorry.
Dotan
Hi,
Suppose I post two requests in the receive queue but for some reason I received the data for the second request before the first request. Is it possible to receive data for the second request before the first, or will it always give an error?
Hi Govind.
You have two Receive Requests in your Receive Queue
(the Receive Queue "knows" only the order of the posting of those Receive Request,
and this ordered is promised).
The next message that will enter to the Queue Pair that will consume a Receive Request will take
those Receive Requests according to the order that they were enqueues to it.
I understand that your application has the semantics of the first and second one,
however, the RDMA doesn't.
Bottom line, the answer is: no.
BTW it should always give an error. You didn't give me enough info,
but I believe that the problem is that the "first" Receive Request is small.
This can be fixed by making sure that all the Receive Requests can hold all the incoming messages ...
I hope that this helps you
Dotan
Hi all,
during ibv_post_send I am getting errno 0 and 2 for two different messages. Can someone please point me to some document where I can find a description of the errno values? I am using OFA RDMA APIs.
Hi.
Unfortunately, the errno return values aren't consistent for all low-level drivers in RDMA.
If you'll share the code, maybe I'll be able to answer you.
Thanks
Dotan
Hello,
I can successfully send RDMA READ/WRITE, but I can't get RDMA atomic operations to work. I get an error when calling ibv_post_send function in the client, and the errno will be set to "Invalid Arguments.". Below I pasted important parts of my code. Could you please check my code and let me know if I'm missing anything?
*********** client side *****************:
-- Registering the memory regions --
mr = ibv_reg_mr(pd, buff, size, IBV_ACCESS_LOCAL_WRITE);
// and the size is 8
if (!mr){
fprintf(stderr, "Error, memory registration failed\n");
return -1;
}
-- Preparing RDMA ATOMIC FETCH AND ADD --
struct ibv_send_wr wr, *bad_wr = NULL;
struct ibv_sge sge;
memset(&sge, 0, sizeof(sge));
sge.addr = buff;
sge.length = 8;
sge.lkey = mr->lkey;
memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = remote_buffer;
wr.wr.atomic.rkey = peer_mr->rkey;
wr.wr.atomic.compare_add = 1ULL; /* value to be added to the remote address content */
if (ibv_post_send(qp, &wr, &bad_wr)) {
fprintf(stderr, "Error, ibv_post_send() failed\n");
return -1;
}
********* End of Client side *******
****** Server side ****************
-- Registering the memory regions --
mr = ibv_reg_mr(pd, rdma_region_timestamp_oracle, sizeof(TimestampOracle),
IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC ));
if (!mr){
fprintf(stderr, "Error, memory registration() failed\n");
return -1;
}
NOTE: TimestampOracle is a class with two int members, so its size is 8 bytes (satisfies 64-bit condition for RDMA ATOMIC operations)
Thank you for your helps,
Erfan
Hi Erfan.
I have some questions:
1) Did you check that the RDMA device supports Atomic?
2) Did you check that the remote address is 8 byte aligned?
3) Did you enable atomic at the responder QP?
4) Is this is an RC QP?
I hope that one of the above questions gave you a hint on the problem.
If not, I'll need to see more source code and information on the RDMA devices that you are using.
Thanks
Dotan
Hello Dotan,
Thank you for your response. I'll try to address your questions as far as my understanding
1) How can I check that? Do you mean that some RDMA devices support Atomic and some don't?
2) I simplified the code, so now the remote address is one (long long) variable, which is 8 bytes (I paste the code at the end of this comment).
3) As you can see in my previous comment, on the server side code, I registered the memory region to be accessible atomically by ibv_reg_mr(pd, ..., ...,
IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC). Do I need to do anything other than that?
4) When initializing the queue pairs on both client and server, I used qp_attr->qp_type = IBV_QPT_RC.
Here's the simplified code; I tried to leave the unrelated parts out. I know how annoying it can be to read somebody else's lousy code. I'd really appreciate your help.
******** client code **********
void build_qp_attr(struct ibv_qp_init_attr *qp_attr){
memset(qp_attr, 0, sizeof(*qp_attr));
qp_attr->send_cq = s_ctx->cq;
qp_attr->recv_cq = s_ctx->cq;
qp_attr->qp_type = IBV_QPT_RC;
qp_attr->cap.max_send_wr = 10;
qp_attr->cap.max_recv_wr = 10;
qp_attr->cap.max_send_sge = 1;
qp_attr->cap.max_recv_sge = 1;
}
void register_memory(struct connection *conn) {
local_buffer = new long long[1];
local_mr = ibv_reg_mr(pd, local_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE);
}
void on_completion(struct ibv_wc *wc){
struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;
// Assume that the client already knows about the remote_mr on the server side
if (wc->opcode & IBV_WC_RECV) {
struct ibv_send_wr wr, *bad_wr = NULL;
struct ibv_sge sge;
memset(&sge, 0, sizeof(sge));
sge.addr = (uintptr_t)local_buffer;
sge.length = sizeof(long long);
sge.lkey = local_mr->lkey;
memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = (uintptr_t)remote_mr.addr;
wr.wr.atomic.rkey = remote_mr.rkey;
wr.wr.atomic.compare_add = 1ULL;
if (ibv_post_send(qp, &wr, &bad_wr)) {
fprintf(stderr, "Error, ibv_post_send() failed\n");
die();
}
}
}
***** End of client code ********
**** Server code ******
struct connection {
struct rdma_cm_id *id;
struct ibv_qp *qp;
struct ibv_mr *mr;
long long *rdma_buffer;
};
void build_qp_attr(struct ibv_qp_init_attr *qp_attr) {
memset(qp_attr, 0, sizeof(*qp_attr));
qp_attr->send_cq = s_ctx->cq;
qp_attr->recv_cq = s_ctx->cq;
qp_attr->qp_type = IBV_QPT_RC;
qp_attr->cap.max_send_wr = 10;
qp_attr->cap.max_recv_wr = 10;
qp_attr->cap.max_send_sge = 1;
qp_attr->cap.max_recv_sge = 1;
}
void register_memory(struct connection *conn){
rdma_region = 1ULL;
rm = ibv_reg_mr(pd, rdma_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC);
}
***** End of Server code *******
1) In struct ibv_device_attr, there is an attribute called 'atomic_cap'.
This describes the atomic support level of this device,
since there may be devices that don't support atomic operations.
For more information, please read the post on ibv_query_device().
(Can you tell me what its value is?)
2) Please check that the remote address value is 8-byte aligned
(Can you tell me what its value is?)
3) When calling ibv_modify_qp(), there is an attribute in struct ibv_qp_attr called 'qp_access_flags';
did you enable IBV_ACCESS_REMOTE_ATOMIC on the receiver side?
For more information, please read the post on ibv_modify_qp(). A short sketch of checks 1 and 3 appears below.
4) Only RC QPs support Atomic, and I see that you are using one.
And it's o.k., I don't mind reading other people's code :)
(I'm doing it all the time).
The code looks fine, aside from my comments above.
If you can send me by email (dotan at rdmamojo dot com):
1) The full source code
2) The parameters of your program
3) An execution example and the output of your program
4) The output of 'ibv_devinfo -v'
I'll be able to help you further
(there is a limit to what I can do with only a description...)
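In the meantime, here is a rough sketch of checks 1 and 3 (just an illustration; the context/QP variables, port number and access flags below are placeholders, not taken from your code):
#include <stdio.h>
#include <string.h>
#include <infiniband/verbs.h>
/* Sketch only: verify that the device supports atomic operations and enable
 * remote atomic access on the responder QP (assumes the QP is in the RESET
 * state and that port 1 is used) */
static int enable_remote_atomic(struct ibv_context *ctx, struct ibv_qp *qp)
{
        struct ibv_device_attr dev_attr;
        struct ibv_qp_attr attr;
        if (ibv_query_device(ctx, &dev_attr))
                return -1;
        if (dev_attr.atomic_cap == IBV_ATOMIC_NONE) {
                fprintf(stderr, "Device doesn't support atomic operations\n");
                return -1;
        }
        memset(&attr, 0, sizeof(attr));
        attr.qp_state        = IBV_QPS_INIT;
        attr.pkey_index      = 0;
        attr.port_num        = 1;
        attr.qp_access_flags = IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE |
                               IBV_ACCESS_REMOTE_ATOMIC;
        /* RESET -> INIT transition with remote atomic access enabled */
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT |
                             IBV_QP_ACCESS_FLAGS);
}
If librdmacm sets up the connection for you, you can verify the resulting qp_access_flags with ibv_query_qp().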
Thanks
Dotan
Hello Dotan,
I'm trying to speed up ibv_post_send when sending inline messages by using unsignaled completions. The problem is that it doesn't work if I post more than "qp_init_attr.cap.max_send_wr" unsignaled send requests. I tried to post one signaled request every N unsignaled ones, but it still crashes after max_send_wr. What am I doing wrong?
Hi Jaume.
The flow that you've described sounds valid. What do you mean by "still crashes"?
(Since I don't expect to get a crash in this flow, unless there is a bug.)
Did you provide a valid bad_wr pointer to the ibv_post_send() verb?
Thanks
Dotan
By crashing, I meant that ibv_post_send fails. I do not want to spend time reading the completions, so I send an "unsignaled" message. However, it seems that the unsignaled does not work because send fails once the CQ gets filled up. The QP is created with "qp_init_attr.sq_sig_all = 0;" and messages sent without the IBV_SEND_INLINE flag.
"Unsignaled Work Requests" mean that those Send Requests won't generate Work Completions.
However, they are still consider outstanding. Which means that you need to empty the Send Queue
by sending signaled Send Requests from time to time
(otherwise, the Send Queue will be full, and you won't be able to post any new Send Requests).
The IBV_SEND_INLINE isn't relevant to the signalling of the Send Requests.
Bottom line, from time to time, you must post signaled Send Requests
(if the Send Queue size is N, you can post signaled Send Requests every N messages,
and by polling its Work Completion, you'll empty the Send Queue).
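As a rough sketch, such a loop could look like this (qp/cq/wr are placeholders and error handling is minimal):
#include <infiniband/verbs.h>
/* Sketch only: post Send Request number 'i'; every 'n'-th one is signaled and
 * its Work Completion is polled, which frees the Send Queue entries of the
 * unsignaled Send Requests posted before it ('n' must not exceed the Send
 * Queue size) */
static int post_selective_signaling(struct ibv_qp *qp, struct ibv_cq *cq,
                                    struct ibv_send_wr *wr, int i, int n)
{
        struct ibv_send_wr *bad_wr;
        struct ibv_wc wc;
        int ret;
        wr->send_flags = (i % n == n - 1) ? IBV_SEND_SIGNALED : 0;
        ret = ibv_post_send(qp, wr, &bad_wr);
        if (ret)
                return ret;
        if (wr->send_flags & IBV_SEND_SIGNALED) {
                /* busy-poll the Work Completion of the signaled Send Request */
                do {
                        ret = ibv_poll_cq(cq, 1, &wc);
                } while (ret == 0);
                if (ret < 0 || wc.status != IBV_WC_SUCCESS)
                        return -1;
        }
        return 0;
}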
Thanks
Dotan
Jaume, note that you have to process completions in the completion queue.
Hi Dotan,
I am trying to post a send request to a queue that is already full, and I am getting an error (ENOMEM). So I sleep for some time and post the same request again, but it throws the same error. (Consider that after the sleep time the send queue is not full.)
Hi Govind.
Did you poll some Work Completions (which were posted to that Work Queue) from the associated CQ during this time?
Thanks
Dotan
Yes, I did, and I am getting an error there also... Currently I solved this issue by checking the number of pending requests in the send queue (using the idea you mentioned in one of the comments) before posting any request, and it is working, but I don't want to do that because of the performance cost. One more thing: how should I increase the maximum limit of pending requests in the queue? Thanks for all the help and suggestions, I really appreciate it.
I'm glad that I can help.
:)
Which error do you get?
Can you share the source code?
It will be easier for me to help you with the source in front of me.
Thanks
Dotan
Hi Dotan,
I can't share the code (confidentiality issues), but I can tell you the error numbers: the first error I am getting has error number 12, and then error number 5 for all the other messages during polling of the CQ. Can you please tell me how to increase the maximum limit of pending requests in the queue? Currently I am able to post ~8192 requests.
Hi Govind.
When calling ibv_create_qp(), you control the Send Queue size (please refer to the post on this verb for more information).
I suspect that you have completions with error (i.e. the 5 and 12 errors that you reported).
Am I right? (Are those the status values of the Work Completions that you polled?)
If this is the case, completion status 12 = IBV_WC_RETRY_EXC_ERR, which means that the remote side didn't answer within the expected time.
Thanks
Dotan
Hi Dotan,
First of all, thanks for all your help. Finally my code is working, and currently I am getting 3 times better performance with RDMA compared to UDP. I have a few more questions: how much improvement (max) can we expect with RDMA compared to UDP? Currently I am using only channel semantics; is there a good chance to improve if I also use memory semantics?
Hi Govind.
I'm happy that I can help
:)
1) Performance is a very big area. Which metrics do you check? What are the current numbers with UDP?
Do you compare using an RC QP or a UD QP? Which operations do you use?
2) What do you mean by channel semantics and memory semantics?
Thanks
Dotan
I am using an RC QP and comparing with the UDP protocol on the basis of the waiting time for the requested data.
By memory semantics I mean that I am not allowing the remote node's channel adapter to write directly to host memory using an rkey (all read/write operations are done by the local channel adapter using an lkey), and the reason for using only channel semantics is that I am transferring a very small amount of data at a time.
So, I guess that your metric is latency.
I suggest that you execute a tool that comes with the OFED package called ib_send_lat,
which will provide the (best) latency that you can achieve using SEND operations in your setup.
The performance depends on so many factors, so I prefer not to provide a number.
Thanks
Dotan
hi Dotan
(at the Target side) When I'm doing an RDMA READ with 4 WRs, each WR having 1 SGE (4KB), the initiator will easily crash or the /dev/sdxx will disappear. (While doing RDMA WRITE is fine.)
I've set the WRs' rkey and increased remote_addr by 4096; any suggestions?
Thanks
Zhang Yue
ps:
for (k = 1; k < task->cache_req.sglist_size; k++)
{
multi_wr[k] = rdmad->send_wr; // copy struct
multi_wr[k].next = &multi_wr[k+1];
multi_wr[k].sg_list = &task->rdma_sge[k];
multi_wr[k].send_flags = 0; //zy: should be 0, otherwise the task will be freed multiple times
multi_wr[k].wr.rdma.remote_addr += (4096 * k);
task->rdma_sge[k].addr = tgt_phy2virt(task->cache_req.sglist[k].addr);
task->rdma_sge[k].length = task->cache_req.sglist[k].len;
task->rdma_sge[k].lkey = get_cache_buf_lkey(task->conn->dev, task->cache_req.sglist[k].addr);
}
// insert to list
multi_wr[k-1].next = rdmad->send_wr.next;
rdmad->send_wr.next = &multi_wr[1];
task->task_multi_wr = multi_wr;
//this sge.length mark the total length, will be use at iser_rdma_rd_comp_complete_handler
rdmad->sge.length = task->rdma_rd_sz;
// so we need to place the first wr's sge to other place
rdmad->send_wr.sg_list = task->rdma_sge;
task->rdma_sge[0].addr = tgt_phy2virt(task->cache_req.sglist[0].addr);
task->rdma_sge[0].length = task->cache_req.sglist[0].len;
task->rdma_sge[0].lkey = get_cache_buf_lkey(task->conn->dev, task->cache_req.sglist[0].addr);
Hi.
I don't know if this is related to RDMA.
I would suggest checking whether the local buffer that is being filled
is still allocated or was already freed.
Maybe you should print the local address and check if the values make any sense.
Please check, before using the values, that the Work Completion status is o.k.
Thanks
Dotan
Hi Dotan
First of all, Merry Christmas to all of us!
Yes, this issue is NOT related to RDMA.
Yesterday, I printed every WR before calling ibv_post_send(), and found an issue:
After doing a lot of 16KB writes, tgt may receive an INQUIRY, and if the INQUIRY unluckily uses a task struct that was previously used by a 16KB write (or read),
it will use the old four 4KB buffers and DMA them to the initiator. INQUIRY only reads 70 bytes; DMAing 16KB to it will corrupt the initiator's memory.
The main fix is: check the needed DMA length; if <= 0, skip the remaining buffers.
Thanks
Zhang Yue
Hi.
Merry Christmas indeed
:)
I'm happy that you found the problem.
Dotan
Hi Dotan
I am trying to use the IBV_WR_ATOMIC_CMP_AND_SWP operation and I get an error like this when I poll the WC: IBV_WC_REM_ACCESS_ERR.
I just made some simple modifications based on the code provided in the book "RDMA_Aware_Programing_user_manual"; do you know what the problem is?
Hi.
Please check that IBV_ACCESS_REMOTE_ATOMIC is enabled in the remote memory buffer and in the remote QP.
Thanks
Dotan
Hi Dotan,
I want to post a request, but I want the remote QP to discard this request as soon as it receives it. This is because I want to send a dummy packet when the QPs are in the REARM state in order to reach the ARMED state (an incoming packet is needed for this transition).
I am using the below configuration and it seems to be working, but I would like to know if you think that this could be a generic approach for any situation or not:
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;
memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = NULL;
wr.num_sge = 0;
wr.opcode = 0;
wr.send_flags = 0;
if (ibv_post_send(ctx->id[num_qp]->qp, &wr, &bad_wr)) {
fprintf(stderr, "Error, ibv_post_send() failed\n");
return -1;
}
Best regards,
Jesus Camacho
Hi Jesus.
You are sending a "standard" zero-byte message. This can work, but you consume a Receive Request on the remote side.
Did you consider sending a zero-byte RDMA Write?
Thanks
Dotan
Hi Dotan,
I am currently using opcode 0 (which is the IBV_WR_RDMA_WRITE operation) and it is working fine with the InfiniBand microbenchmarks.
Is that what you are suggesting?
If so, do you think this can be extrapolated to any scenario?
Thanks for your time,
Jesus
Hi.
Yes, this was my suggestion.
What do you mean by "do you think this can be extrapolated to any scenario"?
Thanks
Dotan
Hi,
I mean if this is a general solution.
Do you think that this is going to work when using another benchmarks, applications, etc.?
Best,
Jesus
Hi.
Yes. Using zero-byte messages is valid and can always be used.
Working with such messages with the RDMA Write opcode can provide better performance than the Send opcode; a short sketch appears below.
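For example, a minimal sketch of such a zero-byte RDMA Write (the remote address and rkey parameters are placeholders for the advertised attributes of any valid remote MR):
#include <string.h>
#include <infiniband/verbs.h>
/* Sketch only: post a zero-byte RDMA Write; it doesn't consume a Receive
 * Request on the remote side and carries no payload */
static int post_zero_byte_write(struct ibv_qp *qp, uint64_t remote_addr,
                                uint32_t rkey)
{
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 0;
        wr.sg_list             = NULL;   /* no scatter/gather entries */
        wr.num_sge             = 0;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;
        return ibv_post_send(qp, &wr, &bad_wr);
}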
Thanks
Dotan
Hi,
good to know!
Thanks for your help :-)
Jesus
Sure
:)
Dotan
Hello Dotan,
I have a quick question. What happens if the local node calls ibv_post_send() with opcode ibv_wr_send before the remote node calls ibv_post_recv()?
Thanks!
Hi John, the answer won't be quick though
;)
The thing that matters is not when the sides posted the Send/Receive Requests in absolute time,
since one may not know when the actual scheduling of the Send Request will take place...
If a message that consumes a Receive Request is received by a Queue Pair when there isn't any available Receive Request in that Queue,
an RNR (Receiver Not Ready) flow will start for Reliable QPs. For Unreliable QPs, the incoming message will be (silently) dropped.
Thanks
Dotan
Hello Dotan,
Thanks for the quick reply!
I am using a Reliable QP, so I think I will get the RNR errors. Now I have a couple of choices: (a) when getting an RNR error, back off and re-post the send request later; (b) implement a flow-control protocol so that the local node posts send requests only when the remote node is ready. I like (b) more than (a), but (b) adds complexity and needs to take care of cases such as both nodes waiting for the other side to become ready. :-)
So I am wondering if there is a common practice.
Thanks!
Sure :)
In RNR flows, the problem is that the receiver side doesn't post Receive Requests fast enough...
About your suggestions:
a) When you have an RNR error, your local QP is in the ERROR state, so you can't post another Send Request without reconnecting it with the remote QP.
b) is a good idea
There are more options:
* You can increase the RNR timeout
* You can increase the RNR retry count (the value 7 means infinite retries)
* If you have several QPs at the receiver side, you can use an SRQ and make sure that the SRQ is never empty
(the SRQ LIMIT mechanism can help you detect if the number of Receive Requests dropped below a specific watermark)
Adding flow control to your messages is always a good idea in order not to enter the RNR flow in the first place; the first two options map to ibv_modify_qp() attributes, as sketched below.
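A rough sketch of those attributes (the values are placeholders, and the other mandatory RTR/RTS attributes are omitted):
#include <string.h>
#include <infiniband/verbs.h>
/* Sketch only: the RNR-related QP attributes.
 * min_rnr_timer is set on the responder QP during the INIT -> RTR transition;
 * rnr_retry is set on the requester QP during the RTR -> RTS transition.
 * The other mandatory RTR/RTS attributes (path, PSNs, timeout, retry_cnt, ...)
 * must be filled in as usual. */
static int set_rnr_attributes(struct ibv_qp *qp)
{
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTR;
        attr.min_rnr_timer = 12;            /* encoded value; 12 means 0.64 ms */
        /* ... the rest of the RTR attributes ... */
        if (ibv_modify_qp(qp, &attr,
                          IBV_QP_STATE | IBV_QP_MIN_RNR_TIMER /* | other flags */))
                return -1;
        memset(&attr, 0, sizeof(attr));
        attr.qp_state  = IBV_QPS_RTS;
        attr.rnr_retry = 7;                 /* 7 means infinite RNR retries */
        /* ... the rest of the RTS attributes ... */
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_RNR_RETRY /* | other flags */);
}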
Thanks
Dotan
Hi Dotan,
I have a few questions related to the connection of an RC Queue Pair.
1. If ibv_post_send fails, then we consider the connection lost
-> considering all the fields in the message are correct and the send queue is not full. Is the reverse also true: if we are able to post, does it mean there is a working connection between the nodes?
2. Is it possible to receive a send WC with some error if there is an active, working connection between the nodes, assuming the message was correct and the receiver also posted a recv request (no RNR error)?
3. If we post send requests beyond the max limit of the send queue, will it corrupt the queue pair so that no further posting is allowed? If not, can we post the same request again without any change?
Hi.
1. Failure of ibv_post_send() means that one of the Send Requests is invalid or the Send Queue is full;
it doesn't mean that the connection is closed. In that case no new Send Request was added to the Send Queue.
You can post Send Requests to a Queue Pair which was configured with bad remote attributes
("bad" meaning not the attributes that should have been configured...), i.e. no connection.
2. In general, no; but this question is tricky...
Which completion status did you get?
3. If you posted Send Requests beyond the maximum limit and all of them are unsignaled - you have a problem.
The Queue Pair isn't corrupted, but you can't post any more Send Requests to it:
the status of the outstanding Send Requests is undetermined for the sender side.
The Receive side of this Queue Pair is still fully operational.
You must recover it by moving it to the Error/Reset state and reconnecting the Queue Pairs; a short sketch appears below.
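A minimal sketch of recycling such a QP (the reconnection itself, i.e. the INIT/RTR/RTS transitions or the rdma_cm reconnect, is not shown):
#include <string.h>
#include <infiniband/verbs.h>
/* Sketch only: move a stuck QP to the ERROR state (to flush the outstanding
 * Work Requests) and then to RESET; after that it must be reconnected */
static int recycle_qp(struct ibv_qp *qp)
{
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.qp_state = IBV_QPS_ERR;
        if (ibv_modify_qp(qp, &attr, IBV_QP_STATE))
                return -1;
        /* (the flushed Work Completions, if any, can be polled here) */
        memset(&attr, 0, sizeof(attr));
        attr.qp_state = IBV_QPS_RESET;
        return ibv_modify_qp(qp, &attr, IBV_QP_STATE);
}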
I hope that I helped you
Dotan
Hi Dotan:
Nice to meet you. I'm from China. My English is not very good. Recently I have learned something about RDMA, but I met a problem.
This is my test program:
server code :
/*
* Copyright (C) fuchencong@163.com
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>
#define VERB_ERR(verb, ret) \
fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)
#define MB 1024 * 1024
/* Default parameter values */
#define DEFAULT_PORT "51216"
#define DEFAULT_MSG_COUNT 100
#define DEFAULT_MSG_LENGTH MB
/* Resources used in the example */
struct context
{
char *server_name;
char *server_port;
unsigned int msg_count;
unsigned int msg_length;
/* Resources */
struct rdma_cm_id *id;
struct rdma_cm_id *listen_id;
struct ibv_mr *recv_mr;
char *recv_buf;
};
int
reg_mem(struct context *ctx)
{
ctx->recv_buf = (char *) malloc(ctx->msg_length);
memset(ctx->recv_buf, 0x00, ctx->msg_length);
ctx->recv_mr = rdma_reg_msgs(ctx->id, ctx->recv_buf, ctx->msg_length);
if (!ctx->recv_mr) {
VERB_ERR("rdma_reg_msgs", -1);
return -1;
}
return 0;
}
int
getaddrinfo_and_create_ep(struct context *ctx)
{
int ret;
struct rdma_addrinfo *rai, hints;
struct ibv_qp_init_attr qp_init_attr;
memset(&hints, 0, sizeof (hints));
hints.ai_port_space = RDMA_PS_TCP;
hints.ai_flags = RAI_PASSIVE; /* this makes it a server */
printf("rdma_getaddrinfo\n");
ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
if (ret) {
VERB_ERR("rdma_getaddrinfo", ret);
return ret;
}
memset(&qp_init_attr, 0, sizeof (qp_init_attr));
qp_init_attr.cap.max_send_wr = 1;
qp_init_attr.cap.max_recv_wr = 1;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;
printf("rdma_create_ep\n");
ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
if (ret) {
VERB_ERR("rdma_create_ep", ret);
return ret;
}
rdma_freeaddrinfo(rai);
return 0;
}
int
get_connect_request(struct context *ctx)
{
int ret;
printf("rdma_listen\n");
ret = rdma_listen(ctx->id, 4);
if (ret) {
VERB_ERR("rdma_listen", ret);
return ret;
}
ctx->listen_id = ctx->id;
printf("rdma_get_request\n");
ret = rdma_get_request(ctx->listen_id, &ctx->id);
if (ret) {
VERB_ERR("rdma_get_request", ret);
return ret;
}
if (ctx->id->event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
printf("unexpected event: %s", \
rdma_event_str(ctx->id->event->event));
return ret;
}
return 0;
}
int
establish_connection(struct context *ctx)
{
int ret;
struct rdma_conn_param conn_param;
/* post a receive to catch the first send */
ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length,
ctx->recv_mr);
if (ret) {
VERB_ERR("rdma_post_recv", ret);
return ret;
}
memset(&conn_param, 0, sizeof (conn_param));
conn_param.responder_resources = 2;
conn_param.initiator_depth = 2;
conn_param.retry_count = 5;
conn_param.rnr_retry_count = 5;
printf("rdma_accept\n");
ret = rdma_accept(ctx->id, &conn_param);
if (ret) {
VERB_ERR("rdma_accept", ret);
return ret;
}
return 0;
}
int
recv_msg(struct context *ctx)
{
int ret;
struct ibv_wc wc;
ret = rdma_get_recv_comp(ctx->id, &wc);
if (ret < 0) {
VERB_ERR("rdma_get_recv_comp", ret);
return ret;
}
ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length,
ctx->recv_mr);
if (ret) {
VERB_ERR("rdma_post_recv", ret);
return ret;
}
return 0;
}
int
main(int argc, char** argv)
{
int ret, op, i, recv_cnt;
struct context ctx;
struct ibv_qp_attr qp_attr;
memset(&ctx, 0, sizeof (ctx));
memset(&qp_attr, 0, sizeof (qp_attr));
ctx.server_port = DEFAULT_PORT;
ctx.msg_count = DEFAULT_MSG_COUNT;
ctx.msg_length = DEFAULT_MSG_LENGTH;
while ((op = getopt(argc, argv, "a:p:c:l:")) != -1) {
switch (op) {
case 'a':
ctx.server_name = optarg;
break;
case 'p':
ctx.server_port = optarg;
break;
case 'c':
ctx.msg_count = atoi(optarg);
break;
case 'l':
ctx.msg_length = atoi(optarg) * MB;
break;
default:
printf("usage: %s [-s or -a required]\n", argv[0]);
printf("\t[-a ip_address]\n");
printf("\t[-p port_number]\n");
printf("\t[-c msg_count]\n");
printf("\t[-l msg_length]\n");
exit(1);
}
}
printf("address: %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
printf("port: %s\n", ctx.server_port);
printf("count: %d\n", ctx.msg_count);
printf("length: %d bytes\n", ctx.msg_length);
printf("\n");
ret = getaddrinfo_and_create_ep(&ctx);
if (ret) {
goto out;
}
ret = get_connect_request(&ctx);
if (ret) {
goto out;
}
ret = reg_mem(&ctx);
if (ret) {
goto out;
}
ret = establish_connection(&ctx);
recv_cnt = 0;
for (i = 0; i < ctx.msg_count; i++) {
if (recv_msg(&ctx)) {
break;
}
++recv_cnt;
}
printf("recv %d messages, each message is %d bytes\n", \
recv_cnt, ctx.msg_length);
rdma_disconnect(ctx.id);
out:
if (ctx.recv_mr) {
rdma_dereg_mr(ctx.recv_mr);
}
if (ctx.id) {
rdma_destroy_ep(ctx.id);
}
if (ctx.listen_id) {
rdma_destroy_ep(ctx.listen_id);
}
if (ctx.recv_buf) {
free(ctx.recv_buf);
}
return ret;
}
client code:
/*
* Copyright (C) fuchencong@163.com
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <rdma/rdma_cma.h>
#include <rdma/rdma_verbs.h>
#define VERB_ERR(verb, ret) \
fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)
#define MB 1024 * 1024
/* Default parameter values */
#define DEFAULT_PORT "51216"
#define DEFAULT_MSG_COUNT 100
#define DEFAULT_MSG_LENGTH MB
#define DEFAULT_MSEC_DELAY 500
/* Resources used in the example */
struct context
{
char *server_name;
char *server_port;
unsigned int msg_count;
unsigned int msg_length;
/* Resources */
struct rdma_cm_id *id;
struct ibv_mr *send_mr;
char *send_buf;
};
int
reg_mem(struct context *ctx)
{
ctx->send_buf = (char *) malloc(ctx->msg_length);
memset(ctx->send_buf, 'a', ctx->msg_length);
ctx->send_mr = rdma_reg_msgs(ctx->id, ctx->send_buf, ctx->msg_length);
if (!ctx->send_mr) {
VERB_ERR("rdma_reg_msgs", -1);
return -1;
}
return 0;
}
int
getaddrinfo_and_create_ep(struct context *ctx)
{
int ret;
struct rdma_addrinfo *rai, hints;
struct ibv_qp_init_attr qp_init_attr;
memset(&hints, 0, sizeof (hints));
hints.ai_port_space = RDMA_PS_TCP;
printf("rdma_getaddrinfo\n");
ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
if (ret) {
VERB_ERR("rdma_getaddrinfo", ret);
return ret;
}
memset(&qp_init_attr, 0, sizeof (qp_init_attr));
qp_init_attr.cap.max_send_wr = 1;
qp_init_attr.cap.max_recv_wr = 1;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;
printf("rdma_create_ep\n");
ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
if (ret) {
VERB_ERR("rdma_create_ep", ret);
return ret;
}
rdma_freeaddrinfo(rai);
return 0;
}
int
establish_connection(struct context *ctx)
{
int ret;
struct rdma_conn_param conn_param;
memset(&conn_param, 0, sizeof (conn_param));
conn_param.private_data_len = sizeof (int);
conn_param.responder_resources = 2;
conn_param.initiator_depth = 2;
conn_param.retry_count = 5;
conn_param.rnr_retry_count = 5;
printf("rdma_connect\n");
ret = rdma_connect(ctx->id, &conn_param);
if (ret) {
VERB_ERR("rdma_connect", ret);
return ret;
}
if (ctx->id->event->event != RDMA_CM_EVENT_ESTABLISHED) {
printf("unexpected event: %s",
rdma_event_str(ctx->id->event->event));
return -1;
}
return 0;
}
int
send_msg(struct context *ctx)
{
int ret;
struct ibv_wc wc;
ret = rdma_post_send(ctx->id, NULL, ctx->send_buf, ctx->msg_length,
ctx->send_mr, IBV_SEND_SIGNALED);
if (ret) {
VERB_ERR("rdma_send_recv", ret);
return ret;
}
ret = rdma_get_send_comp(ctx->id, &wc);
if (ret < 0) {
VERB_ERR("rdma_get_send_comp", ret);
return ret;
}
return 0;
}
int
main(int argc, char** argv)
{
int ret, op, i, send_cnt;
struct context ctx;
struct ibv_qp_attr qp_attr;
memset(&ctx, 0, sizeof (ctx));
memset(&qp_attr, 0, sizeof (qp_attr));
ctx.server_port = DEFAULT_PORT;
ctx.msg_count = DEFAULT_MSG_COUNT;
ctx.msg_length = DEFAULT_MSG_LENGTH;
while ((op = getopt(argc, argv, "a:p:c:l:")) != -1) {
switch (op) {
case 'a':
ctx.server_name = optarg;
break;
case 'p':
ctx.server_port = optarg;
break;
case 'c':
ctx.msg_count = atoi(optarg);
break;
case 'l':
ctx.msg_length = atoi(optarg) * MB;
break;
default:
printf("usage: %s [-s or -a required]\n", argv[0]);
printf("\t[-a ip_address]\n");
printf("\t[-p port_number]\n");
printf("\t[-c msg_count]\n");
printf("\t[-l msg_length]\n");
exit(1);
}
}
printf("address: %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
printf("port: %s\n", ctx.server_port);
printf("count: %d\n", ctx.msg_count);
printf("length: %d bytes\n", ctx.msg_length);
printf("\n");
if (!ctx.server_name) {
printf("server address must be specified for client\n");
exit(1);
}
ret = getaddrinfo_and_create_ep(&ctx);
if (ret) {
goto out;
}
ret = reg_mem(&ctx);
if (ret) {
goto out;
}
ret = establish_connection(&ctx);
send_cnt = 0;
for (i = 0; i < ctx.msg_count; i++) {
if (send_msg(&ctx)) {
break;
}
++send_cnt;
}
printf("send %d messages, each message is %d bytes\n", \
send_cnt, ctx.msg_length);
rdma_disconnect(ctx.id);
out:
if (ctx.send_mr) {
rdma_dereg_mr(ctx.send_mr);
}
if (ctx.id) {
rdma_destroy_ep(ctx.id);
}
if (ctx.send_buf) {
free(ctx.send_buf);
}
return ret;
}
What I can't understand is that sometimes this program takes 1 minute to send 1GB of data and sometimes it only needs 0.2 seconds. So it's not very stable.
I really don't know why. Can you give me some advice?
Thank you!
Hi.
The code that you sent me is corrupted (a problem with posting code in a comment).
Can you please send it to me?
dotan at rdmamojo dot com
Thanks
Dotan
Hi Dotan,
Thanks for the quick reply! I have sent my code to you by email. Thank you very much.
Hi ChenCong Fu.
As I wrote in the mail, the problem is that the sender Queue Pair enters the Receiver Not Ready (RNR) flow,
which harms the performance, and this is what you sometimes see.
Thanks
Dotan
Hello Dotan,
Thanks a lot for your help.
I have a design question, would you mind taking a look?
I have a client and a server; the client wants to send a lot of data to the server. Instead of using a "send" operation to send the data directly from client to server, the client registers a memory region that includes these data and uses a "send" operation to tell the remote server the virtual address of the data. Once the server receives this request from the client, the server will post an "RDMA Read" operation to read the data directly from the client side.
What's the best way to do it?
Because at the beginning the server needs to receive a so-called "rdma msg" from the client (so the server will know where to read data on the remote side, i.e. the client), we need to put our "RDMA Read" operation inside the "receive completion handler" on the server side: only when the server finishes receiving the "rdma msg" from the client will it know where to read from and start the "read" operation.
Is it OK to put "RDMA Read" operation inside of "receive completion handler"? Do you have any advise for this design?
Thanks a lot for your time!
All the best
Jack
Hi Jack.
I'm glad to help where I can
:)
I would suggest using RDMA Write to send the data instead of RDMA Read,
i.e. the server allocates blocks and advertises their attributes to the client,
and the client initiates the RDMA Write(s).
The last RDMA Write can be with immediate, to let the server know that it was the last message
(or from time to time during the transfer as keep-alive messages, to let the server know how many
messages it should expect).
Did I answer your questions?
Thanks
Dotan
Thanks a lot Dotan!
Thanks a lot Dotan!
I will try to do both write and read.
While I am implementing it, I found a weird situation. I am trying to put the client and server both on the same machine and perform an RDMA Read operation between them. The receiver (reader) can only read half of the data from the sender.
For example, the sender sends a packet to the receiver (containing the address that the reader will read from); assuming there are 100 bytes at that address, the receiver (reader) can only read the first 50 bytes correctly from the sender side (if the sender sends 16 bytes, then only 8 bytes can be read). It's pretty weird, because I have already tested the RDMA send/receive operations and they are fine (in a loopback), which means DMA works OK.
Do you have any idea? I have updated my firmware to the newest one (May 2015); my device is a ConnectX-3. Does it support performing RDMA Read operations in a local loopback?
Thanks a lot!
Jack
Hi Jack.
I would double-check the length of the S/G entries in your Send Requests.
Thanks
Dotan
Hello Dotan,
Thanks for your help. I have checked the S/G entries' lengths, and they are enough for the requests (the entries' lengths are equal to the number of bytes of data).
I don't know what to do.
All the best
Jack
Thanks Dotan, I figured it out. Something wrong in another module...
Great!
As I said, the RDMA device you mentioned works great (I have worked/am working with it personally).
:)
Thanks
Dotan
Hello Dotan,
I want to ask a question.
If we want to send a huge message via post_send that requires more than one work request, we will use a send work request list.
For example, we have a send work request list that contains 2 work requests (sendwr0, sendwr1).
For sendwr0 and sendwr1:
1) do I need to assign them the same work request ID because they basically represent the same message?
2) About the send flag, do I only need to set the signaled flag on the last request (in the case above, sendwr1)?
1) No, you don't *need* to do it, but you *can* do it.
wr_id is an application attribute, for the application to use (or not use).
If your application needs to know that the two Work Completions belong to the same message, you can use it as a hint.
2) You can set the SIGNALED flag on the second Send Request and get one Work Completion if everything goes fine.
The RDMA stack doesn't know (or care) that you used two Send Requests for one application message
(from the RDMA stack's point of view, you have two different messages). A short sketch appears below.
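For example, a rough sketch of two chained Send Requests with only the last one signaled (the two SGEs are placeholders describing registered buffers):
#include <string.h>
#include <infiniband/verbs.h>
/* Sketch only: one application message sent as two chained Send Requests;
 * only the second one is signaled, so a single Work Completion covers both */
static int post_two_part_message(struct ibv_qp *qp, struct ibv_sge *sge0,
                                 struct ibv_sge *sge1)
{
        struct ibv_send_wr wr[2], *bad_wr = NULL;
        memset(wr, 0, sizeof(wr));
        wr[0].wr_id      = 0x1;              /* application-private value */
        wr[0].next       = &wr[1];
        wr[0].sg_list    = sge0;
        wr[0].num_sge    = 1;
        wr[0].opcode     = IBV_WR_SEND;
        wr[0].send_flags = 0;                /* unsignaled */
        wr[1].wr_id      = 0x1;              /* the same wr_id may be reused, or not */
        wr[1].next       = NULL;
        wr[1].sg_list    = sge1;
        wr[1].num_sge    = 1;
        wr[1].opcode     = IBV_WR_SEND;
        wr[1].send_flags = IBV_SEND_SIGNALED;
        return ibv_post_send(qp, wr, &bad_wr);
}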
Thanks
Dotan
Thanks a lot Dotan, that's helpful!
Hello Dotan,
I would like to confirm if my understanding about FRWR is correct.
If we have a sender and a receiver (reader), before they can start, the sender needs to call "post_send()" twice, right? The first "post_send" registers the memory (FRWR) to the NIC, and the second one actually transfers the virtual addresses of these FRWR memory regions.
How many "post_send" calls should the receiver (reader) do? Maybe three?
1) "post_send" an FRWR registration for the memory that will store the incoming data
2) "post_send" to actually read the data
3) "post_send" to tell the remote side (sender) to invalidate the memory region (when the receiver finishes reading)
Is that correct?
And how are we supposed to know how many FRWR read operations can be performed concurrently before we invalidate the first FRWR? Using query device, I could not find this information; would you mind giving me a hand?
All the best
Jack
Hi Jack.
I don't have any experience with FRWR operations, but let me try to help you anyway.
I assume that you are using RDMA Read (although you didn't write it...); this is the reason for the second post send.
According to your scenario (using RDMA Read), yes - three post_sends are needed.
I don't really understand what you mean by:
"...how many FRWR read operations can be performed concurrently before we invalidate the first FRWR".
Can you please explain it?
Thanks
Dotan
Thanks a lot Dotan!
"...how many FRWR read operations can be performed currently before we invalidate the first FRWR".
Because for FRWR(at least from my understanding), we registered a memory region and then we use it and then we invalidate it.
So for increasing performance, the receiver(reader) may perform a couple of Read operations currently, so receiver(reader) will need to invalidate that specific FMR when it's done, so my question was actually about how many Read operations we can perform, so I think it should depend on my system.
Do you know where I can find more info about FRWR? I tried to search online, but I could not find too much info.
Yes. It is your decision when to invalidate this Memory Region.
AFAIK, the InfiniBand specification is the only place where you can get information on FRWR.
Thanks
Dotan
Hello Dotan,
If I have a huge amount of data (divided into multiple chunks) to send out, there are two possible ways of doing it.
The first one is using one work request (but it needs extra CPU time to do a memory copy).
The second one is using multiple RDMA work requests (no extra CPU time for a memory copy, but multiple work requests need to be posted).
Which one is better?
All the best
Jingyi
Hi Jingyi.
You can use one Send Request with a scatter/gather list;
this way you'll eliminate the need to perform a memory copy and still send the message from multiple buffers (a short sketch appears below).
If not, the best solution depends on the total message size:
* If it is small (~ < 1KB), I think that the first one is the best.
* If the total message size is big, the second approach will give you the best performance. I suggest using selective signaling and creating a Work Completion only for the last Send Request.
Anyway, if performance is highly critical, the best way is to implement both approaches and measure the results (you develop once and use many times ...)
I hope that this helped you.
Thanks
Dotan
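For reference, a rough sketch of the single-Send-Request approach with a scatter/gather list (the two buffers, lengths and lkeys are placeholders; the QP must have been created with max_send_sge >= 2):
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>
/* Sketch only: send one message gathered from two separately registered
 * buffers, without copying them into one contiguous block */
static int post_gathered_send(struct ibv_qp *qp,
                              void *buf0, uint32_t len0, uint32_t lkey0,
                              void *buf1, uint32_t len1, uint32_t lkey1)
{
        struct ibv_sge sge[2];
        struct ibv_send_wr wr, *bad_wr = NULL;
        sge[0].addr   = (uintptr_t)buf0;
        sge[0].length = len0;
        sge[0].lkey   = lkey0;
        sge[1].addr   = (uintptr_t)buf1;
        sge[1].length = len1;
        sge[1].lkey   = lkey1;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 0;
        wr.sg_list    = sge;
        wr.num_sge    = 2;       /* requires max_send_sge >= 2 on this QP */
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;
        return ibv_post_send(qp, &wr, &bad_wr);
}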
Hello Dotan,
Thanks a lot for your reply!
I have an idea; I am not sure if it's possible.
Suppose the sender has 10 chunks of data that need to be sent to the remote side (still the send/recv model).
The normal way to do it is that the sender sends the vaddr to the receiver and then the receiver reads the data from the sender, or the receiver sends its vaddr to the sender and then the sender writes to the receiver.
I was thinking whether it's possible to perform read and write operations at the same time.
Back to our assumption: for the 1st chunk the receiver (reader) reads from the sender, and at the same time the sender writes the 2nd chunk to the receiver (reader); for the rest of the chunks we do something similar. So we can improve the speed by keeping both sides busy, right?
Is the above approach possible? If so, I believe the challenge we will have is the ordering issue: how can we make sure that the chunks are delivered in order? Is there any good way to do it?
All the best
Jack
Hi Jack.
Yes, RDMA Reads and Writes can happen at the same time
(obviously they are initiated by both sides).
I'm not really sure how much improvement it will give compared to the complexity
(maybe you would want to work with several QPs in parallel).
Anyway, back to your idea:
What is the meaning of order?
Each QP can place the data in a different (predefined) location:
in a Write, you specify the remote location that the data will be written to;
in a Read, you specify the local location that the data will be written to.
So, in the end all the chunks can be placed in one contiguous block.
Thanks
Dotan
You only need to
Hello Dotan,
Thanks for your time!
When I am doing an RDMA Write operation, I noticed a very interesting problem.
After we successfully post the write work request and poll the corresponding WC, wc.byte_len is not the valid number of bytes that we have written. In an RDMA Read operation, wc.byte_len is the number of bytes we read from the remote side, but in a write operation we can't rely on it. I took a look at the driver: wc.byte_len isn't updated for a write operation (if opcode = rdma write), but it is updated for a read operation.
I also checked the InfiniBand specification; in the RDMA Write section it says we can depend on DMALen, but the weird thing is that it doesn't say anything about wc.byte_len.
Why is wc.byte_len updated for a read operation but not for a write?
All the best
Jack
Hi Jack.
I *think* (since I'm not one of the IB spec authors) that this is because, if you are the Requestor side of an RDMA Write or Send, you already know how much data you sent. If needed, you can maintain local information associated with the Send Requests, and hold a pointer to it in the wr_id.
Thanks
Dotan
Thanks Dotan!
Actually there's another confusing thing in the driver. If we post_send(wr), in the failure case it seems that we still can't rely on wc.opcode, because the driver doesn't update it. Is there any design reason
why the driver doesn't need to update wc.opcode in the failure case?
All the best
Jack
Hi Jack.
This is by design. Look at the post on ibv_poll_cq() for more details on valid attributes when Work Completion has an error.
Thanks
Dotan
Thanks for all the great info!
I didn't realize the IB verbs layer itself needs completion events created by the application layer, until I saw your response to Igor R. When I first saw the description of the deadlock when the WQ is filled with non-signaled operations, I thought you were referring to the application-layer SW needing completion events to keep a count of outstanding operations to make sure the WQ is never filled.
Do you know why IB verbs pushes WR flow control back into the application layer by going into the error state when the WQ fills, instead of returning EAGAIN or EWOULDBLOCK like send(), recv(), read() or write() for non-blocking I/O to a busy device?
Hi Mark.
There isn't any problem if the Send Queue is full of Send Requests as long as one of them is Signaled (i.e. will generate a Work Completion).
The problem only exists if all the posted Send Requests are non-signaled.
Letting the low-level driver or the HW do the book-keeping of which Send Requests are signaled and which aren't would decrease performance, since before any Send Request is posted, the low-level driver would need to check whether there is a potential problem.
The application knows what it is doing, and can easily avoid getting into this pitfall.
Thanks
Dotan
Hi, Dotan.
I have a question about parallel RDMA READs. Since RDMA is an async model, before one RDMA READ finishes we can launch another, so there can be a lot of unfinished RDMA READs at a time, and the number of these RDMA READ operations may exceed the initiator_depth and responder resources. What will happen when they are exceeded? Will the NIC launch the RDMA READ as usual, or will it wait until the number of unfinished RDMA READs no longer exceeds the limit?
I use this parallel RDMA READ model in a cluster; when I do not limit the parallel number, I fail with IBV_WC_RETRY_EXC_ERR, but when I limit the number of parallel RDMA READs, I succeed.
Is there any limit on parallel RDMA READs, or should we avoid this? Thanks!
Hi.
Per QP, there are attributes for the number of RDMA Read + Atomic messages that can be processed in parallel.
If wrong values are used (for example, the initiator is configured to send more READs than the destination can accept),
there will be a retry flow and the initiator side may get a completion with a RETRY EXCEEDED error (as you have seen).
The following attributes in the device capabilities are relevant to this operation:
* max_qp_rd_atom
* max_qp_init_rd_atom
These are the supported numbers of RDMA Read and Atomic operations per QP (as target and initiator, respectively); the corresponding QP attributes are sketched below.
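As a rough sketch, the per-QP limits are set with ibv_modify_qp() (the values are placeholders, and the other mandatory RTR/RTS attributes are omitted):
#include <string.h>
#include <infiniband/verbs.h>
/* Sketch only: the RDMA Read / Atomic parallelism attributes of an RC QP.
 * max_dest_rd_atomic (responder side, RTR) should not exceed the device's
 * max_qp_rd_atom; max_rd_atomic (requester side, RTS) should not exceed the
 * value the remote side accepted. */
static int set_rd_atomic_limits(struct ibv_qp *qp)
{
        struct ibv_qp_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.qp_state           = IBV_QPS_RTR;
        attr.max_dest_rd_atomic = 4;
        /* ... the rest of the RTR attributes ... */
        if (ibv_modify_qp(qp, &attr,
                          IBV_QP_STATE | IBV_QP_MAX_DEST_RD_ATOMIC /* | ... */))
                return -1;
        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTS;
        attr.max_rd_atomic = 4;
        /* ... the rest of the RTS attributes ... */
        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_MAX_QP_RD_ATOMIC /* | ... */);
}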
Thanks
Dotan
Thanks very much! I ran into such a problem: I use shell/python and rping to compose an RDMA shuffle cluster, that is, every node runs a server-mode process (it uses a thread for every incoming client connection), and there are also N client-mode processes on every node, which set up connections with the other nodes in the cluster. Since rping is an RDMA READ -- ACK -- RDMA WRITE -- ACK procedure, there is only one outstanding RDMA operation at any time, yet I get the IBV_WC_RETRY_EXC_ERR error. In my opinion, there should be no reason for this error to occur.
By the way, when the cluster is just 15 nodes there is no error; errors occur when there are 30 nodes in the cluster.
Can you give some advice on how to deal with this?
Hi.
The problem is that there is one more attribute, 'max_res_rd_atom' - the total number of RDMA Reads and Atomics that this device supports as the target,
and there isn't any sync or protocol (AFAIK) which prevents more RDMA Read / Atomic operations than this value from being targeted at the device.
Thanks
Dotan
Hi Dotan,
I know it is not safe to ibv_post_recv several messages on the same address. But is it safe to ibv_post_send several messages on the same address? If so, is there any performance difference between posting the same and different?
Thanks,
Tingyu
Hi Tingyu.
The problem with posting multiple Receive Requests to the same address is that the content isn't consistent
(i.e. one cannot predict the value of the buffers, since there isn't any guaranteed order between different Work Queues).
Sending multiple messages from the same address doesn't have this problem.
Thanks
Dotan
Hi Dotan,
Thanks for this reply! I understand the data will not be consistent, but I wonder if RDMA allows this type of operation.
So I tested it by posting several receive requests to the same address on the receiver side; it seems the RDMA library threw an error during ibv_poll_cq on the sender side, by setting wc.status to 12. Could you explain why?
Is there any internal mechanism in the RDMA library that prevents reusing the same buffer?
Thanks,
Tingyu
Hi Tingyu.
wc.status 12 means IBV_WC_RETRY_EXC_ERR.
This means that there was a transport error at some point.
Reusing the same buffer is legal in RDMA.
Thanks
Dotan
Hi, Dotan,
does an RC QP guarantee the ordering of RDMA_WRITE WRs? For example, if an "initiator" issues 2 consecutive IBV_WR_RDMA_WRITEs into the same remote memory location, will the "target" always end up with the data from the second operation (i.e., will the second WR always update remote memory after the first one)?
Hi Valentin.
I will be careful here:
* From a network point of view, the first message will reach the destination before the second one.
* The memory will be DMA'ed (by the RDMA device) according to the message ordering.
If the memory controller and cache in the server honor this (as I expect in most architectures),
I guess the answer is "yes".
Thanks
Dotan
Hi Dotan,
Is there any limit on the maximal message size posted using ibv_post_send? Say 16MB, 32MB, 64MB, 128MB? The problem for me is that when I try to post a message larger than 16MB, there is a problem (my code first posts a 16MB receive request using ibv_post_recv, then posts a 16MB send message using ibv_post_send to the other side; the first posted receive buffer is to receive the ack message from the other side). It turns out that the remote side doesn't receive the posted message (the other side also posted a 16MB receive buffer before receiving the message, and the connection between the two has already been established). ibv_poll_cq on the sender side returns a WC with status 12. Do you have any idea about this issue? I don't know how to debug it; could you give me any instructions on how to debug?
Thanks for help!
Tingyu
Hi Tingyu.
The maximal message size can be found in the port properties: max_msg_sz (in general, RDMA supports up to 2GB messages).
Posting bigger messages will end with a completion with error.
A completion with status 12, IBV_WC_RETRY_EXC_ERR, indicates that there is a transport problem.
I suspect that the remote side isn't ready yet, or finished its work and closed all the resources.
Thanks
Dotan
Hi Dotan,
Thanks. I just checked: max_msg_sz was 2GB. To find the transport problem, I used the example "helloworld" code on github
https://github.com/tarickb/the-geek-in-the-corner as
described by http://www.hpcadvisorycouncil.com/pdf/building-an-rdma-capable-application-with-ib-verbs.pdf.
I got the same status 12 when the message size was set to 256MB (messages of smaller sizes worked).
The network I used was QLogic, so is it possible there was something wrong with the hardware or the underlying verbs implementation? Or was there something wrong with the InfiniBand setup? Do you know a way to debug the problem?
Many thanks,
Tingyu
Hi.
I didn't work with QLogic HW, so I don't have any feedback to give you.
I would suggest using the libibverbs examples (I know them and they always work).
Thanks
Dotan
Hello Dotan,
Will work requests be modified after posting them?
In more detail: assuming a list of requests headed by wr is posted by calling ibv_post_send(qp, wr, &bad_wr); will the fields of the requests, including the next pointers, be modified by the library?
Thanks so much!
Jon
Hi Jon.
After a Send Request has been posted, it can be modified by the application.
During the post send call, the low-level library translates the libibverbs Send Request into a HW-specific Send Request and "tells" the RDMA device that new SRs were posted.
Thanks
Dotan
Hi Dotan,
I was wondering what is the behavior of an RDMA read of a remote memory if the remote machine is also writing to it concurrently?
More formally, suppose host A is reading using RDMA read, a variable v which is local to host B. If the value of v before the start of the read operation was 'a', and B is writing to v the value 'b' concurrently with the read operation, what is the return value of read going to be? Is it guaranteed to be either 'a' or 'b' or can it be a possibly garbage value too because of the local write or remote read not being atomic?
Thanks,
Sagar
Hi Sagar.
Local Read and Local Write are not atomic and you may get garbage...
If you want to guarantee atomicity, you must use the Atomic operations.
Thanks
Dotan
Thanks for the reply. I can see this happening when we are writing to large memory segments. Is this also true if we are writing to a single instance of a native data type (bits, bytes, integers, floats, etc.)?
If you don't use Atomic operations, there isn't any guarantee of atomic access, even for small (and native) data types.
Thanks
Dotan
Hi.
First of all I would say thank you for this site and your comments, they are very useful.
My question :
I know that the atomic operations are maybe not very popular, but I have to use them. I have modified the rdma-file example to send one uint64_t-sized structure. I am also using the example provided above. On the server side it is ok - I see when this structure changes. The problem is on the client side. I do not understand when and how I can check the swapped value: can I check it directly after ibv_post_send, or should I wait or do something different? Because right now I see nothing after ibv_post_send, but if I send back some message via a different MR, I see the swapped value. Can you give me a hint?
Hi Vasily.
Thanks for the feedback
:)
It isn't really true that atomic isn't popular - it depends on what you are trying to do...
If you want to examine the value on the client side (i.e. the side that calls ibv_post_send()),
this can be done only after the Send Request processing has ended, i.e. after the Work Completion of the corresponding Send Request was polled from the Completion Queue.
Thanks
Dotan
hi Dotan,
When I use ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr) to transfer one large message (200K) using one work request in UD mode, the parameters are wr->opcode=IBV_WR_SEND and wr->num_sge=1.
An IBV_WC_LOC_LEN_ERR error occurs on the send side. I am sure the receive buffer is large enough on the receive side.
Does this happen because MTU (4096) < 200K? Do I need to split the 200K message into multiple work requests?
Hi Songping yu.
A UD QP doesn't support messages larger than the path MTU:
this value is in the range 256-4096 bytes (depending on your subnet).
It is up to the application to split the (big) message into smaller messages
using multiple Work Requests, or to use a different QP transport type; a short sketch of the split appears below.
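A rough sketch of such a split on the sender side (the AH, remote QPN/QKey, lkey and MTU are placeholders; reassembly and completion polling are left to the application):
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>
/* Sketch only: split a large registered buffer into path-MTU-sized UD sends.
 * The receiver must post enough Receive Requests, and the Send Queue must be
 * drained by polling the Work Completions. */
static int post_chunked_ud_sends(struct ibv_qp *qp, struct ibv_ah *ah,
                                 uint32_t remote_qpn, uint32_t remote_qkey,
                                 char *buf, size_t total, uint32_t lkey,
                                 size_t mtu)
{
        size_t offset;
        for (offset = 0; offset < total; offset += mtu) {
                struct ibv_sge sge;
                struct ibv_send_wr wr, *bad_wr = NULL;
                sge.addr   = (uintptr_t)(buf + offset);
                sge.length = (total - offset < mtu) ? (uint32_t)(total - offset)
                                                    : (uint32_t)mtu;
                sge.lkey   = lkey;
                memset(&wr, 0, sizeof(wr));
                wr.sg_list           = &sge;
                wr.num_sge           = 1;
                wr.opcode            = IBV_WR_SEND;
                wr.send_flags        = IBV_SEND_SIGNALED;
                wr.wr.ud.ah          = ah;
                wr.wr.ud.remote_qpn  = remote_qpn;
                wr.wr.ud.remote_qkey = remote_qkey;
                if (ibv_post_send(qp, &wr, &bad_wr))
                        return -1;
        }
        return 0;
}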
Thanks
Dotan
Hi,
So does it mean that RC supports a max of 2GB and UD a max of 4KB?
Hi.
* Maximum message size of RC QPs is 2GB (unless one of the end nodes supports a lower value)
* Maximum message size of UD QPs is 4KB (unless one of the end nodes/switches in the path supports a lower value)
Thanks
Dotan
Hi Dotan.
two questions:
1. If I register a big memory block, can I send part of it by using an address offset, length and rkey?
2. If I register many MRs of different memory sizes, when I send a message with the RDMA SEND operation, how does the remote side select the receive MR?
Thanks!
Ben
Hi.
1) Yes. You can use only part of it in a Work Request.
2) The remote side posts several Receive Requests:
the incoming messages will consume the Receive Requests according to the order they were posted.
i.e. RR[0] will be consumed by message[0], etc.
Thanks
Dotan
Thank you very much, Dotan. These pages are super-useful as an IB API reference.
:)
Thanks for the great feedback
Dotan
Hi Dotan.
Thank you very much for your post and help!
Now I have a problem: when I use ibv_post_send, I get a return value of 12. Before ibv_post_send, I checked send_wr.sge.addr and it is valid. I paste some code here:
1) create qp:
qp_attr.cap.max_send_wr = 1024;
qp_attr.cap.max_send_sge = 1;
qp_attr.cap.max_recv_wr = 1024;
qp_attr.cap.max_recv_sge = 1;
qp_attr.send_cq = send_cq;
qp_attr.recv_cq = recv_cq;
qp_attr.qp_type = IBV_QPT_RC;
err = rdma_create_qp(cm_id, connection->pd, &qp_attr);
2)query qp attr
if (ibv_query_qp(connection->cm_id->qp, &attr, IBV_QP_STATE | IBV_QP_PATH_MTU | IBV_QP_CAP, &qp_attr))
{
printf("client query qp attr fail\n");
return RETURN_ERROR;
}
I found attr.cap.max_send_wr is equal to 2015, and attr.cap.max_recv_wr is equal to 1024, attr.cap.max_send_sge is equal to 2, attr.cap.max_recv_sge is equal to 1.
3)call ibv_post_send to send msg
memset(&sge, 0, sizeof(sge));
sge.addr = (uint64_t)cmd;
sge.length = sizeof(CMD_S);
sge.lkey = connection->connect_mr[MR_REQ].mr->lkey;
memset(&send_wr, 0, sizeof(send_wr));
send_wr.wr_id = (uint64_t)cmd;
send_wr.next = NULL;
send_wr.sg_list = &sge;
send_wr.num_sge = 1;
send_wr.opcode = IBV_WR_SEND;
send_wr.send_flags = IBV_SEND_SIGNALED;
ret = ibv_post_send(connection->cm_id->qp, &send_wr, &bad_wr);
if (ret != 0)
{
printf("client send connect cmd failed, ret=%d.\n", ret);
return RETURN_ERROR;
}
ret is equal to 12.
I am confused by the following questions:
1. I set max_send_wr to 1024 and max_send_sge to 1, but when I query the QP later they have changed: max_send_wr is 2015 and max_send_sge is 2. Why?
2. In my test, multiple pthreads call ibv_post_send. My test has two parameters: one is the thread number, the other is the queue depth per thread (the queue is used by the test, not an RDMA queue). My test ran well with 8 threads and a queue depth of 32, but got an error with 8 threads and a queue depth of 64, and ibv_post_send returned the error value 12.
Please give me some suggestions to help me find the key point to resolve the problem. Thanks.
I'd like to add that my test creates only one QP to send messages. 8 threads and a queue depth of 32 means that the QP sometimes has to handle 8*32 requests at a time. Is the QP limited to handling 256 requests when max_send_wr is set to 1024? And are there limits when we use the QP for send/RDMA read/RDMA write?
Hi.
The QP can handle Work Requests according to the max_send_wr that it was created with
(and this value is limited by the HCA capabilities).
However, please notice the following:
* The Send Requests will be processed according to their order in the QP
* RDMA Read & Atomic parallel processing is limited by max_rd_atomic and max_dest_rd_atomic
for QP as initiator and destination
Thanks
Dotan
Hi.
1. The RDMA device/low-level driver can provide more resources than the originally requested values, according to its needs and internal structure.
2. I suspect that the Send Queue is full, i.e. you have many outstanding Send Requests (posted Send Requests that haven't ended with a Work Completion yet).
You should either increase the rate of polling Work Completions from the CQ or increase the QP.max_send_wr value.
Thanks
Dotan
Hi Dotan.
Thanks for your answer!
But I'm still confused about what causes the send queue to be full. My test generates 256 requests in total at first and recycles them. So I think the RDMA send queue holds 256 work requests at most, and should not be full. Could you give me a more detailed explanation?
Hi.
A posted Work Request is considered outstanding until a Work Completion has been generated for it or for a Work Request after it.
When creating the QP, you specify the number of outstanding Work Requests for both the Send and Receive Queues of that QP.
I suspect that in your example you post many Send Requests to the QP and don't poll the Work Completions for them.
Thanks
Dotan
Hi Dotan,
First of all, thank you so much for the blog! It is tremendously helpful!
I'm not sure if this is the right place to ask this, but I'm having trouble with one of the sample programs from the RDMA Aware Programming User Manual. I'm not 100% certain, but I believe the problem has to do with ibv_post_send() so this was the best place I could think of to ask. The sample program is from Section 8.2 (Multicast Code Example Using RDMA CM). The basic description of this program is that a sender and receiver create a UD QP, join the multicast group, the sender posts a certain number of sends to the group, and the receiver waits to receive them. When I try to run the program, the sender successfully posts the sends, but the receiver never actually receives them. No errors are returned (from the sender or receiver); the receiver simply waits forever. However, if I add a sleep(1) call just before the sender calls ibv_post_send(), everything works correctly. At first I thought the problem was that the sender was posting the sends before the receives are posted by the receiver, but this does not appear to be the case. Are there any other reasons you know of that would explain why sleep() must be called before ibv_post_send() in this case? Or could this problem be caused by something else entirely and calling sleep() just appears to fix it? I'm not sure if this is a common issue or not; hopefully my question is not too vague. The code I am testing is from Revision 1.7 of the manual, but I can post or email it if that would help; just let me know. I greatly appreciate any help you can give me!
Thanks!
Hi.
Are you aware of the fact that there isn't any synchronization at all between the two sides in this test?
i.e. the sender sends a message, but the remote side may not be ready to receive it
(its QP isn't in the appropriate state, or a Receive Request wasn't posted, or it hasn't joined the multicast group yet).
This is the reason that adding a sleep to the sender solves the problem...
You can solve it by adding synchronization between the two sides, or by letting the server send again and again while waiting for an incoming response from the client.
Thanks
Dotan
Oh, I see. That makes sense. Thank you!
Hi Dotan. Like everyone else, thank you for such an informative resource for RDMA programming. My question: when ibv_post_send is used with one of the atomic opcodes (IBV_WR_ATOMIC_FETCH_AND_ADD or IBV_WR_ATOMIC_CMP_AND_SWAP), do you still need to poll for a completion event to be sure the atomic operation was successful? Or will the operation have completed when ibv_post_send returns?
Hi.
Atomic operations, like any other operation, end when there is a Work Completion for them
(or for any other Send Request that was posted after them).
When ibv_post_send() returns, it means that the low-level driver has enqueued this Send Request to the RDMA device
for future processing.
Thanks
Dotan
Hi Dotan.
Thank you for such a guideline of rdma programing!
And I have some trouble with IBV_WR_SEND in UD. I use doorbell batching to post my sends (i.e. wr[i].next = &wr[i+1]). However, only the data of the last wr in the batch is received. I am sure that there is no error thrown in my code, because if I replace IBV_WR_SEND with IBV_WR_SEND_WITH_IMMEDIATE it works for the same code: the headers arrive correctly. Also, if I just use a post_send for each wr, it works. I think something on the sender side is wrong.
Hope that you can give me some advice!
Thanks!
Hi.
Please make sure that there isn't any race between the sides,
and that when a message arrives at the remote side:
1) The remote QP is in (at least) the RTR state
2) There are already Receive Requests available in the remote QP
3) The posted Receive buffers are big enough (i.e. at least the message size + 40 bytes for the GRH); see the sketch below
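For point 3, a minimal sketch of a matching Receive Request on the UD receiver side (the buffer, its lkey and the expected message size are placeholders):
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>
/* Sketch only: on a UD QP the receive buffer must have room for the payload
 * plus 40 bytes for the GRH that is scattered in front of it; 'buf' is assumed
 * to be at least msg_size + 40 bytes of memory registered with lkey */
static int post_ud_recv(struct ibv_qp *qp, void *buf, uint32_t msg_size,
                        uint32_t lkey)
{
        struct ibv_sge sge;
        struct ibv_recv_wr wr, *bad_wr = NULL;
        sge.addr   = (uintptr_t)buf;
        sge.length = msg_size + 40;      /* payload + GRH */
        sge.lkey   = lkey;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id   = (uintptr_t)buf;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        return ibv_post_recv(qp, &wr, &bad_wr);
}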
Thanks
Dotan
Hi Dotan,
I have a question. When I query my device I get that max_qp_rd_atom is 16. So is more than 16 not possible? Why is it specific to RDMA Read operations? I do not see any problem when more than 16 Work Requests are posted for RDMA Read. What does attr.max_qp_rd_atom mean?
Hi.
RDMA Read operations require special resources and handling on both the send and receive sides,
so this is the reason for the limitation.
Configuring QP.max_rd_atomic limits the number of RDMA Reads that are handled by the QP at any time;
you may post as many RDMA Read operations as you want, and the RDMA device will limit the processing.
Thanks
Dotan
Hi Dotan, I have read many of your articles to learn RDMA programming.
Now I have some problems and tried to find an answer in RDMA_Aware_Programming_User_Manual.pdf (Version 1.7) and the IB Specification Vol 1-Release-1.3-2015-03-03.pdf, but haven't found one, so I have to turn to you for help. The problem is: when I post a work request to a queue pair, the NIC gets a notification and fetches the work request from memory into the NIC cache by DMA, but when the NIC sends the data contained in the work request onto the cable, does it need to fetch the queue pair information into the NIC cache? I know that the NIC cache stores the queue pair data, memory address translation data and some network data, but when the NIC sends data, is the queue pair information necessary?
Hi.
When sending data, the RDMA device needs to fetch QP information:
* QP state
* PKey index
* Qkey (for UD QPs, in specific scenarios)
* Remote side attributes (for connected QPs)
So, the answer is yes.
Thanks
Dotan
Hi Dotan,
If I want to use ibv_post_send(), since we already have IBV_WR_SEND, why do we need IBV_WR_RDMA_WRITE? Is there any performance difference between these two approaches?
Hi.
Yes. There is a performance difference:
* A Send operation consumes a Receive Request on the remote side
* An RDMA Write operation doesn't, so a PCI read (to fetch the Receive Request) is avoided, which gives better latency
Thanks
Dotan
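For reference, a sketch of what actually changes in the Work Request between the two opcodes; qp, mr, buf, MSG_SIZE, remote_addr and rkey are placeholders for objects assumed to already exist (the remote address and rkey would have been exchanged out-of-band):

/* Sketch: Send vs RDMA Write differ only in the opcode and, for RDMA Write,
 * the remote address and rkey */
struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = MSG_SIZE, .lkey = mr->lkey
};
struct ibv_send_wr wr = {
        .sg_list = &sge, .num_sge = 1, .send_flags = IBV_SEND_SIGNALED
};
struct ibv_send_wr *bad_wr;

wr.opcode = IBV_WR_SEND;               /* consumes a Receive Request remotely */
/* ...or... */
wr.opcode = IBV_WR_RDMA_WRITE;         /* goes directly to remote memory      */
wr.wr.rdma.remote_addr = remote_addr;
wr.wr.rdma.rkey        = rkey;

if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;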
Great! Thanks Dotan.
thanks for your answer! Dotan
After reading all the conversations in this post above, I have one more question (sorry for disturbing you).
The question is: when, where and how is the necessary QP information collected for posting a Send WR?
First, please allow me to lay out the procedure and explain my understanding.
When I post an ibv_send_wr *wr using ibv_post_send(), the following happens:
1. With no context switch, in the same context, ibv_post_send() transforms the ibv_send_wr *wr (the libibverbs abstraction) into a WQE (the HW-specific Send Request, as described in the Ethernet Adapter programming manual). Constructing the WQE requires a Ctrl segment, Eth segment, Memory Management segment and Data segment, and the Ctrl segment includes the SQ number attribute (which seems to be the necessary information about the QP).
2. After constructing the new WQE, it is written to the WQE buffer and the Doorbell record associated with that queue is updated (the ibv_post_send() API returns).
3. The device gets the notification and asynchronously processes these new WQEs.
4. After the Work Request is processed, the NIC writes CQEs to the relevant CQ by DMA.
5. I poll the CQ and get notifications.
OK, that is the whole procedure. Is there any error in it?
From the procedure above, can I guess that collecting the necessary QP information happens while transforming the ibv_send_wr into a WQE (i.e. while calling ibv_post_send())?
And another question (sorry for my curiosity): as far as I know, at the software level the QP number is the unique identifier used to steer a message flow to the corresponding QP, while at the hardware level the GID and port are the unique identifiers used to steer the packet flow. So, to summarize the question above: can I treat "fetching QP information for a Work Request" as "fetching the QP number and other non-unique information"?
Sorry for so many words, but I am really interested in this part. If I expressed anything poorly, please point it out and I will improve. Thanks for your patience, Dotan!
Hi.
This is an interesting question.
After the following step:
"2.after constructing new WQE,writing the WQE to the WQE buffer,and update Doorbell record associated with that queue.(ibv_post_send api returns)"
The WQE was enqueued to the RDMA device for processing; when the processing will actually start the RDMA device needs to collect relevant information for the QP:
* The QP type
* Remote QP number (for connected QP)
* Path to the remote QP (for connected QP)
* Send PSN
* more
I hope that I answered your question.
Thanks
Dotan
Hi,Dotan
I got it. There are still so many things the device needs to do.
Sorry for my recklessness; I should carefully read the driver source code and then ask my questions. But I do learn a lot from your detailed articles. Thanks for your patience and generosity.
:)
Hi Dotan
I want to transfer data from serverA's memory to serverB's memory, so I use ibv_post_send() to do an RDMA Write. If the return value of ibv_post_send() is zero, does it mean that the data has been transferred from serverA's memory to serverB's memory?
Hope that you can give me some advice!
Thanks!
Hi.
No.
If ibv_post_send() returns with the value 0,
this means that this Send Request was handed to the RDMA device for further processing.
If this is a reliable transport type and there is a Work Completion with the SUCCESS status,
this means that the data was written to the remote memory successfully.
Thanks
Dotan
Hi Dotan:
I am new to RDMA and I tried to do an RDMA RC Write. Everything works fine when the message size is smaller than the MTU. However, when I set my message size larger than the MTU, the side which posts the Write is not able to get any Write completion in the CQ, even though the remote side already has the complete data in the registered memory. There is no error message on either side. The side which posts the Write is stuck in the while loop of ibv_poll_cq(). I would like to ask what the problem might be.
Thanks,
Sylvia
Hi.
Are you using RoCE or InfiniBand?
Did you configure the same MTU on both sides?
Thanks
Dotan
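For reference, a sketch of how the effective MTU can be checked on each side; qp is assumed to already exist, and port number 1 is just an example:

/* Sketch: checking which path MTU the QP uses and what the port supports */
struct ibv_qp_attr qp_attr;
struct ibv_qp_init_attr init_attr;
struct ibv_port_attr port_attr;

if (ibv_query_qp(qp, &qp_attr, IBV_QP_PATH_MTU, &init_attr))
        return -1;
if (ibv_query_port(qp->context, 1, &port_attr))
        return -1;
/* qp_attr.path_mtu, port_attr.active_mtu and port_attr.max_mtu are enum ibv_mtu
 * values (IBV_MTU_256 .. IBV_MTU_4096); both sides should agree on path_mtu */
printf("path_mtu=%d active_mtu=%d max_mtu=%d\n",
       qp_attr.path_mtu, port_attr.active_mtu, port_attr.max_mtu);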
Hi Dotan,
I wrote a ping-pong program with IBV_WR_SEND; it is server/client like. The problem I met was that sending and receiving 1M messages of 4096 bytes took 26s, while the ibv_post_send() calls took 9s. Is this normal? Or is there any reason that would lead to ibv_post_send() blocking?
Hi.
What do you mean by "the ibv_post_send() call took 9s"?
First of all, this is far too much time for a fast network; seconds are "infinite" here.
Second, I need to understand what you did in order to give an answer.
Thanks
Dotan
hello, Dotan!
I met the problem that many others mentioned. When I repeatedly write and read remote memory, I get ENOMEM. I tried to empty the CQ at both the client and the server using ibv_poll_cq(), but it didn't work. Please help me! Thanks :)
/* my code looks roughly like this: */
while (1) {
    ...
    send_wr.opcode  = IBV_WR_RDMA_WRITE;
    send_wr.sg_list = &sge;
    ...
    ret = ibv_post_send(cm_id->qp, &send_wr, &bad_send_wr);
    ...
    send_wr.opcode  = IBV_WR_RDMA_READ;
    send_wr.sg_list = &sge;
    ...
    ret = ibv_post_send(cm_id->qp, &send_wr, &bad_send_wr);
    if (ret == EINVAL) {
        printf("invalid value provided in wr\n");
    } else if (ret == ENOMEM) {
        printf("send queue is full\n");
        /* try to drain the CQ so the Send Queue can make progress */
        do {
            ne = ibv_poll_cq(cq, 1, &wc);
            if (ne < 0) {
                fprintf(stderr, "Failed to poll completions from the CQ: ret = %d\n", ne);
                break;
            }
            /* there may be an extra event with no completion in the CQ */
            if (ne == 0)
                continue;
            if (wc.status != IBV_WC_SUCCESS) {
                fprintf(stderr, "Completion with status 0x%x was found\n", wc.status);
                break;
            }
        } while (ne);
    } else if (ret == EFAULT) {
        printf("invalid value provided in qp\n");
    } else if (ret) {
        printf("other failure; no change was done to the qp\n");
    }
}
Hi.
There are 2 options:
1) There aren't any Work Completions (and there won't be), since you didn't request their generation
(ibv_qp_init_attr.sq_sig_all for all Send Requests in that QP, or ibv_send_wr.send_flags per specific Send Request)
2) The processing is still ongoing,
for example: if there is a retransmission and the timeout is very high (or infinite).
Did you read any Work Completion from that CQ?
(from the Send Queue)
Thanks
Dotan
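For reference, a sketch of the two ways to request Work Completions for Send Requests; pd, cq and send_wr are placeholders for objects assumed to already exist:

/* 1) At QP creation time: every Send Request will generate a Work Completion */
struct ibv_qp_init_attr init_attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap     = { .max_send_wr = 64, .max_recv_wr = 64,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type    = IBV_QPT_RC,
        .sq_sig_all = 1
};
struct ibv_qp *qp = ibv_create_qp(pd, &init_attr);

/* 2) Per Send Request (when sq_sig_all == 0): only signaled WRs generate one */
send_wr.send_flags |= IBV_SEND_SIGNALED;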
Hi Dotan,
Can Atomic ops (CAS and FAA) be made inline? I can see in the documentation that it is supported in the experimental verbs, but I can't find any source that explains it for the normal case.
Hi.
Experimental verbs are vendor specific,
and I prefer not to answer such questions.
Please contact the relevant vendor support team/developers.
(sorry)
Dotan
Hi,
I have a question about RDMA write speeds.
If I issue 1000 small RDMA WRITE requests (IBV_OPCODE_RDMA_WRITE_ONLY) with a payload of ~2000 bytes each, would the data arrive faster or slower than issuing a single RDMA WRITE of size (2000 bytes * 1000)?
Hi.
Sorry, I missed this question; I'll answer it in case someone asks this in the future.
The performance of one big message is better compared to many small messages.
The reasons are:
* With one big message: only one WQE is fetched instead of many WQEs (fetch time is saved, fewer cache misses, fewer completions, etc.)
* With one big message: the total number of packets can be smaller (it can better utilize the MTU)
* Fewer ACKs will be sent
Thanks
Dotan
Hi,
I am trying to implement a basic client-server using UD instead of RC. Does anyone have notes or code for UD? I don't see much about UD anywhere on the internet.
Thanks
Hi.
You can find an example to a UD program in the following URL:
https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/examples/ud_pingpong.c
I'll write a post about the differences between the various transport types and about porting programs between them soon.
Thanks
Dotan
Hi Dotan. As always, thank you for the very helpful website. Regarding the atomic opcodes, the descriptions of both IBV_WR_ATOMIC_FETCH_AND_ADD and IBV_WR_ATOMIC_CMP_AND_SWP state that the updates to the remote virtual memory specified in atomic.remote_addr are done atomically, but imply that the writing of the original data to the local sg_list is not atomic. When I issue one of these calls and then get a successful completion event for it, what is the state of the atomic operation at that point? Does it mean the atomic operation on the remote side is complete and the local sg_list contains the original value from the remote side? With a typical RDMA Write, a successful completion event means the data has been successfully written to the remote HCA (per earlier comments in this post) and I can re-use the local sg_list buffers. However, for the atomic operations, the sg_list buffers aren't the source of the data, but rather the destination for the original value from the remote side. So I'm trying to understand exactly what a successful CQ event for an atomic operation tells me about the state of that operation. How do I know when it's complete?
Hi.
The writing of data to the *local* s/g list isn't atomic;
the access to the remote address is done in an atomic way.
Once you get a successful Work Completion for the atomic Send Request, the remote update is done and the original remote value is already available in the local s/g buffers.
Thanks
Dotan
Hi,
IBV_EXP_SEND_INLINE:
This means that the low-level driver (i.e. the CPU) will read the data and not the RDMA device.
I am not quite clear about this line. I assume that in most NIC interfaces there is some WQE entry, and this means that the NIC driver will read the data and put it inside the WQE. Then there will be fewer PCIe DMA transactions. Am I right?
Hi.
Yes, inline means that the low-level driver (i.e. the CPU) will read the data,
and not the RDMA device itself.
This prevents an extra PCIe DMA transaction to fetch the data,
and provides better latency compared to non-inline for small messages.
Thanks
Dotan
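For reference, a sketch of requesting inline support at QP creation and posting an inline send; pd, cq and the connection setup are assumed to already exist, and the requested inline size of 64 bytes is an arbitrary example:

/* Sketch: inline sends */
struct ibv_qp_init_attr init_attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 64, .max_recv_wr = 64,
                 .max_send_sge = 1, .max_recv_sge = 1,
                 .max_inline_data = 64 },      /* ask for 64 inline bytes */
        .qp_type = IBV_QPT_RC
};
struct ibv_qp *qp = ibv_create_qp(pd, &init_attr);
/* after ibv_create_qp(), init_attr.cap.max_inline_data holds the granted value */

char small_msg[32] = "ping";
struct ibv_sge sge = {
        .addr = (uintptr_t)small_msg, .length = sizeof(small_msg), .lkey = 0
};
struct ibv_send_wr wr = {
        .sg_list = &sge, .num_sge = 1, .opcode = IBV_WR_SEND,
        .send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED
};
struct ibv_send_wr *bad_wr;
/* with IBV_SEND_INLINE the data is copied by the CPU, the lkey isn't checked,
 * and small_msg can be reused as soon as ibv_post_send() returns */
int ret = ibv_post_send(qp, &wr, &bad_wr);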
Hi Dotan, I have a few questions about the behavior of RC QP:
if 2 linked WRITE WRs are posted to the same QP (the first being WRITE and the second being WRITE_WITH_IMM), does the completion of the second WR at the peer guarantee the completion of the first?
My guess is that, since the QP has connection behavior, it does. Does it hold for UD QPs?
To delve a little deeper, after which low-level event is the send completion triggered in an RC QP?
1) After the buffer posted for write has been fully put on wire
2) After the ACK was received from the recipient
Also, is the remote memory already written to before the peer responds with an ACK?
What is guaranteed in terms of ordering for WRs posted to non-RC QPs?
PS: Thank you for keeping this blog afloat. You're great help.
Hi.
In an RC QP, there is a PSN (Packet Sequence Number) that guarantees the order of the messages;
in your scenario, the completion of the second message guarantees the arrival (and completion) of the first one.
In a UD QP, there isn't any RDMA Write support; only Sends (with or without immediate).
Every Send will generate a completion on the remote side upon completion.
The local side will get a completion (if requested) once an ACK is received for this or any subsequent message.
The remote side sends an ACK once the data was DMAed to its memory, and creates a completion (if needed).
I hope that this was clear.
Thanks
Dotan
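For reference, a sketch of the scenario from the question: two chained RC Work Requests where only the second is signaled, relying on the ordering guarantee. data_buf, flag_buf, DATA_SIZE, mr, remote_data_addr, remote_flag_addr, rkey and qp are placeholders for objects assumed to already exist:

/* Sketch: WRITE followed by WRITE_WITH_IMM; a completion for the second
 * implies the first was also executed */
struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)data_buf, .length = DATA_SIZE,        .lkey = mr->lkey },
        { .addr = (uintptr_t)flag_buf, .length = sizeof(uint64_t), .lkey = mr->lkey }
};
struct ibv_send_wr wr[2], *bad_wr;
memset(wr, 0, sizeof(wr));

wr[0].opcode  = IBV_WR_RDMA_WRITE;             /* the payload, unsignaled */
wr[0].sg_list = &sge[0];
wr[0].num_sge = 1;
wr[0].wr.rdma.remote_addr = remote_data_addr;
wr[0].wr.rdma.rkey        = rkey;
wr[0].next    = &wr[1];

wr[1].opcode     = IBV_WR_RDMA_WRITE_WITH_IMM; /* consumes a Receive Request remotely */
wr[1].sg_list    = &sge[1];
wr[1].num_sge    = 1;
wr[1].imm_data   = htonl(0x1234);
wr[1].send_flags = IBV_SEND_SIGNALED;
wr[1].wr.rdma.remote_addr = remote_flag_addr;
wr[1].wr.rdma.rkey        = rkey;

if (ibv_post_send(qp, &wr[0], &bad_wr))
        return -1;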
Hi Dotan,
The S/G list in ibv_post_send() is a nice way to consolidate data transfers. However, can I initiate an operation writing to multiple locations on the remote side (or reading from multiple locations on the remote side)? This feature is appealing because it efficiently uses the wide RDMA lanes for multiple small writes. I also checked the Intel qsm APIs and the Cray gni APIs. It seems no one supports such a feature; let's call it "writer-controlled remote scatter". Is there a deep reason this is not supported?
Hi.
RDMA operations don't (currently?) support this feature;
it isn't defined in any spec.
IMHO, it doesn't have any benefit compared to posting several Send Requests...
Thanks
Dotan
Hi Dotan.
What will happen when I run the loopback test?
Will the data included in the send WQE be fetched to the NIC TX cache and go through the IB protocol, UDP protocol and IP protocol processing and return to the NIC RX cache, then be delivered to the receive WQE?
Thank you very much.
Haonan Qiu
Hi.
Here is a post about it:
https://www.rdmamojo.com/2018/12/29/loopback-messages-in-rdma/
In (a very short) summary:
the answer is yes: a WQE will be fetched and processed, and the data will be DMA'ed locally;
no memcpy() by any SW stack will be done (as is done in loopback over an Ethernet interface).
Thanks
Dotan
Hi,
I have a question about concurrent RDMA operations. Machine A sends an RDMA Write to data item x stored in machine C; meanwhile, machine B sends an RDMA Read to the same data item x in machine C through a different QP. I understand there is a race condition, but I'm not sure whether it is possible that machine B reads corrupted data (data item x only partially modified by machine A)?
Thank you for your answer in advance! And any hint about where I can find the answer is also appreciated!
Hi.
I can't give a good answer here;
the behavior is implementation-dependent: it depends on the RDMA device, the memory controller, the cache system...
Unless you synchronize the memory access, there isn't any guarantee about the possible result of such an operation.
If you try it many times, you may get the same results (for example, always reading in 64/128-byte blocks), but there isn't any guarantee that this will always be the case.
Thanks
Dotan
Hi,
Thank you for such a useful blog!
And I have one question: when does the remote RNIC create an ACK for an RDMA Write: when the request is cached in the remote RNIC, or only once it is DMAed to remote memory and completed?
The workload is: I have an array on the local server (init value: array[i] = 0), and a count showing which is the latest valid element (an element array[i] is valid if array[i] == i). Then I allocate two QPs, array_qp and count_qp, each connected to a different RNIC on the remote server.
The workflow is a while loop containing the following ordered steps:
1. local_array[count] = count
2. array_qp RDMA Writes local_array[count] to remote_array[count]
3. poll array_qp's completion // assuming local_array[count] is written successfully
4. ++count
5. array_qp RDMA Writes count to remote_count
6. selectively poll array_qp's completion
I assume that, on the remote server, if it observes remote_count == i, then array[j] == j for every element with j <= i.
However, the assumption fails. It seems that the remote server observes remote_count == i while array[i] != i. The key point is that array_qp and count_qp are connected to different RNICs; if they are connected to the same RNIC, there is no problem.
This really confused me.
Thanks
Hi.
An RDMA Write request on the responder side is written over PCI once the message arrives.
In theory, it should work no matter whether you are using one or more RNICs.
I suspect that the problem you see is related to the system memory management (NUMA?).
Try to make sure that the writes go to different cache lines.
Thanks
Dotan
Hi, Dotan,
I have a question here,
Is there any method by which the receive side can become aware that the transmission is done when the sender uses the IBV_WR_RDMA_WRITE opcode?
If not, is IBV_WR_RDMA_WRITE_WITH_IMM the replacement method?
Should I pre-post ibv_post_recv() at the receive side, and what other steps should I take?
I have tried
sge.addr = (uintptr_t)buf;
sge.length = sizeof(uint32_t);
sge.lkey = mr->lkey;
recv_wr.wr_id = 0;
recv_wr.next = NULL;
recv_wr.sg_list = &sge;
recv_wr.num_sge = 1;
if (ibv_post_recv(cm_id->qp,&recv_wr,&bad_recv_wr))
return 1;
but get completion error.
thanks
Hi.
The receiver side isn't aware of the fact that an RDMA Write was performed (except maybe by reading the memory buffers);
if you need to sync in an "RDMA way", RDMA Write with immediate is a good solution.
This is the right code to post a Receive Request.
I don't understand what you mean by "but get completion error".
Thanks
Dotan
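For reference, a sketch of how the "RDMA way" sync can look with RDMA Write with immediate; qp, cq, mr, buf, BUF_SIZE, remote_addr and rkey are placeholders for objects assumed to already exist on the relevant side:

/* sender side: RDMA Write with immediate */
struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = BUF_SIZE, .lkey = mr->lkey };
struct ibv_send_wr wr = {
        .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_RDMA_WRITE_WITH_IMM,
        .send_flags = IBV_SEND_SIGNALED,
        .imm_data = htonl(42)
};
wr.wr.rdma.remote_addr = remote_addr;
wr.wr.rdma.rkey        = rkey;
struct ibv_send_wr *bad_wr;
if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

/* receiver side: a Receive Request must already be posted; the immediate
 * value arrives in the Receive Work Completion, not in the memory buffer */
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
if (wc.status == IBV_WC_SUCCESS && wc.opcode == IBV_WC_RECV_RDMA_WITH_IMM)
        printf("data arrived, immediate = 0x%x\n", ntohl(wc.imm_data));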
Hi Dotan,
Thanks for reply.
Sorry for my poor explanation.
At the sender, after setting up the CQ and QP, I pre-post a Receive Request and then post a send with IBV_WR_RDMA_WRITE_WITH_IMM. Finally I use ibv_get_cq_event() and ibv_poll_cq().
However, when I execute the sender, it outputs:
"mlx5: localhost.localdomain: got completion with error"
When I print the wc status it returns "error RNR retry counter exceeded".
At the receiver, after creating the CQ and QP, I pre-post a Receive Request and call ibv_get_cq_event() and ibv_poll_cq().
Is there anything I am doing wrong?
Thanks,
BR
Hi.
Is there any synchronization between the sides?
Is it possible that when the sender QP posts the RDMA_WRITE_WITH_IMM there isn't any Receive Request on the receiver side yet?
This would be the reason for the RNR completion.
You can resolve it by adding synchronization or by increasing the RNR attributes of the QPs.
Thanks
Dotan
Thanks
Dotan
Hi Dotan,
it seems that the usage of Memory Windows is not discussed here. Is it possible to deny in-flight RDMA Writes using a Memory Window (or in another way)? Thank you very much.
Hi.
I didn't write any posts on Memory Windows (yet?);
a Memory Window has permissions: it can allow or deny incoming RDMA Write, Read or Atomic access to it.
I hope it is clear.
Thanks
Dotan
Hi Dotan,
I have a question regarding the use of ibv_send_wr and ibv_sge in a setting where the first request in a batch is signaled. There is a preconfigured array of struct ibv_send_wr and struct ibv_sge. Once the call to ibv_post_send() returns, is it safe to reuse the ibv_send_wr and ibv_sge, i.e. does ibv_post_send() guarantee that the RDMA NIC has completed the DMA for the linked list of Work Requests before it returns? Also, thanks for maintaining this wonderful blog.
Hi.
The answer is yes:
Once ibv_post_send() has returned, you can safely reuse the ibv_sge and ibv_send_wr structures
(it doesn't matter whether the Send Request is signaled or not).
ibv_post_send() copies the ibv_send_wr to the Send Queue of the QP
(after translating it to the descriptor that the RDMA device "understands").
However, you can't reuse the buffers that the ibv_sge points to
until you get a Work Completion for it (or for a Send Request that was posted after it).
Thanks
Dotan
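For reference, a sketch of the difference between reusing the structures and reusing the data buffer; qp, cq, mr, buf and MSG_SIZE are placeholders for objects assumed to already exist:

/* Sketch: WR/SGE reuse vs buffer reuse */
struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = MSG_SIZE, .lkey = mr->lkey };
struct ibv_send_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1,
                          .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED };
struct ibv_send_wr *bad_wr;
if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

/* safe immediately: wr and sge were copied into the Send Queue */
memset(&wr, 0, sizeof(wr));
memset(&sge, 0, sizeof(sge));

/* NOT safe yet: buf is still owned by the RDMA device */
struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
if (wc.status == IBV_WC_SUCCESS)
        memset(buf, 0, MSG_SIZE);   /* now buf may be reused */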
Hi Dotan,
Thanks for this blog!
A quick question: is there a way to find out whether or not an RDMA CAS operation successfully swapped the value in the remote memory?
Best,
Shubham
Hi.
For a Cmp&Swap:
At the responder side, you can't know whether or not the value was swapped
(since this is an RDMA operation).
At the requestor side, you will get back the original value from the remote address
(so, if you keep the relevant info, you can know whether the remote value was swapped or not).
I hope that this helped you
Dotan
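For reference, a sketch of detecting whether a compare-and-swap actually swapped, using the returned original value; qp, cq, mr, old_val_buf, remote_addr and rkey are placeholders for objects assumed to already exist:

/* Sketch: CAS success detection */
uint64_t expected = 0, new_val = 1;
struct ibv_sge sge = { .addr = (uintptr_t)old_val_buf, .length = 8, .lkey = mr->lkey };
struct ibv_send_wr wr = {
        .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_ATOMIC_CMP_AND_SWP,
        .send_flags = IBV_SEND_SIGNALED
};
wr.wr.atomic.remote_addr = remote_addr;
wr.wr.atomic.compare_add = expected;   /* swap only if remote value == expected */
wr.wr.atomic.swap        = new_val;
wr.wr.atomic.rkey        = rkey;
struct ibv_send_wr *bad_wr;
if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

struct ibv_wc wc;
while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;
/* old_val_buf now holds the original remote value; the swap happened
 * if and only if that value equals 'expected' */
if (wc.status == IBV_WC_SUCCESS && *(uint64_t *)old_val_buf == expected)
        printf("swap succeeded\n");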
Hello Mr. Barak,
In your reply to Weijia, in this post you said:
"IMHO, it (opposite s/g - an operation writing to multiple locations on the remote side or reading from multiple locations on the remote side) doesn't have any benefit compared to posting several Send Requests"
Could you explain a bit more about it?
I do see a difference in time between a single request for a contiguous range and the same range divided into many requests. An opposite s/g would help me a lot.
Thanks, Alla
Hi.
This feature (i.e. writing to/reading from multiple remote locations in one message) would add more complexity:
* The Send Request would be more complicated, since you would need to provide a list of remote addresses + keys
* Each RDMA Write/Read request would need to specify a list of remote addresses + keys (the headers would be larger, or there would be multiple headers)
* The headers would then have a variable size (depending on the number of accessed remote blocks)
So IMHO, eventually, to make everything work for one remote block and for multiple remote blocks,
it would be very similar to processing multiple Send Requests
(this is implementation dependent, but each remote block access request would be treated as if it was posted in a different Send Request).
I agree that posting one Send Request is better than posting multiple Send Requests,
but IMHO the complexity it brings is high and the ROI is low.
This is my opinion...
Dotan
Hi, I have a question about IBV_SEND_INLINE. When I looked for samples, I found that some people use it like this:
send_flags = IBV_SEND_INLINE
while others use:
send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED
Could someone tell me the difference between them and their usage scenarios?
Hi.
IBV_SEND_INLINE tells the low-level driver to read the payload data (i.e. using the CPU) instead of the RDMA device.
IBV_SEND_SIGNALED tells the RDMA device to generate a Work Completion at the end of processing this Send Request.
Thanks
Dotan
Hi, nice to meet you.
I am confused about how to specify rdma_post_send()'s opcode.
When using the verbs API ibv_post_send(), I know I specify the opcode by setting ibv_send_wr.opcode = IBV_WR_RDMA_READ;
could you tell me how to specify the opcode when using rdma_post_send()?
Hi.
rdma_post_send() is actually a wrapper over ibv_post_send()
(it may do more things, but at the end of the day - it is a wrapper).
Thanks
Dotan
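As a hedged pointer (please verify against <rdma/rdma_verbs.h> on your system, since the exact signatures may differ between librdmacm versions): librdmacm appears to provide a separate wrapper per opcode, so the opcode is chosen by the function you call rather than by a field you set. cm_id, buf, len, mr, remote_addr and rkey below are placeholders for objects assumed to already exist:

/* Sketch: per-opcode wrappers in librdmacm (signatures to be verified locally) */
/* posts a Send Request with opcode IBV_WR_SEND */
rdma_post_send(cm_id, NULL, buf, len, mr, IBV_SEND_SIGNALED);

/* posts a Send Request with opcode IBV_WR_RDMA_READ from remote_addr/rkey into buf */
rdma_post_read(cm_id, NULL, buf, len, mr, IBV_SEND_SIGNALED, remote_addr, rkey);

/* posts a Send Request with opcode IBV_WR_RDMA_WRITE of buf to remote_addr/rkey */
rdma_post_write(cm_id, NULL, buf, len, mr, IBV_SEND_SIGNALED, remote_addr, rkey);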