ibv_poll_cq()

int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);

Description

ibv_poll_cq() polls Work Completions from a Completion Queue (CQ).

A Work Completion indicates that a Work Request in a Work Queue associated with the CQ, and all of the outstanding unsignaled Work Requests that were posted to that Work Queue before it, are done. Every Receive Request, every signaled Send Request and every Send Request that ended with an error generates a Work Completion when its processing ends.

When a Work Request ends, a Work Completion is added to the tail of the CQ that its Work Queue is associated with. ibv_poll_cq() checks whether Work Completions are present in a CQ and pops them from the head of the CQ in the order they entered it (FIFO). After a Work Completion has been popped from a CQ, it can't be returned to it.

One should consume Work Completions at a rate that prevents the CQ from being overrun (holding more Work Completions than the CQ size). In case of a CQ overrun, the async event IBV_EVENT_CQ_ERR will be triggered, and the CQ cannot be used anymore.

The struct ibv_wc describes the Work Completion attributes.

struct ibv_wc {
	uint64_t		wr_id;
	enum ibv_wc_status	status;
	enum ibv_wc_opcode	opcode;
	uint32_t		vendor_err;
	uint32_t		byte_len;
	uint32_t		imm_data;
	uint32_t		qp_num;
	uint32_t		src_qp;
	int			wc_flags;
	uint16_t		pkey_index;
	uint16_t		slid;
	uint8_t			sl;
	uint8_t			dlid_path_bits;
};

Here is the full description of struct ibv_wc:

wr_id The 64-bit value that was associated with the corresponding Work Request
status Status of the operation. The value can be one of the following enumerated values and their numeric value:

  • IBV_WC_SUCCESS (0) - Operation completed successfully: this means that the corresponding Work Request (and all of the unsignaled Work Requests that were posted previous to it) ended and the memory buffers that this Work Request refers to are ready to be (re)used.
  • IBV_WC_LOC_LEN_ERR (1) - Local Length Error: this happens if a Work Request that was posted in a local Send Queue contains a message that is greater than the maximum message size supported by the RDMA device port that should send the message, or if an Atomic operation whose size is different from 8 bytes was sent. This may also happen if a Work Request that was posted in a local Receive Queue isn't big enough to hold the incoming message, or if the incoming message size is greater than the maximum message size supported by the RDMA device port that received the message.
  • IBV_WC_LOC_QP_OP_ERR (2) - Local QP Operation Error: an internal QP consistency error was detected while processing this Work Request. This happens if a Work Request that was posted in a local Send Queue of a UD QP contains an Address Handle whose Protection Domain differs from the Protection Domain of the QP, or if the opcode isn't supported by the transport type of the QP (for example: RDMA Write over a UD QP).
  • IBV_WC_LOC_EEC_OP_ERR (3) - Local EE Context Operation Error: an internal EE Context consistency error was detected while processing this Work Request (unused, since it’s relevant only to RD QPs or EE Context, which aren’t supported).
  • IBV_WC_LOC_PROT_ERR (4) - Local Protection Error: the locally posted Work Request’s buffers in the scatter/gather list do not reference a Memory Region that is valid for the requested operation.
  • IBV_WC_WR_FLUSH_ERR (5) - Work Request Flushed Error: A Work Request was in process or outstanding when the QP transitioned into the Error State.
  • IBV_WC_MW_BIND_ERR (6) - Memory Window Binding Error: a failure happened when trying to bind a MW to a MR.
  • IBV_WC_BAD_RESP_ERR (7) - Bad Response Error: an unexpected transport layer opcode was returned by the responder. Relevant for RC QPs.
  • IBV_WC_LOC_ACCESS_ERR (8) - Local Access Error: a protection error occurred on a local data buffer during the processing of a RDMA Write with Immediate operation sent from the remote node. Relevant for RC QPs.
  • IBV_WC_REM_INV_REQ_ERR (9) - Remote Invalid Request Error: The responder detected an invalid message on the channel. Possible causes include the operation is not supported by this receive queue (qp_access_flags in remote QP wasn't configured to support this operation), insufficient buffering to receive a new RDMA or Atomic Operation request, or the length specified in a RDMA request is greater than 2^31 bytes. Relevant for RC QPs.
  • IBV_WC_REM_ACCESS_ERR (10) - Remote Access Error: a protection error occurred on a remote data buffer to be read by an RDMA Read, written by an RDMA Write or accessed by an atomic operation. This error is reported only on RDMA operations or atomic operations. Relevant for RC QPs.
  • IBV_WC_REM_OP_ERR (11) - Remote Operation Error: the operation could not be completed successfully by the responder. Possible causes include a responder QP related error that prevented the responder from completing the request or a malformed WQE on the Receive Queue. Relevant for RC QPs.
  • IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter Exceeded: The local transport timeout retry counter was exceeded while trying to send this message. This means that the remote side didn't send any Ack or Nack. If this happens when sending the first message, it usually means that the connection attributes are wrong or the remote side isn't in a state in which it can respond to messages. If this happens after sending the first message, it usually means that the remote QP isn't available anymore. Relevant for RC QPs.
  • IBV_WC_RNR_RETRY_EXC_ERR (13) - RNR Retry Counter Exceeded: The RNR NAK retry count was exceeded. This usually means that the remote side didn't post any WR to its Receive Queue. Relevant for RC QPs.
  • IBV_WC_LOC_RDD_VIOL_ERR (14) - Local RDD Violation Error: The RDD associated with the QP does not match the RDD associated with the EE Context (unused, since it's relevant only to RD QPs or EE Context, which aren't supported).
  • IBV_WC_REM_INV_RD_REQ_ERR (15) - Remote Invalid RD Request: The responder detected an invalid incoming RD message. Causes include a Q_Key or RDD violation (unused, since it's relevant only to RD QPs or EE Context, which aren't supported).
  • IBV_WC_REM_ABORT_ERR (16) - Remote Aborted Error: For UD or UC QPs associated with a SRQ, the responder aborted the operation.
  • IBV_WC_INV_EECN_ERR (17) - Invalid EE Context Number: An invalid EE Context number was detected (unused, since it's relevant only to RD QPs or EE Context, which aren't supported).
  • IBV_WC_INV_EEC_STATE_ERR (18) - Invalid EE Context State Error: Operation is not legal for the specified EE Context state (unused, since it's relevant only to RD QPs or EE Context, which aren't supported).
  • IBV_WC_FATAL_ERR (19) - Fatal Error.
  • IBV_WC_RESP_TIMEOUT_ERR (20) - Response Timeout Error.
  • IBV_WC_GENERAL_ERR (21) - General Error: other error which isn't one of the above errors.
opcode The operation that the corresponding Work Request performed. This value controls the way that data was sent, the direction of the data flow and the valid attributes in the Work Completion. The value can be one of the following enumerated values:

  • IBV_WC_SEND - Send operation for a WR that was posted to the Send Queue
  • IBV_WC_RDMA_WRITE - RDMA Write operation for a WR that was posted to the Send Queue
  • IBV_WC_RDMA_READ - RDMA Read operation for a WR that was posted to the Send Queue
  • IBV_WC_COMP_SWAP - Compare and Swap operation for a WR that was posted to the Send Queue
  • IBV_WC_FETCH_ADD - Fetch and Add operation for a WR that was posted to the Send Queue
  • IBV_WC_BIND_MW - Memory Window bind operation for a WR that was posted to the Send Queue
  • IBV_WC_RECV - Send data operation for a WR that was posted to a Receive Queue (of a QP or to an SRQ)
  • IBV_WC_RECV_RDMA_WITH_IMM - RDMA with immediate for a WR that was posted to a Receive Queue (of a QP or to an SRQ). For this opcode, only a Receive Request was consumed and the sg_list of this RR wasn't used
vendor_err Vendor-specific error code that provides more information if the completion ended with an error. It gives the RDMA device's vendor a hint about the reason for the failure
byte_len The number of bytes transferred. Relevant in the Receive Queue for incoming Send or RDMA Write with immediate operations; this value doesn't include the length of the immediate data, if such exists. Relevant in the Send Queue for RDMA Read and Atomic operations.

For the Receive Queue of a UD QP that is not associated with an SRQ, or for an SRQ that is associated with a UD QP, this value equals the payload of the message plus the 40 bytes reserved for the GRH, whether or not the GRH is present.

imm_data (optional) A 32-bit number, in network order, that is sent along with the payload in a SEND or RDMA WRITE opcode with immediate. It is placed in the Receive Work Completion on the remote side and not in a remote memory buffer. This value is valid if the IBV_WC_WITH_IMM flag is set in wc_flags
qp_num Local QP number of completed WR. Relevant for Receive Work Completions that are associated with an SRQ
src_qp Source QP number (remote QP number) of completed WR. Relevant for Receive Work Completions of a UD QP
wc_flags Flags of the Work Completion. It is either 0 or the bitwise OR of one or more of the following flags:

  • IBV_WC_GRH - Indicator that a GRH is present for a Receive Work Completion of a UD QP. If this bit is set, the first 40 bytes of the buffers that were referred to in the Receive Request will contain the GRH of the incoming message. If this bit is cleared, the content of those first 40 bytes is undefined
  • IBV_WC_WITH_IMM - Indicator that imm_data is valid. Relevant for Receive Work Completions
pkey_index P_Key index. Relevant for GSI QPs
slid Source LID (the base LID that this message was sent from). Relevant for Receive Work Completions of a UD QP
sl Service Level (the SL that this message was sent with). Relevant for Receive Work Completions of a UD QP
dlid_path_bits Destination LID path bits. Relevant for Receive Work Completions of a UD QP (not applicable for multicast messages)
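
As noted for imm_data, the value is carried in network byte order and is meaningful only when IBV_WC_WITH_IMM is set in wc_flags. Below is a minimal sketch of reading it, assuming the sender filled the field with htonl(). The struct demo_wc, the constant DEMO_WC_WITH_IMM and the helper demo_get_imm() are hypothetical local stand-ins so that the fragment compiles without libibverbs; real code would use struct ibv_wc and IBV_WC_WITH_IMM from <infiniband/verbs.h>.

```c
#include <arpa/inet.h>  /* ntohl() */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two fields used here. The flag value
 * mirrors IBV_WC_WITH_IMM as defined in <infiniband/verbs.h>. */
enum { DEMO_WC_WITH_IMM = 1 << 1 };

struct demo_wc {
	uint32_t imm_data;  /* network byte order */
	int      wc_flags;
};

/* Store the immediate value in host byte order in *out and return true,
 * or return false when the completion carries no immediate data. */
static bool demo_get_imm(const struct demo_wc *wc, uint32_t *out)
{
	assert(wc != NULL && out != NULL);
	if (!(wc->wc_flags & DEMO_WC_WITH_IMM))
		return false;
	*out = ntohl(wc->imm_data);
	return true;
}
```

Checking the flag first matters because imm_data is undefined for completions that did not carry immediate data.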

The test (opcode & IBV_WC_RECV) indicates whether a completion is from the Receive Queue: the result is non-zero for Receive Work Completions.
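
A small sketch of that test follows. The enum demo_wc_opcode and the helper demo_is_recv_completion() are hypothetical names; the numeric values are local copies of the ones in <infiniband/verbs.h>, where both receive opcodes have bit 7 set. Since opcode is not among the attributes that are valid for a failed completion, the check is meaningful only when status is IBV_WC_SUCCESS.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the opcode values (assumption: they match the header
 * of the installed libibverbs; real code would include that header). */
enum demo_wc_opcode {
	DEMO_WC_SEND = 0,
	DEMO_WC_RDMA_WRITE,
	DEMO_WC_RDMA_READ,
	DEMO_WC_COMP_SWAP,
	DEMO_WC_FETCH_ADD,
	DEMO_WC_BIND_MW,
	DEMO_WC_RECV = 1 << 7,
	DEMO_WC_RECV_RDMA_WITH_IMM
};

/* Both receive opcodes must share the bit that the test relies on. */
static_assert((DEMO_WC_RECV_RDMA_WITH_IMM & DEMO_WC_RECV) != 0,
	      "receive opcodes share the RECV bit");

/* True when the completion belongs to a Receive Queue (or an SRQ). */
static bool demo_is_recv_completion(enum demo_wc_opcode opcode)
{
	return (opcode & DEMO_WC_RECV) != 0;
}
```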

For a Receive Work Completion of a UD QP, the data starts at offset 40 from the start of the posted receive buffer, whether the IBV_WC_GRH bit is set or not.
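
The offset rule can be sketched as a helper that turns a posted buffer and the reported byte_len into the actual payload. The names demo_payload, demo_ud_payload() and DEMO_GRH_BYTES are hypothetical; the sketch assumes a single contiguous receive buffer and that byte_len already includes the 40 reserved bytes, as described for byte_len above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_GRH_BYTES 40u  /* space reserved for the GRH in a UD receive */

struct demo_payload {
	const uint8_t *data;  /* first payload byte */
	uint32_t       len;   /* payload length in bytes */
};

/* buf is the start of the posted receive buffer; byte_len is the value
 * reported in the Work Completion, which counts the 40 reserved bytes. */
static struct demo_payload demo_ud_payload(const uint8_t *buf, uint32_t byte_len)
{
	assert(buf != NULL && byte_len >= DEMO_GRH_BYTES);
	struct demo_payload p = { buf + DEMO_GRH_BYTES, byte_len - DEMO_GRH_BYTES };
	return p;
}
```

The same arithmetic applies when posting the Receive Request: the buffer has to be 40 bytes larger than the biggest payload that is expected.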

Not all wc attributes are always valid. If the completion status is other than IBV_WC_SUCCESS, only the following attributes are valid:

  • wr_id
  • status
  • qp_num
  • vendor_err

Parameters

Name Direction Description
cq in Completion Queue that was returned from ibv_create_cq()
num_entries in Maximum number of Work Completions to read from the CQ
wc out Array of size num_entries of the Work Completions that will be read from the CQ

Return Values

Value Description
Positive Number of Work Completions that were read from the CQ and whose values were returned in wc. If this value is less than num_entries, there aren't any more Work Completions in the CQ. If this value equals num_entries, there may be more Work Completions in the CQ
0 The CQ is empty
Negative A failure occurred while trying to read Work Completions from the CQ
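
The three return cases suggest a batch loop: keep polling while a call fills the whole array, stop on a partial or empty batch, and bail out on a negative value. The sketch below is only an illustration of that control flow. To keep it compilable without libibverbs it takes a function pointer with the same calling convention as ibv_poll_cq(); demo_wc, demo_poll_fn, demo_drain(), DEMO_BATCH and the fake CQ are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_BATCH 16  /* arbitrary batch size chosen for this sketch */

/* Reduced stand-in for struct ibv_wc. */
struct demo_wc {
	unsigned long long wr_id;
	int status;  /* 0 plays the role of IBV_WC_SUCCESS */
};

/* Same calling convention as ibv_poll_cq(): the number of completions
 * written to wc, 0 when the CQ is empty, negative on failure. */
typedef int (*demo_poll_fn)(void *cq, int num_entries, struct demo_wc *wc);

/* Poll in batches until a batch comes back partially filled.
 * Returns the number of completions consumed, or -1 if polling failed;
 * *errors counts the completions whose status was not success. */
static long demo_drain(demo_poll_fn poll, void *cq, long *errors)
{
	struct demo_wc wc[DEMO_BATCH];
	long total = 0;
	int n;

	assert(poll != NULL && errors != NULL);
	*errors = 0;
	do {
		n = poll(cq, DEMO_BATCH, wc);
		if (n < 0)
			return -1;
		for (int i = 0; i < n; i++)
			if (wc[i].status != 0)
				(*errors)++;
		total += n;
	} while (n == DEMO_BATCH);  /* a full batch: more may be waiting */

	return total;
}

/* Fake CQ used only to exercise the loop; a negative count simulates a
 * polling failure. */
struct demo_fake_cq { int pending; };

static int demo_fake_poll(void *cq, int num_entries, struct demo_wc *wc)
{
	struct demo_fake_cq *f = cq;
	int n;

	if (f->pending < 0)
		return -1;
	n = f->pending < num_entries ? f->pending : num_entries;
	for (int i = 0; i < n; i++) {
		wc[i].wr_id = (unsigned long long)i;
		wc[i].status = 0;
	}
	f->pending -= n;
	return n;
}
```

With the real API, the wrapper passed as poll would simply forward to ibv_poll_cq() and copy the fields of interest.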

Examples

Poll a Work Completion from a CQ (in polling mode):

struct ibv_wc wc;
int num_comp;

/* busy-poll until one Work Completion is available or polling fails */
do {
	num_comp = ibv_poll_cq(cq, 1, &wc);
} while (num_comp == 0);

if (num_comp < 0) {
	fprintf(stderr, "ibv_poll_cq() failed\n");
	return -1;
}

/* verify the completion status */
if (wc.status != IBV_WC_SUCCESS) {
	/* wr_id is 64 bits wide; PRIu64 comes from <inttypes.h> */
	fprintf(stderr, "Failed status %s (%d) for wr_id %" PRIu64 "\n",
		ibv_wc_status_str(wc.status),
		wc.status, wc.wr_id);
	return -1;
}

FAQs

What is that Work Completion anyway?

Work Completion means that the corresponding Work Request has ended and its buffer can be (re)used for read, write or free.

Does ibv_poll_cq() cause a context switch?

No. Polling for Work Completions doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).

Is there a limit to the number of Work Completions that can be polled when calling ibv_poll_cq()?

No. One can read as many Work Completions as one wishes.

I called ibv_poll_cq() and it filled all of the array that I've provided to it. Can I know how many more Work Completions exist in the CQ?

No, you can't.

I got a Work Completion from the Receive Queue of a UD QP and it ended well. I read the data from the memory buffers and I got bad data. Why?

Maybe you looked at the data starting at offset 0. For any Receive Work Completion of a UD QP, the data is placed at offset 40 of the relevant memory buffers, no matter if a GRH was present or not.

What is this GRH and why do I need it?

The Global Routing Header (GRH) provides information that is most useful for sending a message back to the sender of this message if it came from a different subnet or from a multicast group.

I've got a completion with an error status. Can I read all of the Work Completion fields?

No. If the Work Completion status indicates that there is an error, only the following attributes are valid: wr_id, status, qp_num, and vendor_err. The rest of the attributes are undefined.

I read a Work Completion from the CQ and I don't need it. Can I return it to the CQ?

No, you can't.

Can I read a Work Completion that belongs to a specific Work Queue?

No, you can't.

What will happen if more Work Completions than the CQ size are added to it?

There will be a CQ overrun and the CQ (and all of the QPs that are associated with it) will move into the error state.


Comments


  1. Omar Khan says: September 20, 2013

    If I post an RDMA Send, how would I know that the receiving side has received the buffer? Does the entry in the Completion Queue of the sender indicate that the receiver has received the data, or does it only indicate that the sender can now reuse the buffer?

    Regards

    • Dotan Barak says: September 20, 2013

      Hi Omar.

      The question is: which QP transport type are you using?
      Assuming that the Work Completion ended successfully:

      • For Reliable QP (for example, RC): this means that the sent buffer was written at the receiver side.
      • For Unreliable QP: this means that the sent buffer can be reused, since the message was already sent.

      I hope that this answer helped you.

      Thanks
      Dotan

      • Alan says: April 4, 2014

        Hi,

        In your previous post, regarding the end of a successful Work Completion using RC RDMA Write, you said it means the send buffer was written at the receiver side. My question is: what does the "receiver side" mean? Does it mean the user memory at the remote side or the HCA on the remote side?

        I saw some posts that point out that a successful Work Completion for a RDMA Write doesn't mean user can read the data on the receiver buffer.

        Did I misunderstand something?

        Thanks.

        Alan

      • Dotan Barak says: April 4, 2014

        Hi Alan, this is a great question.

        The receiver side is the responder side (remote side).
        I mean that the data was received to the remote side HCA and in almost all cases was written to its memory.

        However, the remote side doesn't know that the RDMA Write was finished to its memory
        (it doesn't have any indication that RDMA Write was performed to its memory or that it was finished).

        Sure, it can inspect the memory and see that it was changed, but if the last byte was changed it doesn't necessarily mean that the whole buffer changed.

        I think that it is better to be cautious and wait until the remote side has a Work Completion on this QP. But I guess other methods can be used instead.

        Did I answer your question?

        Thanks
        Dotan

      • Alan says: April 4, 2014

        Hi Dotan,
        Thanks for the fast reply. It answers part of my question. In certain scenarios I cannot poll the CQ on the remote side, so there is no way for me to get and process the Work Completion on the remote side. I am not sure if doing something like the following would help:
        =============================

        RDMA_Write(big user data);
        RDMA_Read (last byte of the data from remote);
        wait Work Completion for both of them.
        RDMA_Write (flag);

        while (!flag) ;
        check data from sending side;
        =============================
        Please note that the two RDMA_Write operations may use different QPs. But the RDMA_Read will use the same QP as the first RDMA_Write.

        Thanks.

        Alan

      • Dotan Barak says: April 4, 2014

        Hi Alan.

        I'm sorry, but I didn't understand what you are doing;
        which operations are performed by each side, and on which QP.
        What is the reason that you try to write and then read the last byte of the data?

        Please note that there isn't any guarantee between messages from different queues.

        Thanks
        Dotan

      • Alan says: April 4, 2014

        Hi Dotan,

        I am sorry I didn't make it clear.

        What I want to know is that if the receiving of the Work Completion of a RDMA_Read which follows a RDMA_Write on the same QP would guarantee (or force) the data of the RDMA_Write being written into the remote memory.

        Thanks.

        Alan

      • Dotan Barak says: April 5, 2014

        The question is: why do you assume that the data will be written to memory after the RDMA Read was completed?
        (and why do you assume that it won't be written in the first place).

        Can you please send me the reference to the post that you are referring to?

        Thanks
        Dotan

      • Alan says: April 5, 2014

        Hi Dotan,

        Here is one of the links: http://lists.openfabrics.org/pipermail/general/2007-May/036615.html

        The other place is in the print outs we had for IB education years ago.

        Regards,

        Alan

      • Dotan Barak says: April 19, 2014

        Hi Alan.

        (I don't consider myself a PCI Express or computer architecture expert, so I hope that I'm not confusing you with this answer).

        As far as I understand, this question is a little bit tricky; since it isn't related to RDMA.

        The same problem can happen to you when you send data using Send opcode as well
        (and may happen in other network architectures that allow HW offloads,
        and in some cases even when using sockets).

        The data that you want to write to the memory *may* be different from the data that was actually written to memory because of bit flips or any other kind of error that may happen between the time that the data reached the remote side HW and the time that the data was written to the memory.

        Actually, this kind of error can happen when you are accessing local memory, without performing any data transfer at all.

        So, I think that this issue isn't related to RDMA.

        BTW, if you want to make sure that the same content was written you can add checksums to your data.

        Thanks
        Dotan

  2. Omar Khan says: January 23, 2014

    Hi
    i want to know one thing. if i get a "IBV_WC_RNR_RETRY_EXC_ERR" when I poll the completion queue, can i repoll the queue after a while or does my queue enter an error state and cannot be used any more.

    regards
    Omar

    • Dotan Barak says: January 23, 2014

      Hi Omar.

      You are polling a CQ for Completions. If you get a Completion with a bad status
      (e.g. "IBV_WC_RNR_RETRY_EXC_ERR"), the QP itself enters the error state and cannot be used.

      However, the CQ itself is still valid and fully functional; if this CQ is being used by several QPs,
      one/some of them may get into the error state and the rest of them can still be fully functional...

      I hope that I answered
      Dotan

  3. Omar Khan says: June 25, 2014

    Dear Dotan
    I want to know if it is necessary to poll the send completion queue after each ibv_post_send whether it's for RDMA WRITE OR normal send. Polling the send completion queue is time consuming and takes almost 10 microseconds on our cluster and if I do not poll the send completion queue, I overflow it after the maximum send queue counter set for the queue pairs. Is it possible that I do not generate a completion entry for send operation. Please share with me some code snippet where I set up the queue pairs such that for each entry added to the send queue no completion is generated.
    Hopefully I have made my point clear.

    Regards
    Omar Khan

    • Dotan Barak says: June 26, 2014

      Hi Omar.

      You don't have to poll the Send Completion Queue after every call to ibv_post_send();
      you can create the Queue Pair and specify that a Work Completion isn't needed for each Send Request:

      struct ibv_qp_init_attr attr = {
          .send_cq = ctx->cq,
          .recv_cq = ctx->cq,
          .cap = {
              .max_send_wr = 1,
              .max_recv_wr = rx_depth,
              .max_send_sge = 1,
              .max_recv_sge = 1
          },
          .qp_type = IBV_QPT_RC,
          .sq_sig_all = 0
      };

      When posting a Send Request(s), you need to specify the Send Requests that will generate the Work Completion
      (by setting the IBV_SEND_SIGNALED flag):

      struct ibv_send_wr wr = {
          .wr_id = PINGPONG_SEND_WRID,
          .sg_list = &list,
          .num_sge = 1,
          .opcode = IBV_WR_SEND,
          .send_flags = IBV_SEND_SIGNALED,
      };

      I hope that it helped.
      I guess that I'll write a post this weekend on selective signalling..

      Thanks
      Dotan

      • Omar Khan says: June 26, 2014

        Dear Dotan

        Thanks for your reply. I set send_flags = IBV_SEND_SIGNALED for those send requests for which a completion entry is required. What about those for which a completion entry in the CQ is not required? Do I set send_flags = 0?

      • Omar says: June 26, 2014

        Dear Dotan

        I have tried what you have said about setting .sq_sig_all = 0 and only using .send_flags = IBV_SEND_SIGNALED for those send requests which i need to signal. For those send requests whose completion notification is not required, I set .send_flags = 0. I have also set the .max_send_wr = 1 before creating the queues. But it does not work. If i set the .sq_sig_all = 1 and poll the send completion queue after every ibv_post_send, it works very well but i get a delay of several microseconds.
        Please help me out in this.

        Regards

      • Omar says: June 26, 2014

        Selective signalling works. All we need to do is signal one WR for every SQ-depth worth of WRs posted. For example, If the SQ depth is 16, we must signal at least one out of every 16. This ensures proper flow control for HW resources.
        Courtesy: section 8.2.1 of the iWARP Verbs draft http://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-8.2.1

        Regards

        Omar Khan
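
The rule quoted in the comment above (signal at least one Work Request out of every SQ-depth worth of posted requests) can be sketched as a small decision helper. The name demo_should_signal() and its parameters are hypothetical; the sketch only decides which request gets IBV_SEND_SIGNALED and assumes that the caller still polls the CQ for that completion before posting further requests.

```c
#include <assert.h>
#include <stdbool.h>

/* posted:   number of Send Requests already posted on this QP.
 * sq_depth: the max_send_wr value the QP was created with.
 * Returns true when the request about to be posted should carry
 * IBV_SEND_SIGNALED, so that one in every sq_depth requests is signaled. */
static bool demo_should_signal(unsigned long posted, unsigned sq_depth)
{
	assert(sq_depth > 0);
	return (posted + 1) % sq_depth == 0;
}
```

A caller would set send_flags to IBV_SEND_SIGNALED when the helper returns true and to 0 otherwise. With a Send Queue depth of 1, the helper returns true for every request, so nothing can be left unsignaled in that configuration.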

      • Dotan Barak says: June 26, 2014

        Hi Omar.

        I'm happy that it is working for you and thanks for the URL that you shared.

        Thanks
        Dotan

  4. Aunn Raza says: November 6, 2014

    Hi Dotan,
    What if the CQ has 2 entries, but I take only 1 entry with ibv_poll_cq? Will it generate another notification for the other one when I poll it again, or do I have to take both entries together?

    • Dotan Barak says: November 6, 2014

      Hi Aunn.

      The question is what do you mean by "notification".
      If you are talking about Completion Notification,
      then the next Work Completion that will be added to the CQ will generate Completion event
      (if you asked to get this notification in the first place).

      This notification will happen when a new Work Completion is added to the CQ,
      and it doesn't matter if the CQ is empty or not.

      I hope that I answered your question.

      Thanks
      Dotan

  5. Valentin Petrov says: December 16, 2014

    Hi, Dotan, could you possibly give a hint (maybe somewhere in the literature) on how to organize flow control when a single RCQ (recv completion queue) is shared among multiple QPs? The issue I have is the following. I maintain the necessary level of pre-posted recv WRs in all QPs so that there are no dropped packets. This is easy to do on a per-connection (per QP) basis since everybody knows how many recvs are preposted on the other side. But the shared RCQ can easily overflow if its depth < N*num_preposted (N - number of connections). I believe there should be a "gold/commonly_adopted" algorithm for this scenario. Can you suggest anything here?

    • Dotan Barak says: December 18, 2014

      Sorry, there isn't any such algorithm that I'm aware of..
      If you develop one, it will be great if you share it.
      :)

      You need to be careful not to overflow the CQ, and if needed work with several CQs;
      if you have X QPs and every QP may get Y Work Completions, make sure that the CQ size is bigger than X * Y.

      If there can be a case where the CQ won't be big enough, you should use multiple CQs.
      Working with Completion Events and an event channel that handles multiple CQs can be useful too.

      Dotan
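
The sizing rule from the reply above (X QPs, each of which may produce Y Work Completions, need a CQ bigger than X * Y) can be written down as a small helper. The name demo_min_cq_entries() is hypothetical, and the "+ 1" follows the wording "bigger than" literally. The result is returned as an int because ibv_create_cq() takes the CQ size as an int; it would still have to be compared against the device limit (max_cqe as reported by ibv_query_device()).

```c
#include <assert.h>
#include <limits.h>

/* Smallest CQ size that is strictly bigger than num_qps * wc_per_qp,
 * or -1 if that size does not fit in an int. */
static int demo_min_cq_entries(unsigned num_qps, unsigned wc_per_qp)
{
	assert(num_qps > 0 && wc_per_qp > 0);
	/* 64-bit arithmetic so that the product cannot wrap around */
	unsigned long long need = (unsigned long long)num_qps * wc_per_qp + 1;
	return need > INT_MAX ? -1 : (int)need;
}
```

When the helper returns -1, or the value exceeds what the device supports, the QPs would have to be spread over several CQs, as suggested in the reply.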

  6. Starichok says: December 25, 2014

    Hi! Please, help me!
    I can't get any events at the receiving side.
    Although I see from the debugger that the contents of the receive buffer have changed. On the server side ibv_poll_cq always returns 0. If I use ibv_get_cq_event, then the program blocks forever.
    Pseudocode:
    - Client side:
    - ibv_post_send() with IBV_SEND_SIGNALED and opcode=IBV_WR_RDMA_WRITE;
    - ibv_poll_cq;
    - Server side:
    - ibv_poll_cq;

    Trying .sq_sig_all = 0 and .sq_sig_all = 1, but the result on server side is the same.
    What am I doing wrong?

    • Dotan Barak says: December 25, 2014

      Hi.

      Let me try to understand what is going on:
      In the client side, you post a Send Request with an RDMA operation,
      and poll for a Work Completion (i.e. poll_cq returns a value which isn't 0, and fills a Work Completion structure).

      However, in the server side you don't get any completion at all - right?

      Since you are using RDMA Write, you shouldn't get any Work Completion in the receiver side at all
      (this is the whole idea of RDMA).

      If you want to get a Work Completion in the receiver side, I suggest that you'll:
      1) post a Receive Request at the server side
      2) Use RDMA Write with immediate, which will consume the Receive Request in the receiver side and generate a Work Completion.

      I hope that this helped you.

      Thanks
      Dotan

      • Starichok says: December 26, 2014

        Thank you very much!!! Today did as you said - it all worked perfectly!!!

        "Since you are using RDMA Write, you shouldn't get any Work Completion in the receiver side at all
        (this is the whole idea of RDMA)."
        Sorry for the boring, but
        how, then, can be found on the remote side that its buffer data were recorded, in addition to my case and the TCP/IP socket?

      • Starichok says: December 26, 2014

        or so - what is the best way to learn about it

      • Dotan Barak says: December 28, 2014

        Hi.

        I didn't really understand the question here.
        But I'll try to explain what I think you meant:
        The sender side performs an RDMA Write to the receiver's memory,
        and it should hint the receiver that its memory was changed.

        This can be done by sending Send or RDMA Write with immediate operations.
        One may ask: what is this good for?
        Well, the sender can issue several RDMA Writes to the receiver's memory and hint the receiver only once about all the written memory buffers.

        This blog is a good place to start learning RDMA from.
        Currently, there isn't any "Getting started" post, but I guess that I'll write one in the (near?) future.

        Thanks
        Dotan

      • Starichok says: December 30, 2014

        Hi!
        Thank you very much!!!
        I have achieved transfer rate by 65 KB (interface QDR) about 8 Gbit/s using one QP and four buffers !!!
        Happy New Year !!!

      • Dotan Barak says: December 31, 2014

        Nice...
        (This is a very good start)

        Happy new year
        Dotan

  7. Parthiban says: January 2, 2015

    Hi Dotan,
    Happy New Year!!

    I'm trying RDMA transfer between two nodes and I observe no work completion WU in the queue. The same application works between two adjacent nodes but when i try to run across the network nodes i observe the above mentioned error.
    Then i checked the ibv_rc_pingpong or ibping test, i see the remote address are shared but the transfer didn't happen. But the normal ping to remote node is working fine.

    Thanks,
    Parthiban

    • Dotan Barak says: January 2, 2015

      Hi Parthiban.

      I need some more information:
      Which transport are you using (InfiniBand, RoCE, iWARP)?
      Can you send me the output of ibv_devinfo?

      Thanks
      Dotan

  8. Parthiban says: January 2, 2015

    Hi Dotan,
    Thanks for the reply. I'm using InfiniBand.

    system 1:
    hca_id: mlx4_1
    transport: InfiniBand (0)
    fw_ver: 2.10.630
    node_guid: 0025:90ff:ff17:0448
    sys_image_guid: 0025:90ff:ff17:044b
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: SM_2191000001000
    phys_port_cnt: 1
    port: 1
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 31
    port_lid: 4
    port_lmc: 0x00
    link_layer: IB

    hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.32.5100
    node_guid: f452:1403:008c:3d80
    sys_image_guid: f452:1403:008c:3d83
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: MT_1090120019
    phys_port_cnt: 2
    port: 1
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 1
    port_lid: 1
    port_lmc: 0x00
    link_layer: IB

    port: 2
    state: PORT_DOWN (1)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 0
    port_lid: 0
    port_lmc: 0x00
    link_layer: IB
    System 2:
    hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.32.5100
    node_guid: f452:1403:008e:e9b0
    sys_image_guid: f452:1403:008e:e9b3
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: MT_1090120019
    phys_port_cnt: 2
    port: 1
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 19
    port_lid: 1
    port_lmc: 0x00
    link_layer: IB

    port: 2
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 1
    port_lid: 24
    port_lmc: 0x00
    link_layer: IB

    • Dotan Barak says: January 2, 2015

      Hi.

      Which IB port did you try to work with, 1 or 2?

      (Since I think that port 1 of the devices isn't managed by the same SM)

      Thanks
      Dotan


  9. Parthiban says: January 2, 2015

    Yes you are right! there are again two separate IB networks the systems are connected to. I use port 2. one more doubt! if the two ports are connected to different IB network and the same system is configured to run the SM for the two network, will it work properly for both the networks?

    Thanks,

    • Dotan Barak says: January 2, 2015

      Hi.

      If you use the same SM for two networks, it becomes one subnet.

      If you have two subnets (for example, all port 1 in one subnet and all port 2 in the second one), port 1 on different machines will be able to communicate (the same goes for port 2).

      Thanks
      Dotan

  10. Parthiban says: January 3, 2015

    Hi Dotan,
    I see that

    system001:~ # ibv_rc_pingpong
    local address: LID 0x001b, QPN 0x340049, PSN 0xa06196, GID ::
    remote address: LID 0x0001, QPN 0x60004a, PSN 0xbb7261, GID ::

    system002:~ # ibv_rc_pingpong 192.168.96.101
    local address: LID 0x0001, QPN 0x60004a, PSN 0xbb7261, GID ::
    remote address: LID 0x001b, QPN 0x340049, PSN 0xa06196, GID ::
    Failed status transport retry counter exceeded (12) for wr_id 2

    and

    system001:~ # ibping -S -d -v
    ibdebug: [12314] ibping_serv: starting to serve...

    system002:~ # ibping -d -v 14
    ibdebug: [6738] ibping: Ping..
    ibwarn: [6738] ib_vendor_call_via: route Lid 14 data 0x7fff4c7b8c10
    ibwarn: [6738] ib_vendor_call_via: class 0x132 method 0x1 attr 0x0 mod 0x0 datasz 216 off 40 res_ex 1
    ibwarn: [6738] mad_rpc_rmpp: rmpp (nil) data 0x7fff4c7b8c10
    ibwarn: [6738] mad_rpc_rmpp: MAD completed with error status 0xc; dport (Lid 14)
    ibdebug: [6738] main: ibping to Lid 14 failed

    not able to figure out the reason.

    Thanks
    Parthiban

    • Dotan Barak says: January 4, 2015

      Hi.

      First of all, in system001, ibv_rc_pingpong prints that the local LID is 0x1b (27 decimal),
      but when you executed ibping you used LID 14.

      The above failure in ibv_rc_pingpong suggests that there is a connectivity problem in your subnet.
      Are they both in the same subnet now?

      Thanks
      Dotan

      • Parthiban says: January 4, 2015

        Hi Dotan,
        Yes, both systems are in the same network. If I execute a normal ping, it works fine. Another scenario: if I run the RDMA sample application that uses RDMA CM, the application works fine, but if I use IB verbs it fails with "completion wasn't found in the CQ" and "poll completion failed".
        Thanks

  11. Parthiban says: January 5, 2015

    Hi Dotan,
    The issue is fixed. The bug was that the program scans the interfaces and tries to use the first interface it finds, but that interface is not connected to the same subnet. Now I pass the interface to use and it works!
    Thanks,
    Parthiban

    • Dotan Barak says: January 5, 2015

      Great!

      Thanks for updating me.

      Dotan

    • yuzhen says: September 13, 2016

      Hi Parthiban,

      I also tried to run the example provided using IB verbs, but it failed with the same error as yours: "completion wasn't found in the CQ after time out. poll completion failed".

      Do you have any suggestions?

      Thanks

      • Dotan Barak says: September 16, 2016

        Hi.

        Which example did you try to use?
        What is the exact command line and the output that you got?

        Thanks
        Dotan

  12. Anonymous says: January 14, 2015

    Hi Dotan!
    I use the QDR device. How do I use all 4 tires? Experimentally, I found that all clients use a single bus :(. If I run one client the maximum transmission speed is 10 GB/s, if I run 4 client, then the total transfer speed is equal to 10 GB/s, and each client can transmit at 2.5 GB/s...
    How can I fill the entire bandwidth, i.e., 40 GB/s???
    Thanks!

    • Dotan Barak says: January 14, 2015

      Hi.

      QDR means that the speed of the line is 4 times faster than the base speed.
      Base speed: SDR is 2.5 Gb/s.

      Please execute 'ibstat | grep Rate' to get the maximum supported BW for your adapter.
      (assuming that you are using InfiniBand)

      Thanks
      Dotan
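
As a rough illustration of the arithmetic behind this answer, the sketch below multiplies the per-lane signaling rate by the link width. The function and macro names are made up for this example. The 8b/10b factor is the encoding used by SDR, DDR and QDR links, which is why a 4x QDR link that signals at 40 Gb/s carries at most 32 Gb/s of data.

```c
#include <assert.h>

/* Per-lane signaling rates in Gb/s (SDR is the base speed, QDR is 4x that). */
#define IB_SDR_LANE_GBPS 2.5
#define IB_QDR_LANE_GBPS 10.0

/* Raw signaling rate of a link: per-lane rate times link width (1, 4, 8 or 12 lanes). */
double ib_signal_rate_gbps(double lane_gbps, int width)
{
    assert(width > 0);
    return lane_gbps * width;
}

/* With 8b/10b encoding only 8 out of every 10 bits on the wire carry data. */
double ib_data_rate_gbps(double lane_gbps, int width)
{
    return ib_signal_rate_gbps(lane_gbps, width) * 8.0 / 10.0;
}
```

For a 4x QDR port, `ib_signal_rate_gbps(IB_QDR_LANE_GBPS, 4)` gives the 40 Gb/s figure from the question, and `ib_data_rate_gbps()` gives the upper bound for payload. The rate that `ibstat` reports for the actual adapter is still the number to trust.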

  13. Floaterions says: February 17, 2015

    Hello Dotan,

    When a program is waiting at ibv_poll_cq(), does it consume CPU, or does it go idle and wait for an event to wake it up? I'm asking this because I'm now facing a design choice where I can end up with hundreds of threads (more than CPU cores), each polling on a separate QP for messages, and I was wondering if the waiting threads actually incur any cost to the system.
    Thank you for your help

    • Dotan Barak says: February 17, 2015

      Hi Floaterions.

      When ibv_poll_cq() is called, it consumes CPU (i.e. polling).

      If you want to reduce the CPU consumption (and latency isn't an issue),
      it is preferred to work with Completion events.

      Thanks
      Dotan

  14. DjvuLee says: March 9, 2015

    Hi, Dotan!

    I have a question: I want to see what happens if no Receive Request is ready on the receiver node (i.e. RNR).

    So I post just one Receive Request on the receiver node, and the sender sends several Send Requests in a loop. I expect an IBV_WC_RNR_RETRY_EXC_ERR error in the second iteration.

    The first iteration went just as I expected: the receiver received the message and consumed the Receive Request. However, in the second iteration, the receiver got an event (ibv_get_cq_event), but the following ibv_poll_cq returned zero, and it blocked in ibv_get_cq_event again.

    This seems impossible, because there was an event notification from the completion queue, yet the poll got nothing. How did this happen?

    • DjvuLee says: March 9, 2015

      Oh, I am sorry, Dotan. There was a mistake in my last post.

      Every Send Request was signaled, and it is the sender that got an event notification from ibv_get_cq_event but got 0 from ibv_poll_cq.

      The receiver just blocks in ibv_get_cq_event, and no error message is thrown.

      • Dotan Barak says: March 10, 2015

        Hi.

        Yes, in RDMA you may get a Completion Event without finding a Work Completion in the Completion Queue
        (I wrote about it in my posts).

        Some questions:
        * Are you using Reliable transport types for the Queue Pair?
        * If you switch to polling instead of using events, do you still have a problem?
        * Do you check the status of the Work Completions (on both sides)?
        * What are the values of the following attributes: min_rnr_timer, rnr_retry, timeout, retry_cnt?

        Thanks
        Dotan

  15. DjvuLee says: March 11, 2015

    Thanks very much! I will search your blog to see this.
    I use the RC. I modify my code later, and there is some mess, so I have to restore my code and check this status later.

  16. DjvuLee says: March 12, 2015

    Hi Dotan! I have a question about setting up connections concurrently.

    Suppose I have a server that accepts a lot of clients.

    During connection setup, suppose we get an RDMA_CM_EVENT_CONNECT_REQUEST event from one client, then an RDMA_CM_EVENT_CONNECT_REQUEST from another client, and then an RDMA_CM_EVENT_ESTABLISHED event.

    We use the same event channel, and we cannot get the connection id when we get the RDMA_CM_EVENT_ESTABLISHED event, so which client got established?

    I thought RDMA might handle it another way: once we get an RDMA_CM_EVENT_CONNECT_REQUEST event, we reject connection requests from other clients until we get the RDMA_CM_EVENT_ESTABLISHED for the former client. But what happens if the server never gets RDMA_CM_EVENT_ESTABLISHED for this client? Other clients would be rejected forever.

    Or should we use a different event channel for each client? That doesn't seem like a good way.

    I wrote a program that uses the main thread for the connection setup from RDMA_CM_EVENT_CONNECT_REQUEST to RDMA_CM_EVENT_ESTABLISHED. After the RDMA_CM_EVENT_ESTABLISHED event, we dispatch the established connection to another thread and use the main thread to accept new connections. But when several clients connect to the server simultaneously, only one gets serviced and the others are rejected. In TCP/IP, this is the right way to handle concurrent connections.

    I also wonder how to tell which client disconnected when we receive RDMA_CM_EVENT_DISCONNECTED, since we cannot get the connection id from the event.

    I have little RDMA programming experience, so I hope this question isn't too stupid.

    • Dotan Barak says: March 17, 2015

      Hi.

      I'm sorry, but I don't consider myself an expert (yet) in programming over librdmacm.
      There is an example in the rdmacm git repository, called rping.

      This example has a persistent mode, and I think that all your questions will be answered from this example.
      Please pay attention to the function rping_run_persistent_server().

      If you care about specific clients, maybe you can use the private_data field to exchange important information about the remote identity.

      I hope that this helps you
      Dotan

      • DjvuLee says: March 21, 2015

        Thanks Dotan.

        I know how to deal with this now. I just use one thread to listen for events, use the connectionId to relate different events, and dispatch the connection to the thread pool.

      • Dotan Barak says: March 29, 2015

        Cool.

        Thanks for the update
        Dotan

  17. Avis says: July 17, 2015

    Hi Dotan,
    I see a behavior where a completion event (for receive) is triggered, but when I poll the CQ (ib_poll_cq), it returns 0 work completions. Why would a completion event be generated when there are no work completions? Is this normal behavior? If not, where do you suspect the problem could be?

    • Dotan Barak says: July 17, 2015

      Hi.

      Yes. Completion events can be triggered even if there isn't any Work Completion in the Completion Queue.
      This can happen if you armed the CQ and then emptied the CQ (thus polling the Work Completion that triggered the event) before reading the event. When you read the event and check the CQ, you may find that the CQ is empty.

      I believe that if you check, you'll find that all the Work Completions were read from the CQ before you got this event with an empty CQ.

      Thanks
      Dotan
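
The sequence described in this reply can be replayed with a toy model. This is only a sketch: `struct toy_cq` and the `toy_*` functions are invented for illustration and merely imitate the roles of ibv_req_notify_cq(), ibv_poll_cq() and ibv_get_cq_event(); no RDMA device or libibverbs call is involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are illustrative, not libibverbs types. */
struct toy_cq {
    int  completions;   /* Work Completions waiting in the CQ */
    int  events;        /* completion events waiting in the event channel */
    bool armed;         /* true after a notification was requested */
};

/* Plays the role of ibv_req_notify_cq(): the next completion raises one event. */
void toy_arm(struct toy_cq *cq) { cq->armed = true; }

/* Plays the role of the device adding a Work Completion to the CQ. */
void toy_complete(struct toy_cq *cq)
{
    assert(cq != NULL);
    cq->completions++;
    if (cq->armed) {            /* an armed CQ fires once, then is disarmed */
        cq->events++;
        cq->armed = false;
    }
}

/* Plays the role of polling until the CQ is empty; returns how many were read. */
int toy_drain(struct toy_cq *cq)
{
    int n = cq->completions;
    cq->completions = 0;
    return n;
}

/* Non-blocking stand-in for ibv_get_cq_event(); true if an event was read. */
bool toy_get_event(struct toy_cq *cq)
{
    if (cq->events == 0)
        return false;
    cq->events--;
    return true;
}
```

Arming, completing once, draining, and only then reading the event leaves an event whose Work Completion is already gone, so the drain that follows the event returns 0. In this model that is expected behavior, not a lost completion.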

      • Avis says: July 17, 2015

        Thank you.

  18. Anon says: July 22, 2015

    Hi Dotan,

    I am trying to create an example of a one sided RDMA READ off the rc_pingpong.c sample from the ibverbs code.
    What I have changed is:
    1. When creating the memory regions, allow remote reads through the IBV_ACCESS_REMOTE_READ flag.
    2. The pp_post_send function to use IBV_WR_RDMA_READ as the opcode.
    3. Removed all calls to pp_post_recv.
    4. Changed the main while loop, so that the server and the client both poll the cq. Once an event has happened, they exit. Particularly, the server will keep running, and the client exits after it does one run of pp_post_send.

    The issue that I am seeing is that on the client side, the work completion returns code IBV_WC_REM_INV_REQ_ERR.
    Do you know why this might be? It seems that the qp_access_flags is not used anymore (? or at least when I try and set them, they don't get modified) and the buffers in the pingpong context are still the same 4KB page size. With the permissions set on the memory regions, I am not sure what else is going wrong?

    Thanks for any help

    • Dotan Barak says: July 22, 2015

      Hi Anon.

      Did you enable RDMA_READ in qp_attr.qp_access_flags?

      Thanks
      Dotan

      • Anon says: July 23, 2015

        Hi Dotan,

        I eventually figured it out. I was setting the qp_access_flags to allow IBV_ACCESS_REMOTE_READ.
        The issue is that I misinterpreted what max_dest_rd_atomic and max_rd_atomic fields were used for -- I thought it was only for remote atomic operations. As such, I set them to 0. So when I tried to modify the QP state machine to RTR, the access flags simply didn't update.

        Thanks for the help.

      • Dotan Barak says: July 23, 2015

        I'm glad everything is working for you
        :)

        Dotan

  19. Adrian says: September 14, 2015

    Hello Dotan,

    I am trying to figure out what the maximum number of scatter/gather entries I can use per one Work Request is.
    I have read the FAQ on the ibv_create_qp page, however, I am not seeing the failure when I am trying to create the QP.
    What I have is:
    1. ibv_query_device returns a max_sge value of 32.
    2. I use this in the max_send_sge field of the ibv_qp_cap struct in the ibv_qp_init_attr struct used to create the QP. When the function returns, the value in max_send_sge is updated to 62, to my surprise (I am not sure why...)
    3. I then attempt an RDMA READ with an sg_list of length 32, and 31. Each scatter/gather entry has length 1 (i.e. I am reading only one byte from the remote buffer into the local one for each entry). Both of these return IBV_WC_LOC_LEN_ERR as the completion.
    4. If I use an sg_list of length 30, everything seems to work.

    Do you know why:
    a) ibv_create_qp modifies the max_send_sge to be larger than the max_sge value returned from ibv_query_device?
    b) The max_sge value seems to be too large, even though creating the QP with that value set in the init attributes returns with no error?

    Thanks in advance.

    • Dotan Barak says: September 15, 2015

      Hi Adrian.

      The problem is that ibv_query_device() provides one max_sge value for all transport types and for all Work Queues (both Send and Receive),
      and sometimes this just isn't enough.

      I suspect that there is a bug in the low-level driver and you should use the latest version of it,
      and inform the low-level driver provider if this still happens.

      Thanks
      Dotan

  20. Anonymous says: September 14, 2015

    Hi Dotan,
    When posting a signaled RDMA Send from the server side, I receive no WC at the client side and the kernel hangs, even though I have some outstanding Receive Work Requests posted on the client side. Can you tell me what the reason could be?

    • Dotan Barak says: September 15, 2015

      Hi.

      Signaled RDMA Sends are relevant only to the local side (the remote side isn't aware of the signaling mode).
      Is this the first message? (maybe the QPs weren't connected correctly).

      Thanks
      Dotan

  21. Long says: February 4, 2016

    Hi Dotan,

    Thanks for your web site that provides a lot of useful information about RDMA and Infiniband.

    I want to use an example program provided by Tarick Bedeir (https://thegeekinthecorner.wordpress.com/) for setting up an RDMA connection (using RDMA_CM) between two machines and then call
    ibv_post_send()/ibv_post_recv() to send/receive data. Setting up the RDMA connection works fine. However, ibv_post_send() fails on the first attempt to send (I get error IBV_WC_RETRY_EXC_ERR (12)).

    Your article on ibv_poll_cq says that "this mean that the connection attributes are wrong or the remote side isn't in a state that it can respond to messages". However, this example program sets the retry_count parameter to 7 (infinite retry). Further, the program uses rdma_create_qp() to create the queue pairs and the RDMA programming manual says that "QPs allocated to an rdma_cm_id are automatically transitioned by the librdmacm through their states. After being allocated, the QP will be ready to handle posting of receives."

    I wonder what goes wrong and how I can fix it. The example code is available at
    https://github.com/tarickb/the-geek-in-the-corner/tree/master/02_read-write .

    I would be grateful if you have any suggestion for me.

    Thanks,
    Long

    • Dotan Barak says: February 5, 2016

      Hi Long.

      I didn't have a chance to play with this example (yet).
      7 is an infinite value only for rnr_retry.
      For retry_cnt, 7 means exactly seven retries.

      I would suggest executing ibv_rc_pingpong and rping,
      to check that your fabric is configured and functioning correctly.

      Thanks
      Dotan
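
The difference between the two retry attributes can be written down as a small sketch. The helper names are hypothetical. The timeout formula (4.096 microseconds times 2 to the power of the `timeout` attribute, with 0 meaning no timeout) is the usual definition of the local ACK timeout, added here because it works together with retry_cnt to decide how long it takes before IBV_WC_RETRY_EXC_ERR is reported.

```c
#include <assert.h>
#include <stdbool.h>

/* rnr_retry is a 3-bit value where 7 is special: retry without limit. */
bool rnr_retry_is_infinite(unsigned rnr_retry)
{
    assert(rnr_retry <= 7);
    return rnr_retry == 7;
}

/* retry_cnt is a plain 3-bit count: 7 means seven retries and nothing more. */
unsigned retry_cnt_retries(unsigned retry_cnt)
{
    assert(retry_cnt <= 7);
    return retry_cnt;
}

/*
 * Local ACK timeout in microseconds: 4.096 usec * 2^timeout.
 * A timeout of 0 means "wait forever"; this sketch reports that case as 0.0.
 */
double ack_timeout_usec(unsigned timeout)
{
    assert(timeout <= 31);
    if (timeout == 0)
        return 0.0;
    return 4.096 * (double)(1UL << timeout);
}
```

For example, `ack_timeout_usec(14)` is roughly 67 milliseconds per attempt. The time a real device waits before giving up is implementation dependent, so treat the result as an order of magnitude.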

  22. liuyu says: March 15, 2016

    Hi Dotan,

    Thanks for your help!

    I am very confused about programming with verbs. I want to use RDMA Read or RDMA Write for handling IO, but I get the error IBV_WC_REM_INV_REQ_ERR(9) at the sender side. I have checked the memory and didn't find anything wrong. I paste some code here; could you give me some suggestions?

    //create qp
    RCPRINT("client creating qp\n");
    qp_attr.cap.max_send_wr = MAX_WR;
    qp_attr.cap.max_send_sge = 1;
    qp_attr.cap.max_recv_wr = MAX_WR;
    qp_attr.cap.max_recv_sge = 1;
    qp_attr.send_cq = send_cq;
    qp_attr.recv_cq = recv_cq;
    qp_attr.qp_type = IBV_QPT_RC;
    err = rdma_create_qp(cm_id, pd, &qp_attr);
    if (err)
    {
        RCPRINT_ERROR("client create qp fail\n");
        clientDestroyRdmaObj(connection);
        return 1;
    }

    //rdma write or read
    memset(&send_wr, 0, sizeof(send_wr));
    send_wr.wr_id = (uint64_t)sge;
    send_wr.sg_list = sge;
    send_wr.num_sge = 1;
    send_wr.opcode = (opCode == CTRL_READ) ? IBV_WR_RDMA_READ : IBV_WR_RDMA_WRITE;
    send_wr.send_flags = IBV_SEND_SIGNALED;
    send_wr.wr.rdma.remote_addr = remoteRdma->remote_addr;
    send_wr.wr.rdma.rkey = remoteRdma->rkey;

    if (ibv_post_send(connection->cm_id->qp, &send_wr, &bad_wr))
    {
        RCPRINT_ERROR("server send rdma opt(%d) fail\n", opCode);
        return RETURN_ERROR;
    }

    • liuyu says: March 17, 2016

      Hi Dotan,

      Today I wrote a test program. I found that the client could post IBV_WR_RDMA_WRITE or IBV_WR_RDMA_READ successfully, but the server could only post IBV_WR_RDMA_WRITE successfully. When the server posts a send with opcode IBV_WR_RDMA_READ, it gets the error IBV_WC_REM_INV_REQ_ERR(9) from a successful ibv_poll_cq call, and wc.opcode changes to IBV_WC_SEND. In the test program, I just send one IBV_WR_RDMA_READ. Could you give me some suggestions?

      • Dotan Barak says: March 22, 2016

        Hi.

        I think that the problem is with the permissions of the QP or MR on the client side.
        (RDMA Read isn't enabled)

        Thanks
        Dotan

  23. liuyu says: March 29, 2016

    Thanks very much for your help! I have resolved the problem. When the server called rdma_accept, I did not assign a value to the initiator_depth member of struct rdma_conn_param, which was zero by default. So the max_rd_atomic member of struct ibv_qp_attr was zero too, and the server could never send an RDMA_READ operation.
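
The condition that bit this setup can be expressed as a quick sanity check. This is a sketch under assumptions: `struct rd_atomic_cfg` and `rdma_read_usable()` are hypothetical names, and the values would come from querying each QP (or from initiator_depth and responder_resources when librdmacm configures the QP). The rule itself, that the initiator's max_rd_atomic must be non-zero and no larger than the responder's max_dest_rd_atomic, is the one discussed in these comments.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the two QP attributes discussed in this thread. */
struct rd_atomic_cfg {
    uint8_t max_rd_atomic;      /* outstanding RDMA Reads/atomics this QP may initiate */
    uint8_t max_dest_rd_atomic; /* incoming RDMA Reads/atomics this QP can serve */
};

/* True if 'initiator' is configured so that it can post RDMA Reads to 'responder'. */
bool rdma_read_usable(const struct rd_atomic_cfg *initiator,
                      const struct rd_atomic_cfg *responder)
{
    assert(initiator != NULL && responder != NULL);
    if (initiator->max_rd_atomic == 0)
        return false;   /* zero means this side cannot initiate RDMA Reads at all */
    return initiator->max_rd_atomic <= responder->max_dest_rd_atomic;
}
```

In the scenario above the server would fail this check as the initiator (max_rd_atomic is 0) while the client passes it, which matches the observation that only the server's RDMA Reads failed.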

  24. tamlok says: April 13, 2016

    I posted a Receive Request and called ibv_poll_cq() to see if we received anything. However, after I suddenly kill the program that is supposed to send something to the receiver, ibv_poll_cq() called by the receiver still keeps returning 0.
    So I am confused that ibv_poll_cq() doesn't return a negative value even after the connection has been disconnected.
    Do you have any ideas?
    Thanks very much!

    • Dotan Barak says: April 19, 2016

      Hi.

      A negative return value from ibv_poll_cq() means that there is an error in the CQ.
      The QP doesn't know that the remote side is dead ...
      (unless CM is used, and there is a DISCONNECT indication)

      If needed, you can add "keep alive" messages to your application
      (for example: RDMA Write with 0 bytes - if you are using RC).

      Thanks
      Dotan

      • tamlok says: April 19, 2016

        Thanks very much! Do you know how much the cost of ibv_poll_cq() is? Will it be expensive if I keep calling ibv_poll_cq() frequently? Will it consult the hardware register or just memory?

      • Dotan Barak says: April 19, 2016

        Hi.

        It is hard to answer, since it is device specific.

        In Mellanox devices (for example), ibv_poll_cq() accesses memory, which is relatively cheap
        (no context switch or any expensive operation).

        I can't say for other devices...

        Thanks
        Dotan

      • tamlok says: April 19, 2016

        Impressive! Thanks very much!

  25. Junhyun says: October 8, 2016

    Hi Dotan, when exactly is the buffer posted for recv WR updated?
    Is it updated when I call ibv_poll_cq or whenever the device accepts a next incoming send WR?
    For instance, if I posted 10 Recv WRs on a same buffer, and in other host I post 2 Send WRs, will the first Send contents be overwritten? or can I get the contents of the first Send by polling just the first Recv WR out of the CQ?

    • Dotan Barak says: October 11, 2016

      Hi.

      The Receive WR buffer(s) are filled when the incoming message arrives and the Receive Request is fetched.
      Once all the message is filled to the buffer, a Work Completion is enqueued to the CQ.

      In your example, the first message content will be overwritten by the second message.

      Thanks
      Dotan

      • Wang says: February 28, 2017

        Hi!

        Thanks for your web site.

        I used send/recv to transfer 50 bytes of data over an RC QP. The receiver polled a CQE whose byte_len is exactly 50 and whose status is IBV_WC_SUCCESS, but I cannot find the data in the buffer pointed to by the pre-posted RR. I am very confused about what's going on...

      • Dotan Barak says: July 21, 2017

        Hi.

        I believe that either:
        1) There is a bug in the code
        2) The program overwrites the buffer (using multiple Receive Requests that point to the same location, or writing directly to the buffer)

        Thanks
        Dotan

  26. vineeth says: October 27, 2016

    I am working on an NVMe over Fabrics project (RDMA). On the host side, an RDMA read fails with status 5, i.e. the QP moves to the error state. Can you please tell me why my QP moves to the error state on the host side during an RDMA read?

    • Dotan Barak says: November 4, 2016

      Hi.

      Work Completion with status 0x5 means: IBV_WC_WR_FLUSH_ERR.
      * Is this the first completion with error?
      * Is the QP already in error?
      * Is there an asynchronous event on the remote side?

      Thanks
      Dotan

  27. shilvea says: February 20, 2017

    Hi!
    In the case of an SRQ, is poll_cq not used? I can't understand how I can call it if the input parameter is a CQ, but I didn't create a CQ, only an SRQ.

    • Dotan Barak says: February 22, 2017

      Hi.

      The SRQ can't be used by itself;
      it is used by the QP(s) to hold the Receive Requests.

      The corresponding Work Completion of that Receive Request is enqueued to the CQ that is associated with the Receive Queue of the QP
      that handled the incoming message.

      Thanks
      Dotan

  28. Vasilis.G says: April 21, 2017

    Hello Dotan,

    My understanding is this: When receiving an incoming UD Send, the gather region will contain the payload offset by the 40-byte grh. But the grh will only be valid if the sender is in a different subnet or the message is part of a multicast (please correct me if I am mistaken).
    My questions are the following:

    In the case of a unicast UD Send within the same subnet is the grh actually transported? I.e. can I overload the 40 bytes of the grh with actual payload, since it's going to go to my gather list in the receiver anyway?

    In the case of the WR posted on the Receive Queue, is it mandatory to specify a valid, registered region in the sge list to accommodate the GRH, even if the incoming UD Send has no payload (i.e. it sends only the immediate value)?
    In the case of sending no payload I should be able to write num_sge=0 when specifying the receive request. Does that hold true for a Receive Request that accommodates a UD Send?

    Thank you!

    • Dotan Barak says: July 3, 2017

      Hi.

      You are correct, for a UD QP the packet payload will be placed starting offset 40 in the Receive Request buffer;
      the first 40 bytes will contain a GRH only if the packet contained a GRH.

      When posting a Send Request of a UD QP, the sender controls, in the Address Handle, whether or not a GRH will be sent over the wire.
      You cannot control the content of the 40 bytes of the GRH; most of the GRH is filled automatically by the RDMA device.

      If a GRH isn't present, I *believe* (i.e. I didn't verify it) that you don't have to provide a valid MR key,
      since I have a feeling that the RDMA device validates the S/G list provided in the Receive Request only right before it writes content to memory.

      The spec says: "Note that for UD QPs, the first 40 bytes of the buffer(s) referred to by the Scatter/Gather list will contain the GRH of the incoming message. If no GRH is present, the contents of first 40 bytes of the buffer(s) will be undefined."

      The behavior is implementation dependent; different vendors/devices may behave differently.
      I suggest not to count on the implementation of specific RDMA device and always provide a valid MR.

      Thanks
      Dotan
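
For the receiving side, the 40-byte offset described here can be handled with a small helper. This is a sketch, assuming that the byte_len reported in the Work Completion of a UD receive counts the 40 reserved bytes in addition to the payload; `ud_payload()` and `UD_GRH_BYTES` are names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Space reserved for the GRH at the start of every UD receive buffer. */
#define UD_GRH_BYTES 40u

/*
 * Given the receive buffer and the byte_len from the Work Completion,
 * return a pointer to the payload and store its length in *payload_len.
 * Returns NULL if byte_len is too small to include the reserved 40 bytes.
 */
const uint8_t *ud_payload(const uint8_t *recv_buf, uint32_t byte_len,
                          uint32_t *payload_len)
{
    assert(recv_buf != NULL && payload_len != NULL);
    if (byte_len < UD_GRH_BYTES)
        return NULL;
    *payload_len = byte_len - UD_GRH_BYTES;
    return recv_buf + UD_GRH_BYTES;
}
```

The helper never looks at the first 40 bytes. Whether they hold a valid GRH is a separate question that the Work Completion flags answer; the buffer posted in the Receive Request still has to be large enough for those 40 bytes plus the largest expected message.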

  29. Erfan Zamanian says: December 14, 2017

    Hi Dotan,

    Is ibv_poll_cq a blocking function? For example, if the CQ is empty, would ibv_poll_cq return 0 immediately, or would it block and only sporadically return 0?

    • Dotan Barak says: December 22, 2017

      Hi.

      ibv_poll_cq() isn't a blocking function and it will always return immediately.

      Negative return value in case of an error.
      Otherwise, the number of Work Completions returned (0 means that no Work Completions were found in that CQ).

      Thanks
      Dotan
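
The three kinds of return value can be handled in one loop. The sketch below is written against a function pointer with the same shape as ibv_poll_cq(cq, num_entries, wc), but with opaque pointers, so that it builds without libibverbs; `poll_fn`, `drain_cq()` and `fake_poll()` are names made up for this example, and `fake_poll()` exists only so the loop can be tried without hardware.

```c
#include <assert.h>
#include <stddef.h>

/* Same shape as ibv_poll_cq(), with opaque pointers so no RDMA headers are needed. */
typedef int (*poll_fn)(void *cq, int num_entries, void *wc);

/*
 * Poll in batches until the CQ reports that it is empty.
 * Returns the total number of Work Completions read, or the negative
 * value returned by the poll function if it reported an error.
 */
int drain_cq(poll_fn poll, void *cq, void *wc_batch, int batch_size)
{
    int total = 0;

    assert(poll != NULL && batch_size > 0);
    for (;;) {
        int n = poll(cq, batch_size, wc_batch);
        if (n < 0)
            return n;       /* error in the CQ */
        if (n == 0)
            return total;   /* CQ is empty; the call returned immediately */
        /* the first n entries of wc_batch would be processed here */
        total += n;
    }
}

/* Stand-in poller: 'cq' points to an int holding the completions left (negative = error). */
int fake_poll(void *cq, int num_entries, void *wc)
{
    int *remaining = cq;
    (void)wc;
    if (*remaining < 0)
        return -1;
    int n = *remaining < num_entries ? *remaining : num_entries;
    *remaining -= n;
    return n;
}
```

With real verbs, a thin wrapper that casts the opaque pointers and forwards to ibv_poll_cq() could take the place of `fake_poll()`. Note that a successful return only says how many Work Completions were read; each one still carries its own status that has to be checked.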

  30. HeBoxin says: January 22, 2018

    Hi,
    There is a problem: the sender endpoint posted a Send Request and got a "RNR retry counter exceeded" completion (because at that time I hadn't posted a Receive Request on the remote endpoint). After I posted a Receive Request on the remote endpoint, I let the sender endpoint post a Send Request again. However, a Work Request Flushed Error occurred. Could you tell me how to solve this problem? I really appreciate it.

    • Dotan Barak says: March 2, 2018

      Hi.

      Once you get a Work Completion with error in an RC QP,
      the QP is being transitioned to the Error state.

      If you want to work with that QP, you need to reestablish the connection in both sides
      (move the QP to the Reset state, and configure it in both sides to the RTS state).

      Thanks
      Dotan

  31. Long says: March 9, 2018

    Hi Dotan,

    After setting the local QP to IBV_ACCESS_REMOTE_WRITE mode and sending a WQE, once the message was sent and its CQE was generated, the local side called poll_cq and got an ibv_wc with byte_len = 0.

    As far as I understand, the byte_len should be the number of byte transferred, which means the size of the buffer sent.

    Am i missing something ?

    Thank you for your help.

    Long

    • Dotan Barak says: May 13, 2019

      Hi.

      In a Work Completion with an error status, not all fields are valid.

      Thanks
      Dotan

  32. Leslie says: March 24, 2018

    I met same issue. Here is my data:
    * Is this the first completion with error?
    Yes, it is the first error.
    * Is the QP already in error?
    How do I know whether the QP is in error or not?
    * Is there an asynchronous event on the remote side?
    No.
    I'm using SoftRoCE. Could that be related?

    • Dotan Barak says: April 19, 2018

      Hi.

      You can know if a QP is in error or not by calling ibv_query_qp() and checking the state field.

      Personally, I don't have any experience with SoftRoCE, but all RDMA stacks should behave in the same way...

      Thanks
      Dotan

  33. Conrad says: May 13, 2018

    I am currently trying to make a simple sender/receiver setup using RDMA over InfiniBand (UD protocol). All the hardware is rated for 40+ Gbit/s, but I am only able to achieve around 13. From my understanding, it seems that the completion polling is slowing it down. The sender was made faster by signaling only every 100th Work Request, which saves a lot of polling, but on the receiving side I have to poll each request? How do I speed up the application?

    • Dotan Barak says: May 18, 2019

      Hi.

      If you want to decrease the stress on CPU, work with CQ events.
      I think that the number of outstanding WRs that you use is too low;
      I wrote a post on improving RDMA application performance - check it out.

      Thanks
      Dotan
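
The sender-side trick mentioned in the question (signaling only every 100th Work Request) comes down to choosing the send flags per request. A minimal sketch, assuming a QP that was created without "signal all" semantics; `send_flags_for()` is a hypothetical helper, and `SKETCH_SEND_SIGNALED` mirrors the value of IBV_SEND_SIGNALED so that the example builds without the verbs header.

```c
#include <assert.h>

/* Mirrors IBV_SEND_SIGNALED from <infiniband/verbs.h>; redefined so the sketch is standalone. */
#define SKETCH_SEND_SIGNALED (1u << 1)

/*
 * Returns the send flags for the request with sequence number 'seq'
 * (counting from 0) so that every 'interval'-th request is signaled.
 */
unsigned send_flags_for(unsigned long seq, unsigned interval)
{
    assert(interval > 0);
    return ((seq + 1) % interval == 0) ? SKETCH_SEND_SIGNALED : 0u;
}
```

With an interval of 100, requests 99, 199, 299 and so on generate a Work Completion. The Send Queue has to be deep enough to hold a full interval of unsignaled requests, because their entries are released only when a later signaled request completes. Receive Requests always generate a Work Completion, so this applies to the sender only.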

  34. Igor Leshenko says: August 19, 2018

    Once I get a WC from ibv_poll_cq(), is there a standard API to find out the address of the corresponding memory buffer (the one provided in ibv_post_recv())?

    • Dotan Barak says: August 24, 2018

      Hi.

      No.
      It is up to the SW to provide hints and use the information that exists in the Work Completion to know which memory buffer it corresponds to.

      Here are ideas how this can be done:
      * If it was a SEND message, the wr_id can be useful
      * If it was an RDMA message, imm_data can be used

      Thanks
      Dotan
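
One common way to use wr_id for this, shown as a sketch: store the address of an application-side bookkeeping structure in the 64-bit wr_id when posting the Receive Request, and convert it back when the Work Completion arrives. `struct recv_slot` and both helpers are hypothetical names; the only property relied upon is that wr_id is returned unchanged in the Work Completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request bookkeeping kept by the application. */
struct recv_slot {
    void    *buf;   /* buffer given in the scatter/gather list */
    uint32_t len;   /* size of that buffer in bytes */
};

/* Value to store in the wr_id field when posting the Receive Request. */
uint64_t slot_to_wr_id(struct recv_slot *slot)
{
    assert(slot != NULL);
    return (uint64_t)(uintptr_t)slot;
}

/* Recover the slot from the wr_id reported in the Work Completion. */
struct recv_slot *wr_id_to_slot(uint64_t wr_id)
{
    return (struct recv_slot *)(uintptr_t)wr_id;
}
```

An index into an array of buffers works just as well as a pointer and is easier to validate. Either way, the slot must stay alive until its Work Completion has been polled.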

  35. Zhao, Bing says: September 11, 2018

    Hi Dotan,
    In your previous comment, "I mean that the data was received to the remote side HCA and in almost all cases was written to its memory", I have a question (I cannot click the "reply" button due to a network issue). Do you mean that once the data arrives at the remote side in a WRITE operation, the NIC/HCA will generate an ACK if RC mode is used? I tested the write latency on an MLNX NIC (RDMA over Ethernet) and got a little confused. In RC mode, for example, it takes about 43 cycles (of a low-precision counter) to get the successful status after the write request function returns, but in UC mode it takes only 21 cycles. RC mode takes twice as long as UC mode. To my understanding, the link time and the HW ACK should take little time. Then why does RC mode cost more time? Many thanks.
    BR. Bing

    • Dotan Barak says: September 11, 2018

      Hi.

      There isn't any ACK in UC;
      You got a Work Completion in the sender side once the data was sent out of the port.

      You ask about the reason for the difference between reliable and unreliable latency;
      it depends on many things: RDMA device implementation, path (switch/cable type + length), local/remote chipset, and more ..

      Thanks
      Dotan

  36. Zhao, Bing says: September 19, 2018

    Hi Dotan,
    Thanks a lot for your reply. I do understand "You got a Work Completion in the sender side once the data was sent out of the port" now. Maybe I didn't describe my question clearly above. The gap between the latencies of RC and UC mode is not very big with the ib_write_lat tool, only about 0.01~0.02 microsecond. I made some modifications to the tool's code and then measured each part of the test loop within an iteration with an MLNX NIC on X86_64 & aarch64 platforms.
    The cost of "mlx5 post send" is very small. The "poll cq" cost differs between RC and UC mode: in UC mode it is only half of that in RC mode. As you say, it is due to the "ACK". Most of the cycles saved in UC mode are then "wasted" in the infinite loop that waits for the data from the remote side.
    So almost half of the "poll cq" cycles in UC mode is about waiting for the ACK from the remote. (If the poll cq drivers are almost the same for the UC and RC mode of mlx5). I just wonder why it is so long?

    B.R
    Bing

    • Dotan Barak says: September 26, 2018

      Hi.

      I'm sorry, but I don't understand what your application does.
      * Is it a pingpong?
      * Does only one side post Send Requests?
      * How does the sender "know" that the remote side got a response (in UC)?

      Thanks
      Dotan

  37. FANTAR says: May 13, 2019

    Hi Dotan,

    I want to establish an UD connection between a server and a client.

    - I have 128 bits GIDs of the server and the client. ( ibv_query_gid, index 0 port 1)

    - i created UD QPairs on both side and i put QPair number qp->num to a wanted value ( does it work ) I use it in
    - I use PKey index 0 on both sides : 0xFFFF
    - Qkey to a wanted value : 0x1234

    On host side i create an address handle (used in Send Work request :

    - union gid with dgid.raw[16] ( values from ibv_query_gids)

    - destqpnumber with my values ( do i need to keep the initial values generated by ibv_create_qp ?)

    Do you have an example of establishing an UD connection using GRH ?

    Thank you very much in advance

    Ramy

  38. Vinit Agnihotri says: June 24, 2019

    I am running an RDMA server under CentOS. I can do RDMA sends/recvs and RDMA reads/writes without any issue. However, while posting RDMA Read operations from the server I get IBV_WC_REM_INV_REQ_ERR if and only if the data size is 128k or more, and as a cherry on top it does not happen for every request; it's pretty random.

    The qp_access_flags, length and MR permissions are all correct, as the same code works well for sizes below 128k.

    I have no issues posting RDMA Writes, but some RDMA Reads get into trouble.

    When I set IBV_SEND_FENCE while posting, the error goes away, but it seems to lower the throughput. Any pointers/thoughts about what could be going wrong? Any help is greatly appreciated.

    I am using rdma_post_read() to post the operation and rdma_reg_write() to register the buffer.

    Thanks

    • Dotan Barak says: June 27, 2019

      Hi.

      What are the values of max_rd_atomic and max_dest_rd_atomic in both QPs?

      Thanks
      Dotan

  39. Vinit Agnihotri says: July 16, 2019

    Values are as follows.
    max_qp_rd_atom=21 max_res_rd_atom=387072 max_qp_init_rd_atom=21

    • Dotan Barak says: July 16, 2019

      Hi.

      What are the values in the QP context, not in the device capabilities?
      You can query the QP to get those values.

      thanks
      Dotan

  40. Vinit says: July 17, 2019

    Ahh, got it: ibv_query_qp() returns both values as 0 (max_rd_atomic, max_dest_rd_atomic).

    • Dotan Barak says: July 17, 2019

      Hi.

      I suggest that you set a non-zero value there ...

      Thanks
      Dotan

  41. vinit says: July 18, 2019

    Would you suggest using max_qp_rd_atom (from the device query) for the QP? What impact would it have if I assign, say, 10 as opposed to 21 at my end? Would it address my problem?

    Thanks.

    • Dotan Barak says: July 18, 2019

      For proper operation, max_rd_atomic should be lower than or equal to the remote side's max_dest_rd_atomic.

      For example, if you have QP_A and QP_B:
      QP_A.max_rd_atomic <= QP_B.max_dest_rd_atomic
      and
      QP_B.max_rd_atomic <= QP_A.max_dest_rd_atomic

      The higher the better (to allow supporting more outstanding RDMA Reads/Atomic operations)

      Thanks
      Dotan

      (this should be done configured to every
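The constraint in the reply above can be sketched as a small check. This is only an illustration: `struct rd_atomic_limits`, `rd_atomic_compatible()` and `clamp_rd_atomic()` are hypothetical names, not part of libibverbs. In a real program the two values would presumably be read from `struct ibv_qp_attr` after ibv_query_qp(), and the peer's values would have to be exchanged by the application's own connection protocol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the two per-QP attributes discussed above. */
struct rd_atomic_limits {
    uint8_t max_rd_atomic;      /* outstanding RDMA Reads/Atomics as requestor */
    uint8_t max_dest_rd_atomic; /* RDMA Reads/Atomics it can serve as responder */
};

/* True when, in both directions, the requestor depth does not exceed
 * what the other side can handle as a responder. */
bool rd_atomic_compatible(const struct rd_atomic_limits *a,
                          const struct rd_atomic_limits *b)
{
    assert(a != NULL && b != NULL);
    return a->max_rd_atomic <= b->max_dest_rd_atomic &&
           b->max_rd_atomic <= a->max_dest_rd_atomic;
}

/* Lower the wanted requestor depth to the peer's responder limit. */
uint8_t clamp_rd_atomic(uint8_t wanted, uint8_t peer_max_dest_rd_atomic)
{
    return wanted < peer_max_dest_rd_atomic ? wanted : peer_max_dest_rd_atomic;
}
```

With both values left at 0, as in the query result reported earlier in this thread, the check passes trivially but leaves no room for any outstanding RDMA Read, which is why a non-zero value is suggested.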

  42. Vinit says: July 22, 2019

    Unfortunately I don't have any control over the client, as the client runs in a Windows domain. The only control I have is over the Linux-based server. So is there any query I could run to get the remote side's parameters?

    • Dotan Barak says: July 26, 2019

      Hi.

      It is up to the SW protocol, as part of the communication manager, to exchange the supported attributes and configure the best attributes.

      Thanks
      Dotan

  43. vinit says: August 1, 2019

    Alright, then I think there is not much I can do in this case.
    I'll try setting some value at my end at least and see how it goes.
    Thank you.

  44. Christopher R says: September 25, 2019

    Hi Dotan,

    First, thanks for the great blog!

    I am wondering what an appropriate method for measuring the latency of ib verbs is. If you have a reliable connection, would taking a timestamp, then issuing the post read/write, then spin polling for a completion, followed by another timestamp accurately capture the latency?

    What about for unreliable connection / unreliable datagram?

    Thanks

    • Dotan Barak says: September 28, 2019

      Thanks
      :)

      Hi.

      I think that the answer is "no":
      The flow you just described measures the latency of the RDMA device's scheduling and processing plus the network latency.
      If you want to measure the latency of the RDMA verbs themselves, you can take a timestamp before and after calling the verb.

      If you want to check the latency of the data, the flow you described does it.
      BTW, there are tools for checking the BW and latency of RDMA: the ib_*_lat and ib_*_bw performance tools.

      Thanks
      Dotan
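The "timestamp before and after calling the verb" approach from the reply above could look roughly like the following sketch. It assumes a POSIX system with clock_gettime(); `now_ns()`, `time_call_ns()` and `example_op()` are hypothetical helpers, and `example_op()` is only a stand-in for the real verb call (for example a wrapper around ibv_post_send()).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Monotonic clock reading in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Time a single call of fn(arg): timestamp, call, timestamp. */
uint64_t time_call_ns(void (*fn)(void *), void *arg)
{
    assert(fn != NULL);
    uint64_t start = now_ns();
    fn(arg);
    return now_ns() - start;
}

/* Stand-in for the operation being measured; it only bumps a counter. */
void example_op(void *arg)
{
    int *counter = arg;
    (*counter)++;
}
```

To measure the latency of the data rather than of the verb, the same two timestamps would instead bracket the whole flow: one before posting the Send Request and one after ibv_poll_cq() returns its Work Completion. A single sample is noisy, so averaging over many iterations, as the ib_*_lat tools do, is probably the safer approach.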

  45. Nathan says: April 27, 2020

    Hi, I'm wondering how to determine whether an incoming message is from a normal UD RDMA Send or from a UD multicast. Thanks :)

    • Dotan Barak says: July 10, 2020

      Hi.

      A multicast message always has a GRH and the mgid[15] = 0xff,
      so you can get this information from the Work Completion and the message's GRH header.

      Thanks
      Dotan
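A minimal sketch of that check, under several assumptions: the link layer is InfiniBand, the first 40 bytes of a UD receive buffer hold the GRH when the Work Completion has IBV_WC_GRH set in wc_flags, and the "mgid[15]" in the reply refers to the most significant byte of the destination GID, which is the first byte in network order (raw[0] in union ibv_gid). The function name and constants are hypothetical; `has_grh` stands for `(wc.wc_flags & IBV_WC_GRH) != 0`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    GRH_LEN = 40,         /* size of the GRH placed in front of UD payloads */
    GRH_DGID_OFFSET = 24  /* destination GID occupies bytes 24..39 of the GRH */
};

/* Returns true when the received UD message carries a GRH whose
 * destination GID starts with 0xff, i.e. looks like a multicast GID. */
bool ud_recv_is_multicast(const uint8_t *recv_buf, size_t buf_len, bool has_grh)
{
    assert(recv_buf != NULL);
    if (!has_grh || buf_len < GRH_LEN)
        return false; /* no valid GRH, so nothing to inspect */
    return recv_buf[GRH_DGID_OFFSET] == 0xff;
}
```

A unicast UD message can also carry a GRH (for example when it is routed globally), which is why the sketch looks at the destination GID and not only at the flag.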

  46. rdma is hard says: July 24, 2020

    Hello Dotan,
    Assume that I have two machines talking over RC QPs (all connection setup is done and working properly). A sender issues an RDMA Read and marks it as signaled using IBV_SEND_SIGNALED. Assuming no other requests, I use an infinite while loop to poll one completion from the completion queue. Does the completion indicate that the read is complete (i.e. the data has been fetched into local memory) or just that the read request has reached the remote side?

    Thanks for your blog.

    • Dotan Barak says: July 25, 2020

      Hi.

      IBV_SEND_SIGNALED means that a Work Completion is generated in the *local* CQ
      when the processing of that Send Request ends.

      In RDMA Read, this happens once all the data that was requested to be read (from the remote side) has been received by the requestor.
      This means that you can now process the buffers, since the data is available in the local buffers.

      Thanks
      Dotan

  47. Alberto Perro says: October 27, 2020

    Hi Dotan,

    Thank you so much for this blog, it is helping me so much.
    I have set up an RC QP and I can exchange IBV_SEND/RECV messages without issues.
    When I try to use IBV_RDMA_READ/WRITE I can successfully post an SR, but polling the CQ gives me a wc with error status 12.
    I am using the same QP and MR, and I don't know why this is happening.

    Thanks,
    Alberto

    • Dotan Barak says: November 21, 2020

      Hi.

      Most likely you have a sync problem between the sides;
      I suspect that the requestor posted the Send Request BEFORE the responder QP was transitioned to (at least) RTR.

      Thanks
      Dotan
