Zero byte messages
RDMA supports zero byte messages: post a Send Request without a scatter/gather list (i.e. a list with zero entries).
Zero byte messages can be sent with the following opcodes:
- Send
- Send with immediate
- RDMA Write
- RDMA Write with Immediate
- RDMA Read
For the RDMA operations, the remote address and remote key are not actually used or validated, so those values don't have to describe a valid remote Memory Region.
What are zero byte messages good for?
Zero byte messages can be useful in the following scenarios:
- When only the immediate data is used - This can be useful to mark a directive or a status update.
- For keep alive messages in a reliable QP - A zero byte RDMA Write or RDMA Read is a good non-intrusive keep alive message in a reliable QP: it verifies that the remote QP is still alive and functioning. If the remote QP is offline (for example, the QP was transitioned to the Error or Reset state, the process was terminated, or the node itself was rebooted), there will be a Work Completion with Retry Exceeded status. Using one of the other above-mentioned opcodes will consume a Receive Request from the remote side QP.
Comments
This is a nice article!
By the way, my follower said that sending a zero byte message to a peer on a ConnectX-3 card causes IBV_WC_LOC_LEN_ERR.
The source is here.
http://www.nminoru.jp/~nminoru/data/201309/libibverbs-1.1.6-zero-length-send-test.diff
Could you please tell me what is missing?
Thanks.
As I wrote, in order to send zero byte messages, the s/g list should have zero entries, but in your example it has one entry:
struct ibv_send_wr wr = {
    .wr_id = PINGPONG_SEND_WRID,
    .sg_list = &list,
    .num_sge = 1, <-------------------- this should be zero
-   .opcode = IBV_WR_SEND,
+   .opcode = IBV_WR_SEND_WITH_IMM,
+   .imm_data = 0,
    .send_flags = IBV_SEND_SIGNALED,
};
Sending a scatter/gather entry with the value zero in its length member actually means sending 2 GB...
Thanks
Dotan
Hello Dotan.
Thank you for your reply. I'll check it
and report back later.
From what I understand, this should actually have caused a 2 GB data transfer, but it causes IBV_WC_LOC_LEN_ERR. Why?
The reason I ask is that I am facing the same problem: setting .num_sge = 1 and .length = 0 causes IBV_WC_LOC_LEN_ERR.
The question is whether the S/G entry that you described points to a valid Memory Region space?
Thanks
Dotan
If you like RDMAmojo, support it.
Hello Dotan.
Thank you for your advice. It worked properly.
Great!
Hey Dotan.
I was wondering if you could help me understand the interaction between RDMA and CPU caches. I had the following specific question:
When a remote host reads from a server via RDMA, where does the read actually come from? I read that writes go to L3 cache. Do the reads come from L3 cache too? If so, what happens if something is in a modified state in L1 or L2 cache? Is L3 always up-to-date with L1/L2?
Thanks a lot for your time!
Hi Anuj and sorry for the late response.
Whether the L3 cache is up-to-date with L1/L2 is a question that you should ask
the chipset/CPU guys.
But IMHO, the answer is yes.
Thanks
Dotan
Hey Dotan,
I noticed this from your post on ibv_post_send about sge.length:
The length of the buffer in bytes. The value 0 is a special value and is equal to 2^{31} bytes (and not zero bytes, as one might imagine)
Is this Mellanox specific or generic?
Is there a spec document which describes this? I could not find any that states this, except for a couple of forum posts on zero-byte messages.
Is there a reason why 0 is expected to mean 2 GB? If so, how would my CI interpret an sge.length of 2147483648 (bytes), which can also be stored in a uint32_t?
Thanks,
--
Hi.
This is a good question.
I searched for an answer in the InfiniBand spec, and couldn't find one.
So, I can't give you a quick answer about the origin of this behavior.
I can think of one reason: what is the meaning of a scatter/gather entry with zero bytes?
If it is zero bytes, why did you add it in the first place?
Another reason is that 0 is actually 2 GB modulo 2 GB, so if you take any scatter/gather entry length modulo 2 GB (the maximum size of a message in RDMA), a length of 2 GB becomes 0.
I'll investigate it further, but it will take some time.
(BTW, all posts are moderated to prevent SPAM, so they will be seen only once they are approved by me)
Dotan
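To make the modulo argument concrete, here is a small sketch in plain C. It only models the convention discussed in this thread (a length field of 0 stands for 2^31 bytes). The function names are invented for the example, and other devices may interpret the field differently.

```c
#include <assert.h>
#include <stdint.h>

#define SGE_MAX_BYTES (UINT64_C(1) << 31)   /* 2 GB, the maximum message size */

/* Bytes described by a length field, assuming that 0 stands for 2^31. */
uint64_t sge_effective_length(uint32_t length)
{
    return length == 0 ? SGE_MAX_BYTES : (uint64_t)length;
}

/* Inverse direction: encode a byte count in the range 1..2^31 modulo 2^31. */
uint32_t sge_encode_length(uint64_t bytes)
{
    assert(bytes >= 1 && bytes <= SGE_MAX_BYTES);
    return (uint32_t)(bytes % SGE_MAX_BYTES);   /* 2^31 wraps around to 0 */
}
```

Under this model every value from 1 to 2^31 - 1 encodes to itself, and only the full 2 GB wraps to 0. That leaves no encoding for an entry of zero bytes, which fits the advice to use an empty scatter/gather list instead.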
Dotan,
Thanks for your response.
And no problem. Please take your time. I shall keep my eyes open on this thread. :-)
Hi Dotan,
I have the same question as "rsai". Did you find an answer to it?
Also, I'm trying to do an RDMA Write of 5 GB of memory, but I see that only 1 GB gets into the remote buffer. I assigned the allocated buffer to the sge and its length to sge.length, and I'm using only one sge. Kindly suggest what the issue could be.
Thanks,
Hi Parthiban.
In general, RDMA (the protocol itself) can support up to 2 GB in one message.
RDMA devices may have a lower limit.
If you need to send more data than the maximum supported value (1 GB in your example),
you can use several RDMA writes to send the local (big) buffer to the remote buffer.
Thanks
Dotan
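The splitting can be sketched like this, in plain C without any verbs calls. The struct and function names are made up for the example. Each chunk would then be posted as one RDMA Write, with the chunk offset added to both the local buffer address and the remote address. The limit max_msg is assumed to come from the device, for example the max_msg_sz field of struct ibv_port_attr.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of the big buffer; both fields are in bytes. */
struct chunk {
    uint64_t offset;   /* offset from the start of the local/remote buffers */
    uint32_t length;   /* what would go into sge.length for this piece */
};

/*
 * Split `total` bytes into chunks of at most `max_msg` bytes.
 * Up to `out_cap` chunks are stored in `out`; the return value is the
 * number of chunks needed, which may be larger than `out_cap`.
 */
size_t split_into_chunks(uint64_t total, uint32_t max_msg,
                         struct chunk *out, size_t out_cap)
{
    size_t n = 0;
    uint64_t off = 0;

    assert(max_msg > 0);

    while (off < total) {
        uint64_t left = total - off;
        uint32_t len = left < max_msg ? (uint32_t)left : max_msg;

        if (n < out_cap) {
            out[n].offset = off;
            out[n].length = len;
        }
        n++;
        off += len;
    }
    return n;
}
```

For the 5 GB buffer and the 1 GB limit from the question, this yields five chunks of 1 GB each.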
Hi Dotan,
Thanks a lot for your response. I'm new to RDMA; is there any program that I can refer to in order to implement this?
Thanks.
Hi Parthiban.
I haven't published any "hello world" posts yet.
A good example is examples/rping.c at the following URL:
.
Thanks
Dotan
Just to point out: I looked through different vendors' driver source code, and what I found was that Mellanox is the only one that treats 0 as 2 GB.
Hi.
Here is a text from the InfiniBand specifications:
9.3.3.3 DMA LENGTH (DMALEN) - 32 BITS
This field indicates the length, in bytes, of the remote DMA operation.
C9-9: For an HCA performing RDMA operations, the minimum length
specified in the DMALen field is 0; the maximum length is 2^31.
So, the value zero (in packet headers) means 2^31.
If one wishes to send a message with zero bytes, one can use a Send Request with no scatter/gather elements at all.
Thanks
Dotan
Hi Dotan,
Thanks for your pointers. I'm not able to see the link you posted, but I have referred to the program at the link below to implement it:
http://web.mit.edu/freebsd/head/contrib/ofed/libibverbs/examples/rc_pingpong.c
Thanks.