Fence
What is Fence?
In the Send Queue, there are operations that only send data from the local host to a remote one, and there are operations that read data from a remote host and store it locally.
Sometimes there is a need to read data from a remote host and send all or part of it back. Obviously, this should be done in two separate Send Requests. However, some operations can begin processing before the previous ones have ended, and this may lead to unexpected memory content being sent. Furthermore, reading from a remote address and then writing to it (either directly using RDMA Write and Atomic, or indirectly by a Send) may lead to unexpected behavior.
There is a trivial solution for the above scenario: post one Send Request for the first operation, wait until it is completed (i.e. poll its Work Completion) and then post the second Send Request.
This solution will work, but it will consume more CPU time and provide worse latency than posting those two Send Requests at once.
Luckily, RDMA provides us with a mechanism to enforce an order on Send Request processing: Fence. When there is a Fence indicator on a Send Request, its processing won't begin until all prior RDMA Read and Atomic operations on the same Send Queue have completed. This is relevant only for reliable Queue Pairs (since only they support those operations).
Fence guarantees the processing order and Completion notification of Send Requests that were posted to the same Send Queue. The order between multiple Send Queues is undefined.
How to use Fence?
Fence can be used by setting the IBV_SEND_FENCE bit in the send_flags field of the Send Request and calling ibv_post_send().
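For example, here is a minimal sketch of doing that; it assumes an already-initialized Send Request (wr) and Queue Pair (qp), which are just placeholder names, and error handling is omitted:

struct ibv_send_wr *bad_wr;

/* Request that this Send Request won't be processed until all prior
   RDMA Read and Atomic operations on this Send Queue have completed */
wr.send_flags |= IBV_SEND_FENCE;

if (ibv_post_send(qp, &wr, &bad_wr))
	fprintf(stderr, "Error, ibv_post_send() failed\n");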
When to use Fence?
As a rule of thumb, Fence should be used when an RDMA Read or Atomic operation is followed by a Send, RDMA Write, or Atomic operation and the first one can still be outstanding.
Here is a description of what may happen when using or not using Fence, and how it affects the Send Request processing order:
- An RDMA Read which is followed by Sends, RDMA Writes, or Atomics: not setting Fence may lead to a case where the RDMA Read response contains data that was already modified by the second operation. Using Fence will make sure that the data read by the first operation is the original (and expected) data (a sketch of this case appears right after this list).
- An RDMA Read or Atomic operation which is followed by Sends, RDMA Writes, or Atomics: not setting Fence may lead to a case where, if the first operation completes with an error on the initiator side (because its ACK fails to return, a local protection error occurs when writing the data, or any other reason), the second operation could still be observed by the target, and it may even cause data to be written in the target's memory. Using Fence will prevent the second operation from being observed by the target if the first one fails.
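To illustrate the first case, here is a sketch of posting an RDMA Read followed by a fenced Send in a single call to ibv_post_send(). The names qp, mr, buf, BUF_SIZE, remote_addr and rkey are placeholders for resources that are assumed to already exist, and error handling is omitted:

/* Read the remote buffer into buf, then send buf; the Fence indicator
   guarantees that the Send won't be processed before the Read completes */
struct ibv_sge sge = {
	.addr   = (uintptr_t)buf,
	.length = BUF_SIZE,
	.lkey   = mr->lkey
};

struct ibv_send_wr send_wr = {
	.wr_id      = 2,
	.sg_list    = &sge,
	.num_sge    = 1,
	.opcode     = IBV_WR_SEND,
	.send_flags = IBV_SEND_FENCE | IBV_SEND_SIGNALED
};

struct ibv_send_wr read_wr = {
	.wr_id               = 1,
	.next                = &send_wr,
	.sg_list             = &sge,
	.num_sge             = 1,
	.opcode              = IBV_WR_RDMA_READ,
	.wr.rdma.remote_addr = remote_addr,
	.wr.rdma.rkey        = rkey
};

struct ibv_send_wr *bad_wr;

if (ibv_post_send(qp, &read_wr, &bad_wr))
	fprintf(stderr, "Error, ibv_post_send() failed\n");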
Comments
Tell us what you think.
Quoting from your post: "When there is a Fence indicator on a Send Request, its processing won't begin until all prior RDMA Read and Atomic operations on the same Send Queue have completed."
What about a prior Send or RDMA Write targeting the same location? Will Fence serialize requests in such a case? Or is there no need for Fence, since sends/writes are fully serialized anyway, and only reads can be "reordered" or "interleaved" with writes?
Hi Igor.
The problem exists only with RDMA Reads and Atomic operations, since they
change the content of the local memory.
For example, if one uses an RDMA Read of a remote memory buffer which is followed by a Send (and the Send uses the content being read by the RDMA Read),
there isn't any guarantee (without using Fence) about the content of the buffers to be sent.
Furthermore, the RDMA Read may still be writing to the memory that the Send is reading (so there will be a race on the content that will be sent).
Send and Write operations don't cause any problem, and the data sent over the wire is deterministic.
Did I answer your questions?
Thanks
Dotan
Ok, I see now...
So, just to clarify: consider two RDMA Writes issued one immediately after another, both targeting the same remote memory range. The former writes a series of '1's, the latter a series of '2's. Can we assume that the whole region will eventually contain '2's (despite possible packet retransmissions performed by RC, etc.)?
Thank you very much.
AFAIK, the answer is "yes" since all packets are in the same Queue (Send Queue).
If the second message's packets (with the '2') would have magically passed the first message's packets (with the '1'), they would have been dropped because they were Out of Sequence packets.
If a Read were involved, only the Read Request would have been sent from the Send Queue
(this is why other messages, such as Send and Write, can continue to be sent and change the memory content on the remote side).
However, you cannot know that the data is available on the remote side until there is a Work Completion in the remote's Receive Queue or a memory element was changed by an atomic operation.
Thanks
Dotan
Thanks for the great explanation.
Regarding the last sentence: if the sender issued an RDMA Write request with a long message A, and then immediately an RDMA Write request with, let's say, a 1-byte message B, then if the receiver discovers message B (e.g. by polling the appropriate memory location), he can know that message A is already there (as opposed to polling the last byte of A, which is not guaranteed to arrive last). Is that correct?
Theoretically yes; however, there isn't any guarantee that if the last byte of the message was written to memory,
the rest of it exists in it as well.
It is highly recommended to work with Work Completions to know that the content exists in memory
(for example, send the second message with RDMA Write with Immediate, which will consume an RR on the responder side and thus create a Work Completion as well).
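For example, here is a sketch of posting such an RDMA Write with Immediate; the names qp, mr, buf, BUF_SIZE, remote_addr and rkey are placeholders for resources that already exist, and error handling is omitted:

/* The immediate data consumes a Receive Request on the responder side,
   so the responder will get a Work Completion for this message */
struct ibv_sge sge = {
	.addr   = (uintptr_t)buf,
	.length = BUF_SIZE,
	.lkey   = mr->lkey
};

struct ibv_send_wr wr = {
	.sg_list             = &sge,
	.num_sge             = 1,
	.opcode              = IBV_WR_RDMA_WRITE_WITH_IMM,
	.imm_data            = htonl(0x1234), /* htonl() needs <arpa/inet.h> */
	.send_flags          = IBV_SEND_SIGNALED,
	.wr.rdma.remote_addr = remote_addr,
	.wr.rdma.rkey        = rkey
};

struct ibv_send_wr *bad_wr;

if (ibv_post_send(qp, &wr, &bad_wr))
	fprintf(stderr, "Error, ibv_post_send() failed\n");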
Thanks
Dotan
That's what I meant - if the last byte arrived, it's not guaranteed that the whole message arrived; but if the *next* message is detected, the previous one *is* guaranteed to be ready - right? (for the case when polling completions is undesired)
I guess in most cases the answer will be yes
(I can't think of a scenario where it won't happen).
Thanks
Dotan
Thanks!
The last question in this series :) - in RC, why could the last byte of a message be written to the destination memory before some previous ones? Isn't proper ordering guaranteed by RC? Besides, could it happen that some location (byte) is being written twice (under any theoretical circumstances)?
No problem, you are welcome to ask me any question and I'll try to answer it
:)
I assume that a byte won't be written twice.
However, you look at the last byte of the message and you *assume* that the data is written to memory in order (from lower address to higher address).
And what if (I don't say that this is the case) there is a behavior in the HCA or in the CPU (optimization, feature, bug, whatever the reason is) that causes DMA'ed data to be written from the last byte in a block to the first one?
I believe that the spec gave some freedom to the HCA vendors.
Let's look to the future, when not all memory pages will be pinned to memory. Let's assume that a message will be written to several pages: the first page is on disk and the rest of the pages are present in memory. An optimization could be to write the message to the pages that are present in memory while the missing page is being loaded, instead of dropping packets.
I hope that this gave you an answer.
Thanks
Dotan
I had problems with an RDMA Read followed by an RDMA Write and found the cause. The PCIe specification requires switches to allow writes to pass reads to the same target. It seems to only actually happen when the bus gets busy and commands get queued up in the switches.
Regards,
Mark
Hi Mark.
Did you try to set the FENCE indication for the RDMA Write?
This way, the processing of the RDMA Write won't start until the RDMA Read before it has ended,
no matter what the ordering rules of the PCIe are.
Thanks
Dotan