ibv_reg_mr()
    struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum ibv_access_flags access);
Description
ibv_reg_mr() registers a Memory Region (MR) associated with a Protection Domain. By doing that, it allows the RDMA device to read and write data to this memory. Performing this registration takes some time, so registering memory isn't recommended in the data path, when a fast response is required.
Every successful registration will result in a MR which has unique (within a specific RDMA device) lkey and rkey values.
The MR's starting address is addr and its size is length. The maximum size of the block that can be registered is limited to device_attr.max_mr_size. Every memory address in the virtual space of the calling process can be registered, including, but not limited to:
- Local memory (either variable or array)
- Global memory (either variable or array)
- Dynamically allocated memory (using malloc() or mmap())
- Shared memory
- Addresses from the text segment
The registered memory buffer doesn't have to be page-aligned.
There isn't any way to know the total size of memory that can be registered for a specific device.
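As a minimal sketch (assuming <infiniband/verbs.h> is included and a Protection Domain pd was already allocated with ibv_alloc_pd(); the buffer name and size are arbitrary), a dynamically allocated buffer can be registered as-is, even though its start address is most likely not page-aligned:

    /* heap memory; the start address is (most likely) not page-aligned */
    size_t size = 8192;
    char *buf = malloc(size);
    if (!buf)
        return -1;

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        fprintf(stderr, "Error, ibv_reg_mr() failed\n");
        free(buf);
        return -1;
    }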
The argument access describes the desired memory access attributes by the RDMA device. It is either 0 or the bitwise OR of one or more of the following flags:
Flag | Description |
---|---|
IBV_ACCESS_LOCAL_WRITE | Enable Local Write Access: the Memory Region can be used in Receive Requests, or in IBV_WR_ATOMIC_CMP_AND_SWP and IBV_WR_ATOMIC_FETCH_AND_ADD to write the original remote content locally |
IBV_ACCESS_REMOTE_WRITE | Enable Remote Write Access: the Memory Region can be accessed from a remote context using IBV_WR_RDMA_WRITE or IBV_WR_RDMA_WRITE_WITH_IMM |
IBV_ACCESS_REMOTE_READ | Enable Remote Read Access: the Memory Region can be accessed from a remote context using IBV_WR_RDMA_READ |
IBV_ACCESS_REMOTE_ATOMIC | Enable Remote Atomic Operation Access (if supported): the Memory Region can be accessed from a remote context using IBV_WR_ATOMIC_CMP_AND_SWP or IBV_WR_ATOMIC_FETCH_AND_ADD |
IBV_ACCESS_MW_BIND | Enable Memory Window Binding |
If IBV_ACCESS_REMOTE_WRITE or IBV_ACCESS_REMOTE_ATOMIC is set, then IBV_ACCESS_LOCAL_WRITE must be set too since remote write should be allowed only if local write is allowed.
Local read access is always enabled for the MR, i.e. Memory Region can be read locally using IBV_WR_SEND, IBV_WR_SEND_WITH_IMM, IBV_WR_RDMA_WRITE, IBV_WR_RDMA_WRITE_WITH_IMM.
The requested permissions of the memory registration can be the whole set or a subset of the operating system permissions for that memory block. For example: read-only memory cannot be registered with write permissions (either local or remote).
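As an illustrative sketch only (assuming pd and length are already set up, <sys/mman.h> is included, and error checking of mmap() is omitted): a read-only mapping can be registered without any write access flag, while requesting local write access on it is expected to fail:

    /* read-only mapping: can be registered for local read only */
    void *ro_buf = mmap(NULL, length, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct ibv_mr *ro_mr  = ibv_reg_mr(pd, ro_buf, length, 0);                      /* should succeed */
    struct ibv_mr *bad_mr = ibv_reg_mr(pd, ro_buf, length, IBV_ACCESS_LOCAL_WRITE); /* expected to fail */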
A specific process can register one or more Memory Regions.
Parameters
Name | Direction | Description |
---|---|---|
pd | in | Protection Domain that was returned from ibv_alloc_pd() |
addr | in | The start address of the virtual contiguous memory block |
length | in | Size of the memory block to register, in bytes. This value must be at least 1 and at most device_attr.max_mr_size |
access | in | Requested access permissions for the memory region |
Return Values
Value | Description |
---|---|
MR | Pointer to the newly allocated Memory Region. This pointer also contains the lkey and rkey fields of the registration. Those values may be equal, but this isn't always guaranteed. |
NULL | On failure, errno indicates the failure reason |
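For illustration, the fields of the returned struct ibv_mr can be read directly; a minimal sketch, assuming a successful registration like the ones in the examples below:

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE);
    if (mr)
        printf("addr=%p length=%zu lkey=0x%x rkey=0x%x\n",
               mr->addr, mr->length, mr->lkey, mr->rkey);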
Examples
Register a MR to allow only local read and write access and deregister it:
    struct ibv_pd *pd;
    struct ibv_mr *mr;

    mr = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        fprintf(stderr, "Error, ibv_reg_mr() failed\n");
        return -1;
    }

    if (ibv_dereg_mr(mr)) {
        fprintf(stderr, "Error, ibv_dereg_mr() failed\n");
        return -1;
    }
Register a MR to allow Remote read and write to it:
    mr = ibv_reg_mr(pd, buf, size,
                    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        fprintf(stderr, "Error, ibv_reg_mr() failed\n");
        return -1;
    }
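As a follow-up sketch (not part of the original example): the lkey of the returned MR is used locally, in the scatter/gather entries of Work Requests, while the buffer address and the rkey are the values that the remote side needs (how they are exchanged, e.g. over a socket, is up to the application):

    /* the lkey is used locally, in scatter/gather entries of Work Requests */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = size,
        .lkey   = mr->lkey
    };

    /* the remote side needs the buffer address and the rkey to fill the
     * wr.rdma.remote_addr and wr.rdma.rkey fields of its RDMA Read/Write Work Requests */
    uint64_t remote_addr_to_send = (uintptr_t)mr->addr;
    uint32_t rkey_to_send        = mr->rkey;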
FAQs
What is a MR good for, anyway?
MR registration is a process in which the RDMA device takes a memory buffer and prepares it to be used for local and/or remote access.
Can I register the same memory block more than once?
Yes. One can register the same memory block, or part of it, more than once. Those memory registrations can even be performed with different access flags.
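A minimal sketch of such a double registration (assuming pd, buf and size as in the examples above); each call returns a separate handle with its own keys:

    /* the same buffer registered twice, with different permissions */
    struct ibv_mr *mr_local  = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE);
    struct ibv_mr *mr_remote = ibv_reg_mr(pd, buf, size,
                                          IBV_ACCESS_LOCAL_WRITE |
                                          IBV_ACCESS_REMOTE_READ |
                                          IBV_ACCESS_REMOTE_WRITE);
    /* mr_local->lkey != mr_remote->lkey (and the same holds for the rkeys) */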
What is the total size of memory that can be registered?
There isn't any way to know the total size of memory that can be registered. Theoretically, there isn't any limit to this value. However, if one wishes to register a huge amount of memory (hundreds of GB), the default values of the low-level drivers may not be enough; look at the "Device Specific" section to learn how to change the default parameter values in order to solve this issue.
Can I access memory through the RDMA device with more permissions than the operating system allows me?
No. During memory registration, the driver checks the permissions of the memory block and verifies that the requested permissions are allowed for that memory block.
Can I use a memory block in RDMA without this registration?
Basically, no. However, there are RDMA devices that have the ability to read memory without memory registration (inline data send).
ibv_reg_mr() failed, what is the reason for this?
ibv_reg_mr() can fail for the following reasons:
- Bad attributes: bad permissions or bad memory buffer
- Not enough resources to register this memory buffer
If this is the first buffer that you register, the first reason is the more likely one. If the memory registration fails after many buffers were already registered, the reason is probably a lack of resources to register this memory buffer: most of the time, these are resources in the RDMA device used for translating virtual addresses to physical addresses. In that case, you may want to check how to increase the amount of memory that can be registered by this RDMA device.
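A hedged sketch of checking the failure reason (requires <errno.h> and <string.h>; the exact errno values are driver dependent, EINVAL and ENOMEM are just common ones):

    mr = ibv_reg_mr(pd, buf, size, access);
    if (!mr) {
        switch (errno) {
        case EINVAL: /* bad attributes: access flags or memory buffer */
            fprintf(stderr, "ibv_reg_mr() failed: bad attributes\n");
            break;
        case ENOMEM: /* not enough resources (or the memlock limit was reached) */
            fprintf(stderr, "ibv_reg_mr() failed: not enough resources\n");
            break;
        default:
            fprintf(stderr, "ibv_reg_mr() failed: %s\n", strerror(errno));
        }
        return -1;
    }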
Can I register several MRs in my process?
Yes, you can.
Device Specific
Mellanox Technologies
ibv_reg_mr() failed, what is the reason for this?
If you are using an HCA from the ConnectX family, it is a matter of configuration; you should increase the total size of memory that can be registered. This can be accomplished by increasing the value of the log_num_mtt parameter of the mlx4_core module.
Adding the following line to the file /etc/modprobe.conf or to /etc/modprobe.d/mlx4_core.conf (depending on the Linux distribution that you are using) should solve this problem:
    options mlx4_core log_num_mtt=24
Comments
Hi Dotan sir,
I am very new to IB and uverbs coding and I have a small doubt: do the internal objects like QP, CQ and related handles use pinned (locked) memory or not?
Sorry if I asked a silly question.
thanks
Jagadeesh.
Hi Jagadeesh.
Welcome to the RDMA scene
:)
The answer is: yes. The internal queues which require space (such as QP, CQ, SRQ) use pinned memory.
Thanks
Dotan
Hello Dotan
I want to ensure my assumption:
Can the same memory address be registered to more than one physical device simultaneously?
Thanks.
Boris.
Hi Boris.
The resources of every RDMA device are completely separate.
Having said that, there isn't any limitation that prevents doing it, but keep in mind that there isn't
any guarantee about the ordering of accesses to this buffer by the devices, and you need to take care of it in your code.
Thanks
Dotan
Hi Dotan,
I have a question about RDMA operations. In one subnet, if multiple processes on different machines register many MRs, is there any possibility that two of these MRs have the same rkey field? If so, for an RDMA operation which specifies wr.rdma.rkey in its work request, how does InfiniBand know which remote MR to send the data to?
I'm asking this question because I'm doing a small test and I find the above scenario.
Thanks,
Jiajun
Hi Jiajun.
rkey is an attribute of an RDMA device:
at a specific point in time, only one MR in that RDMA device can have this rkey value.
If you are working with multiple devices in the same server, or with multiple servers in a subnet,
you may get the same rkey value more than once.
Since you use the rkey in a Send Request that is sent to a specific RDMA device (using the destination LID),
this isn't a problem and the RDMA protocol knows how to handle it.
I hope that this answer was clear
Dotan
Hi Dotan,
That makes a lot of sense. Thanks.
As I understand, in a subnet, LID along with QP number becomes the unique identifier of a queue pair, while LID along with rkey is the unique id of a memory region.
When a RDMA read/write operation is performed via ibv_post_send(), the hardware will use dlid in the associated qp and rkey in wr.rdma.rkey to locate which remote MR to read from or write to. Is that correct?
Another question, in the above case, will the dest_qp_num in the qp be useful? If not, the qp can read/write data to any MRs that belong to dlid, if appropriate flags have been set, right?
Thanks,
Jiajun
Hi Jiajun.
Yes, in a subnet, a LID along with a QP number becomes a unique identifier of the QP.
LID along with rkey can be seen as a unique ID of a MR, but you connect using QPs and not using
MRs...
Yes, you are correct:
When performing RDMA Write or Read, the DLID and remote QP number will be taken from the (local) QP context,
and the remote RDMA device will use the rkey (that was posted in the SR) to understand which MR to use.
Connected QPs can work with only one remote QP (hence "connected").
You cannot change the remote QP number after you set it when modifying the QP from INIT to RTR.
Any remote MR, with the right permissions, can be used with that remote QP as long as they share the same PD.
Thanks
Dotan
Hi Dotan,
When using kernel space verbs, before and after RDMA operations it is recommended to call ib_dma_sync*(OFED API) on buffers because of CPU cache.
But when using user level verbs there is no such option (cache operations) provided, and no one seems to have faced such caching problems.
Can you please help me to understand, in user space how cache coherence is maintained?
Thanks & Regards
Jagadeesh
Hi Jagadeesh.
I'm sorry that it took me some time to answer your question
(I had some technical problems to answer earlier).
This is a great question. There was a mail thread about this issue in the linux-rdma mailing list;
the mail thread subject was "(R)DMA in userspace" and it started on 11/10/2012 17:34.
The bottom line is that those calls (i.e. the ib_dma_sync* calls) are mainly needed for non-cache-coherent architectures.
Most machines today are cache-coherent, so we don't hit any problem.
However, if one tries to use user level verbs on a non-cache-coherent machine, things are expected to be broken
(memory may not contain the expected content).
I hope that I answered; you can find much more information in the mail thread above.
Dotan
Hi Dotan,
Thanks for your reply.
The link helped to make things clear.
Thanks & Regards
Jagadeesh
This is great.
Thanks for the feedback
:)
Dotan
Hi Dotan,
Should the memory address passed to ibv_reg_mr be page-aligned?
Thanks!
Hi Igor.
The memory address that is being registered doesn't have to be page-aligned.
Thanks
Dotan
Hi Dotan,
The latest OFED contains "peer memory" API, which is unfortunately not covered yet in your blog. Still, I hope I may ask a question on this subject.
I'm using this new API to enable registration of a virtual memory region, which is actually mmapped from a PCI memory region - to enable RDMA Write to this PCI memory. IIUC, to enable such functionality, one should implement all the callbacks provided in struct peer_memory_client (as described in PEER_MEMORY_API.txt), including get_pages(). The question is how one should implement this function, considering the fact that there are no struct page's for a PCI memory region.
Following my previous comment - actually I've figured out that it's enough to fill sg-entries with physical addresses and lengths, no need in struct page's.
Great, thanks for the update
:)
Dotan
Hi Dotan,
I got a question on registering a single memory region for multiple clients.
I have a server which provides a memory region to be shared and accessed by different remote clients. Previously, I registered the MR when the server is set up(right after ibv_cm_id and PD are created). Both remote READ and WRITE operations worked, but atomic operation did not work. So I tried putting the registration process at the stage where the first client is establishing the connection(i.e. when the server receives a RDMA_CM_EVENT_CONNECT_REQUEST event). By doing this, the atomic operation worked. This confused me because to register a MR, only a PD and an associated allocated memory block are required. Why did the atomic operations fail with the first method since the PD does not change? And what is the right way to register a memory region for multiple clients? Do I need to register one MR for every client, or a single MR for all clients?
Thanks!
Hi Eric.
I must admit that I'm not fully familiar with the librdmacm functionality.
However, I'll try to shed some light...
You can register a single MR for all clients, as long as all of them share the same PD.
If there is a problem with the QPs after connecting them, I suggest that you perform a Query QP and check the attributes:
* qp_access_flags
* max_rd_atomic
* max_dest_rd_atomic
Maybe the reason that atomic operations failed was that the QPs weren't configured to support it.
If you can share the source code, maybe I can give you some more insights..
I hope that this helps you
Dotan
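A minimal sketch of such a Query QP check (assuming qp is the connected QP handle):

    struct ibv_qp_attr attr;
    struct ibv_qp_init_attr init_attr;

    if (!ibv_query_qp(qp, &attr,
                      IBV_QP_ACCESS_FLAGS | IBV_QP_MAX_QP_RD_ATOMIC | IBV_QP_MAX_DEST_RD_ATOMIC,
                      &init_attr))
        printf("qp_access_flags=0x%x max_rd_atomic=%u max_dest_rd_atomic=%u\n",
               attr.qp_access_flags,
               (unsigned)attr.max_rd_atomic, (unsigned)attr.max_dest_rd_atomic);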
Hi Dotan,
I'm using SoftiWarp. When I try to register with different sizes, ibv_reg_mr fails. For example, when I try to register 1KB, 4KB or 16KB, there is no problem. But when it is 64KB or more, I get the error. What can be the reason for it?
Hi.
Can you check using 'ulimit' how much memory your process can lock?
(ulimit -l)
Thanks
Dotan
Hi,
It gives 64. What does it mean?
That your process can lock (i.e. pin) up to 64KB of memory.
I would suggest increasing this value if you want to work with RDMA...
Thanks
Dotan
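A minimal sketch of checking this limit from within the process (requires <sys/resource.h>; RLIM_INFINITY means unlimited):

    struct rlimit rl;

    if (!getrlimit(RLIMIT_MEMLOCK, &rl))
        printf("locked memory limit: soft=%llu hard=%llu bytes\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);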
I solved the problem by configuring the /etc/security/limits.conf file.
Thanks.
Yes.
This is another way to change the amount of memory which can be locked.
I'm glad I could help you
:)
Thanks
Dotan
Hi Dotan,
I was trying to implement an application that support rdma using
librdmacm. When the process is trying to register the buffer using
ibv_reg_mr, it fails after registering approximately 128 MB.
As you suggested, I reconfigured options as
1) cat /sys/module/mlx4_core/parameters/log_num_mtt
24
2) cat /sys/module/mlx4_core/parameters/log_mtts_per_seg
7
Also i have set value in /etc/security/limits.conf as
* soft memlock unlimited
* hard memlock unlimited
So I hope that will allow us to pin unlimited memory with the kernel. But we are only able to register approximately 128 MB. Further registration fails, with ibv_reg_mr returning NULL and
setting errno=11 (Resource temporarily unavailable).
Am I missing something or am I doing something wrong?
If you can point out anything regarding this, it will be great.
Thanks & Regards!
Rafi KC
Hi Rafi.
* Can you send me the output of "ulimit -l"?
* If you can share the code, I can tell you what is wrong
(assuming that the problem is with the code that you wrote).
* Did you try to register a lower amount of memory? At which size did it start to fail?
* Did you check dmesg or /var/log/messages for errors?
Thanks
Dotan
1) Can you send me the output of "ulimit -l"?
A) unlimited
2) * If you can share the code, I can tell you what is wrong
A) https://github.com/gluster/glusterfs/blob/master/rpc/rpc-transport/rdma/src/rdma.c line number 111
3)* Did you try to register lower amount of memory? in which size did it start to fail?
A) it successfully works for lower amount of memory, and fails at 128 MB (+1 tolerance ).
4) * Did you check dmesg or /var/log/messages for errors?
A)http://ur1.ca/iritw
Regards
Rafi KC
Hi Rafi KC.
Which ibv_reg_mr failed?
(there are several memory registrations in your code).
What is the total amount of memory that you registered?
Please try to increase log_num_mtt (when loading the mlx4_core driver) and check if this helps.
Thanks
Dotan
Hi Dotan,
I'm using NVIDIAs GPUdirect technology and my system freezes when reading from GPU memory with ibverbs. Writing to it works all right. I'm wondering how ibv_reg_mr knows, that it should pin GPU memory. As far as i understand it, the nv_peer_mem kernel module registers the GPU memory as peer memory with ibverbs.
But how does ibv_reg_mr "talk" with the gpu memory instead of the host memory? I've been looking through the source code for hours but can't figure it out.
Any help would be appreciated!
Regards,
Max
Hi Max.
There are two modules when working with GPU memory:
1) driver that the GPU vendor provides (in your case, NVIDIA) which allows the kernel to access and map the GPU memory
2) RDMA core that allows registration of memory
From the RDMA core point of view, GPU memory is "regular" system memory that it can work with;
there isn't any special code that handles this type of memory.
I suggest that you install the latest driver from NVIDIA.
Thanks
Dotan
Just out of curiosity: doesn't GPU memory need to be registered in a kernel module via GPUDirect (PeerDirect) mechanism, to allow the subsequent MR registration? I guess, the regular MR registration wouldn't work, as get_user_pages() would fail for GPU memory...
Hi Igor.
Sorry about the late response; I'm moderating all comments and this one entered the wrong category.
Anyway, you are correct: a specific plugin (kernel module) should be written to register this GPU memory with the MLNX-OFED PeerDirect API, so that it can be used to detect and manage the memory actions related to this memory.
Thanks
Dotan
Hi Dotan,
thanks for your help. It turned out that only specific mainboard chipsets support NVIDIA GPUdirect with Mellanox Infiniband, which is not documented anywhere. With the 4th hardware it finally works now.
Regards,
Max
Great, thanks for the update.
Dotan
Very informative article, thanks for keeping it updated!
I've been trying to ibv_reg_mr a 4kb page of mmaped address from kernel memory, but it throws an EINVAL. A normal posix_memalign call generates an address that can be registered. I'm quite new to this and suspect I'm doing it completely wrong. What's the proper way to do zero-copy from userspace ib results to kernel memory? Thanks!
p.s. the captcha doesn't seem to show up on mobile devices!
Hi Jimmy.
Thanks for the warm words, and for the note about the captcha.
:)
In general, every memory which is accessible in a userspace process can be registered.
I don't fully understand what you are trying to do.
Can you give some more background on what you tried to do and what didn't work?
Thanks
Dotan
I was using "get_zeroed_page" in kernel to allocate memory in kernel space and mapping it to userspace, and ibv_reg_mr throws an error on that.
Today I instead tried "get_user_page" in kernel to map to memory allocated using posix_memalign in userspace. This one works for ibv_reg_mr, so I think I'll run with this :)
So I guess if I were to use kernel allocated memory, I should be using kernel IB verbs?
Hi.
I must admit that I'm not a kernel expert.
AFAIK, get_zeroed_page() returns a special page in the kernel: full of zeroes and with the COW indication.
So, when one tries to write on it, it is copied in the virtual space of the process/module.
What is the meaning of mapping this page to the userspace?
allowing the userspace process to modify it?
I suspect that this is the reason that you failed to map it to the userspace.
Another option is to get a page with another kernel service function (for example: kzalloc())
and map it to the userspace.
This is a kernel issue and not an RDMA-related issue...
Thanks
Dotan
Hi Dotan,
Thanks for all the above replies; they help a lot in understanding RDMA.
Is it possible to invoke RDMA operations such as RDMA Send, RDMA Read, RDMA Write from another H/W component, or is it just a S/W interface?
TIA
Santosh
Hi Santosh.
Short answer : yes.
Long answer : to enable an RDMA device, one needs to have a working low level driver. Assuming that you have it and the other HW component can perform PCI cycles to the RDMA device - the answer is yes.
Thanks Dotan,
If the low level driver prepares the RDMA Queue Pair and provides the queue information to the H/W component, then the H/W component will know the RDMA Queue Pair configuration, such as its base address, queue length and doorbell, and it can prepare and post the WQE and ring the doorbell for the specific queue.
If the above information is available to the H/W, then it can send and receive the PCIe TLPs.
To do this kind of RDMA operation, what other information will be needed? Basically I am trying to interface another H/W module with the RDMA ASIC engine, but I want to use less S/W in the data path.
Thanks
Santosh
Hi Santosh.
In general it can be done. However:
* Since you are working with the low level driver directly against the HW, you need to have the HW specification/technical document.
* The control part (which is still running on the CPU) contains some state values of the device; you need to figure out how to handle this synchronization.
I'm giving you general comments here from the knowledge I have.
If you want to get into more details, I suggest that you contact the support of the HW vendor that you'll work with.
Thanks
Dotan
Thanks Dotan for the clarification.
Hi Dotan,
I am confused about how dynamically allocated memory (using malloc() or mmap()) becomes DMA-able. Does the RNIC driver allocate a bounce buffer for the registered user virtual memory (this virtual memory has been mapped to physical memory)? If so, a data copy happens between the user virtual memory and the bounce buffer; when is this data copy triggered?
Hi.
I will describe what is going on in InfiniBand and RoCE (I don't know about iWARP).
The answer is no; the RDMA core (at the kernel level) translates the virtual buffer to the physical addresses of its pages,
and Work Requests access them directly.
So, the RDMA device accesses the memory buffer directly, hence zero copy.
Thanks
Dotan
Hi Dotan, thanks for maintaining a very informative and helpful site.
I have a simple question for you.
Is it guaranteed that the member 'addr' of struct created by ibv_reg_mr is always equal to the addr passed in the argument?
(It seems that way from the Mellanox examples, but just to be sure)
:)
Thanks!
The answer is yes, "addr" always points to the buffer registered by the Memory Region
(unless something bad happened and someone changed this value; a thing that shouldn't happen).
Thanks
Dotan
Hi Dotan,
I am relatively new to IB and RDMA. My question may seem silly, but please don't mind. In your FAQs you have said that the same memory region can be registered more than once. Say, for example, ibv_reg_mr() is called with the same parameters twice ie the addr and size are the same and also the pd and access field. In this case, what fields will vary when ibv_reg_mr() returns? I don't know if the lkey and rkey values differ since ibv_reg_mr() will be called twice. Just trying to understand what happens in this case.
Thanks,
Ranjit G M
Hi.
Registering the same memory buffer twice (with same or different permissions) will end with two different Memory Region handles; every one of them will have unique lkey and rkey values.
Thanks
Dotan
Thanks for the answer!
Hi,
As this function is quite expensive, I try to minimize memory registrations as much as possible. As far as I understand, ibv_reg_mr should check permissions. But consider the case
int* data = new int[1000];
auto mr = ibv_reg_mr(... data...);
delete[] data;
int* data2 = new int[1000];
Assume that data2 == data (which is quite reasonable). I'd assume that in this case, mr can still be used, but this seems not to be the case.
So: Is there a way to check if a memory region is valid?
Kind regards
Hi Lukas.
During the memory registration, the reference count on the memory pages is increased,
so the physical memory pages still exist, even if you free the buffer.
The virtual <-> physical mapping, from the RDMA device point of view, is constant,
until you reregister it or do any other manipulation to it in the RDMA stack.
The fact that you freed this block, then allocated a new one and got the same address, doesn't change that:
the physical pages behind this buffer may have changed.
There isn't any way to check that a Memory Region is valid;
if you registered it and didn't deregister/invalidate it - this Memory Region is still valid.
Thanks
Dotan
Hello,
I've a question about memory registration: consider a long (pinned) buffer and many different ibv_post_send (or ibv_post_recv) related to different part of this buffer (i.e. send(buffer+someValue, ...) , send(buffer+someOtherValue, ...), etc... ).
Is it better to register only once (at the beginning) the entire buffer and use the same registered memory key for all the send (receive) operations?
Or is it better to register a new memory region for each part of the buffer before a send (receive)?
Best Regards
Hi.
I believe that you are asking, performance-wise, what is better;
the answer is "it depends on the behavior of your RDMA device".
I would suggest using one big Memory Region and using different parts of it,
on demand
(the management of it is easy, plus you will get many cache hits).
Thanks
Dotan
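A minimal sketch of this approach - one big registration, with each Send Request pointing at a different offset inside it (pd, qp, big_buf, big_size, offset and chunk_size are placeholders):

    /* one big registration; every Send Request uses a different chunk
     * of the same buffer, with the same lkey */
    struct ibv_mr *mr = ibv_reg_mr(pd, big_buf, big_size, IBV_ACCESS_LOCAL_WRITE);

    struct ibv_sge sge = {
        .addr   = (uintptr_t)big_buf + offset, /* a chunk inside the registered block */
        .length = chunk_size,
        .lkey   = mr->lkey
    };

    struct ibv_send_wr wr = {
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED
    };
    struct ibv_send_wr *bad_wr;

    if (ibv_post_send(qp, &wr, &bad_wr))
        fprintf(stderr, "Error, ibv_post_send() failed\n");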
Hi,
I really got stuck in memory registration. I am going to register 2 GB of memory, but with different Memory Regions (each one 16 bytes). More precisely, the following code:
for (int i=0; i<xxxxxxx;i++)
mr=ibv_reg_mr();
but after about 1000 iterations I get an error that memory cannot be allocated. I checked the configuration file; it should allow me to allocate at least some GBs of memory. I would really appreciate it if you could let me know if there is any solution.
Thank you so much!!!!
Hi.
I suspect that the value of 'ulimit -l' (i.e. the amount of memory that can be pinned)
is limited.
Please check this and increase this limit.
Thanks
Dotan
Hi Dotan,
Thank you for your response. I actually checked, but it is set to unlimited. I have to submit my job to a cluster and then wait for a free compute node to execute my code. Do you suppose it can be a restriction of RDMA itself or of the OS?
Hi.
It can be both:
* lack of resources from the RDMA device
* limitation of the OS itself
Check what is the maximum buffer that you can lock;
if it is ~ 32K or 64K, most likely it is environment (i.e. OS problem).
Thanks
Dotan
Can you please let me know how can I check the buffer that I can lock?
When I increase the third parameter of ibv_reg_mr(), the size of the memory block to register, to 2 GB, it is OK and works, but 2 GB in different chunks (MRs) does not work!
Actually, my goal is to create a hash table and have direct access to it from another node. I don't know how I can manage this. Can I do it with one memory registration?
Thank you,
Hi.
You can use one big Memory Region and access it locally/remotely.
To understand what the problem in your setup is, more information is required...
Thanks
Dotan
I really appreciate your responses. I try to describe my problem in a small size, I hope it is clear:
I allocate a memory region as follows:
struct ibv_mr * tmp_mr;
struct shared_data{
int32_t Data;
struct ibv_mr * next;
};
struct shared_data * data_region = ( struct shared_data *) malloc( sizeof(struct shared_data) );
data_region->Data=1;
data_region->next.addr=NULL;
tmp_mr = ibv_reg_mr(s_ctx->pd, data_region, sizeof(struct shared_data) * 2 , IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ);
In the ibv_reg_mr() call, I register double the size of shared_data (sizeof(struct shared_data) * 2). Is it possible to access each one separately by Memory Region? Actually, if I register them separately at a large scale, RDMA gives me an error.
Hi.
Maybe I'm missing something here,
but data_region is of size sizeof(struct shared_data), and you try to register twice its size;
i.e. you try to register memory which isn't allocated
(i.e. not in the program's virtual space).
You can register two memory blocks, and make each of them have a different handle
(and local/remote key).
If you have a problem, maybe you can send me the code for review.
Thanks
Dotan
Hi Dotan,
I am not able to register a memory-mapped region from NVDIMM. Is there any restriction?
Hi.
I'm not familiar with NVDIMM,
but in order to register such memory, a driver which allows registering this memory must be available.
Thanks
Dotan
Hi Dotan, I am surprised that an ibv_query_mr() verb does not exist, given the ability of a remote peer to invalidate an application's (exported) MR (via Send-with-Invalidate). Since CQEs do not report incoming IETH key values, how does an application determine if its exported MR remains valid?
Hi.
This is a great question.
However, Send-with-Invalidate only invalidates *that* specific rkey and not the Memory Region itself;
other rkeys to that region and the virtual-to-physical mapping of this Memory Region are still valid.
There is an indication on the responder side about the invalidation in the Work Completion:
that this message invalidated an rkey (IBV_WC_WITH_INV) and the value of the rkey that was invalidated (invalidated_rkey).
The SW must keep track of the status of the invalidated keys.
AFAIK, there isn't any verb that verifies if a specific rkey is valid.
Thanks
Dotan
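A minimal sketch of tracking this indication on the responder side (mark_rkey_invalid() is a hypothetical application helper):

    struct ibv_wc wc;
    int n = ibv_poll_cq(cq, 1, &wc);

    if (n > 0 && wc.status == IBV_WC_SUCCESS &&
        (wc.wc_flags & IBV_WC_WITH_INV)) {
        /* the remote side invalidated one of our rkeys;
         * the SW should mark it as unusable from now on */
        mark_rkey_invalid(wc.invalidated_rkey); /* hypothetical helper */
    }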
Hi Dotan,
Recently I am testing infiband rdma bandwidth.
I found something that I can not understand.
buffer_size=4K
recv_post_buffer_count=4096
send_post_buffer_count=4096
transmissions_count=4096
two QPs and two send threads; every send thread transmits 2048
one srq
one recv_completion_thread
one send_completion_thread
CASE 1: Client: allocate 4096 post_send buffers (4K each) and call ibv_reg_mr() 4096 times; every post_send buffer has its own MR.
Server: allocate 4096 post_recv buffers (4K each) and call ibv_reg_mr() 4096 times; every post_recv buffer has its own MR.
In this case (CASE 1), the bandwidth reaches only half of the full bandwidth.
CASE 2: Client: allocate a big_buffer (4096 * 4K) and big_mr = ibv_reg_mr(big_buffer), then split big_buffer into 4K post_send buffers; they share the big_mr.
Server: allocate a big_buffer (4096 * 4K) and big_mr = ibv_reg_mr(big_buffer), then split big_buffer into 4K post_recv buffers; they share the big_mr.
In CASE 2, the bandwidth reaches the full bandwidth.
Can you tell me what the difference between CASE 1 and CASE 2 is?
Hi.
The HW vendor of this adapter will be able to provide more details;
but I would suspect:
* Maybe it is a lack of SW optimization (for example: registering the Memory Regions as part of the measured data path)
* Let's assume that there is a cache for MR data; in CASE 1 the RDMA device may get a cache miss on many messages
(depending on the cache size and the order of packets), while in CASE 2 the information about that MR is always in the device's cache.
My 2 cents, but this is just my guess.
Thanks
Dotan
Dotan,
Fantastic blog/site. Really.
1) I have a configuration with two CentOS workstations each with a Mellanox ConnectX-5 device.
2) New to RDMA but I was able to get several of the examples in the OFA suite of examples running. And the scenarios in which they run the performance is impressive.
3) Now, trying to modify the scenarios to suit my needs, I run into trouble. If I run with small and few buffers the client/server runs fine. For example, running blast-rdma-1 with 2 buffers (-b) of size 8192 (-s) it runs fine. If I run with 4 buffers or 16384 it crashes with:
ibv_reg_mr returned 12 errno 12 Cannot allocate memory
4) I'm having our IT guy add a line to /etc/modprobe.d/mlx4_core.conf as you suggested.
5) What do you think the issue is? The cards do work. But those seem like extremely small sizes. The default for the parameter you suggested (log_num_mtt) is 19. You suggested 24. What do these values mean?
Any help is appreciated.
Thanks,
Paul
Hi.
I have a feeling that you have a low limit for the amount of memory that can be locked;
I suggest that you'll check the value of 'ulimit -l' and increase it.
Thanks
Dotan
Hi Dotan,
Thank you for maintaining this site, which has proven extremely useful to me!
I have a question about registering the same memory address several times using ibv_reg_mr(). What I would like to achieve is as follows: I have a server S and two clients A and B; S exposes some memory M to both A and B; I would like A to be able to have read-write access to M through RDMA and B to have read-only access to M.
So far I have tried registering M twice on S, once with read-write access flags and once with read-only access flags. I share the rkey of the read-write registration with A and the rkey of the read-only registration with B. I used only one protection domain. When running the code, I notice that only the last registration is retained: if I register read-only, then read-write, both A and B are able to write; if I register read-write, then read-only, neither A nor B is able to write.
Is this expected behavior? Is there another way to achieve different access permissions from A and B to S's memory? Thank you in advance!
Hi.
You can register M twice, once as read-only and once as read-write, and provide the right rkey to each remote client;
and this should work.
If the access rights of both memory blocks follow the last registration - this is a bug;
it should be possible to register the same memory buffer multiple times with different permissions.
BTW, if you wish, you can have only one MR (with remote read + write enabled),
and set the remote access rights at the QP level...
Thanks
Dotan
Hi Dotan,
Really appreciate your contribution. Can you please help me with my query? I have set up two IB connections on my system (two different IP addresses). I have two processes, each connected to a port on those two different IB IP addresses. Is it possible for both of them to register the same memory region and read and write to it in parallel? If not, is reading at least possible in parallel if both processes register the same region with just read access?
Hi.
A Memory Region exists within a specific device,
if you have 2 different RDMA devices - you can't have one Memory Region for both of them.
A Memory Region is associated with a specific Protection Domain,
and different processes can't share Protection Domains.
However, you can register the same memory buffer twice, from different processes, for example: if you are using shared memory.
But you'll have to sync the access to those two Memory Regions (to prevent read/write to same address in parallel).
Thanks
Dotan
Thanks for your reply. Really appreciate it.
Hi Dotan,
Thanks for your mail. If I get you correctly, I can register the same memory buffer twice from two different processes, even though each of them is connected to ports of different physical RDMA devices (different IPs). Is that correct?
Hi,
Yes. You can register the same memory buffer by different processes,
as long as this buffer is available to both processes.
This registration can be done to different RDMA devices.
Thanks
Dotan
Hi Dotan,
I have a clarification question with respect to multiple registrations of the same memory. If I have two memory regions A and B of the same buffer and allow atomic operations by remote processes on both of these memory regions, is atomicity maintained between A and B? That is, if two remote processes simultaneously try to atomically update the buffer using rkeys of the different memory regions, the final value in the buffer is not guaranteed to be correct, right?
Hi.
The answer is "yes":
If the RDMA device supports atomic operations and you are using atomic operations from the same or different processes within the same RDMA device,
the value at the remote address will be accessed in an atomic way (according to the operation used).
Global atomicity (i.e. within several RDMA devices in the same host) *may* be supported - one should check the device capabilities.
Thanks
Dotan
Hi Dotan,
An excellent blog site here. Learned a lot from this. I am new with this so pardon my ignorance but I have a question. I want to register large amounts(at least a few hundred GBs) of memory using ibv_reg_mr. The ibv_reg_mr maps the memory so it must be creating some kind of page table right?
I want to calculate the size of the page table created by ibv_reg_mr so that I can calculate the total amount of memory that would be required to store the mr of a particular amount of memory. Can you explain the logic of calculating such required amount of memory to store the mr? or is there a blog available somewhere which explains this logic?
If there is an example which shows the calculation for the amount of memory required to store the mr for let's say 500 GB of memory then that would be great. Note that I am talking about creating a single mr for all the memory to be registered and not doing this in chunk of small memory blocks.
Hi
Thanks :)
Every registered memory block will require a page table (no matter what is its size).
The needed space for that page table is device specific.
Let's try to make a *very* rough calculation of the needed information for every registered memory block
(according to the information that we need to hold):
- For permissions: 32 bits
- For the VA: 64 bits
- Let's divide the registered block into 4KB pages: for each page we'll need 64 bits
This is the minimum required space; maybe there will be padding somewhere (depends on how this data is saved in the device).
Again, this is just an estimation...
I hope that this helped you.
Thanks
Dotan
Hello Dotan,
I see. Thanks for the explanation. Correct me if I am wrong, according to your estimation, a single 128 GB block would need 32 permission bits + 64 VA bits + (128G/4K)*64 bits. right? So for registering a single block of 128 GB, I would need 2 GB of space? Is there a concept of multi-level page table here or not?
When I try to register a single block of 128 GB on my system, I can see around 0.8 GB of RAM used by the program. This amount was 0.4 GB for 64 GB. So seems like there are. 160 bits for each 4k page division of the block? This is shown in the MemFree section of the /proc/meminfo before and after running the program. Can you deduce the logic of the amount of memory required to store the MR from this for my system? Note that I am using huge pages of 2 MB for allocating a large single block. Does that have anything to do with this?
Also you said the data is stored on the device? you meant the page table? Plus, you were saying it depends on the device. I would like to know about the Mellanox cards. Your blog says that you worked with Mellanox for a greater part of your career. So maybe you have a clue about how much space ibv_reg_mr will take for a Mellanox card? Also if there is a site or code link that you know of which might help in deducing the logic for Mellanox. then that would be good too.
Thanks in advance!
Hi.
As I said, it is vendor and device specific;
there can be implementations where the maximum possible page size will be used (for example: huge pages or even more),
and I'm sure there are more tricks in Linux to get a big contiguous physical block.
The data is stored for the device to use, either in the device's or in host memory
(depends on the device technology).
I'm sorry, but I prefer not to provide any device/vendor specific information (for many reasons).
If you need information for a specific vendor you can contact the support of that vendor
or review the (open) low-level drivers of that vendor or contact the maintainer of that driver.
I hope that you find this answer helpful
:)
Dotan
Hi Dotan,
I am new to RDMA. I have a good enough understanding of concepts. I am trying to achieve a many-to-many communication, I intend to use ibv_post_send or ibv_post_recv. Is it necessary to register multiple memory regions in this case? If so, how to achieve multiple memory regions.
(Apologies if these questions are trivial, I am a little confused.)
Thanks!
Hi.
You can register one Memory Region and use it (or part of it) in all the QPs.
However, all the QPs and that MR should be created with the same Protection Domain.
Thanks
Dotan
Hi Dotan, your blog has been incredibly valuable
I do have one question:
I have a use case where I want to consistently write data to a separate machine as it becomes ready. I know the size of each message but not the total number of messages. I understand it is smart to not have the reg_mr in the data path and to keep it in the setup of the endpoints. Would I be best pre-allocating a large MR and continuously writing to it? If so, is there a way to free up space as the data in the MR is handled, so that I could write to the same MR indefinitely, with the client repeating a write work request?
Thanks
:)
Hi.
Memory Region handling isn't trivial:
reregistering the MR (i.e. changing its size, if the RDMA device supports it) can take time,
and the local and remote keys will change, so you can't use this Memory Region with the old keys.
I see several ways to handle your problem:
* Register a big Memory Region and use the space you need, and deregister it only at the end
* Register several Memory Regions and deregister the Memory Regions that you used and won't use anymore
(although deregistration is time consuming as well, and it isn't good to use it in the data path)
* Register a Memory Region (one or more) and define a protocol with the remote side,
to sync where it will perform RDMA Read/Write (if RDMA is used), or define a kind of flow control in your SW
(according to the number of Receive Requests/available buffers).
This is my 2 cents..
Dotan
Hi Dotan,
I'm an engineer focusing on RDMA technology. In my view, RDMA Read/Write uses virtual addresses and the RNIC translates these addresses into physical addresses, and ibv_reg_mr passes these translation entries to the RNIC. Recently, I read a paper named "LITE kernel RDMA support for datacenter applications"; in Section 4.1, the authors say that they use an infrequently used verb to register MRs with the RNIC directly using physical memory addresses, and they can issue RDMA requests using physical addresses. However, when I refer to the RDMA programming manual and google for this feature, nothing comes out. Do you have any idea about this feature (or what verb/API supports it)?
Any suggestion is well-appreciated :)
Hi.
This can be done ONLY in kernel space,
since in user space you are working only with virtual addresses.
I think you are referring to Fast Registration Work Request (FRWR)
Thanks
Dotan