Verify that RDMA is working
In the last few posts, I explained how to install the RDMA stack in several ways (inbox, OFED and manually). In this post, I'll describe how to verify that the RDMA stack is working properly.
Verify that RDMA kernel part is loaded
First, one should check that the kernel part of the RDMA stack is working. There are two options to do this: using the service file or using lsmod.
Verify that RDMA kernel part is loaded using service file
Verifying that the kernel part is loaded can be done using the relevant service file of the package/OS. For example, on an inbox RedHat 6.* installation:
[root@localhost] # /etc/init.d/rdma status
Low level hardware support loaded:
        mlx4_ib

Upper layer protocol modules:
        ib_ipoib

User space access modules:
        rdma_ucm ib_ucm ib_uverbs ib_umad

Connection management modules:
        rdma_cm ib_cm iw_cm

Configured IPoIB interfaces: none
Currently active IPoIB interfaces: ib0 ib1
Verify that RDMA kernel part is loaded using lsmod
In all Linux distributions, lsmod can show the loaded kernel modules.
[root@localhost] # lsmod | grep ib
mlx4_ib               113239  0
mlx4_core             189003  2 mlx4_ib,mlx4_en
ib_ipoib               68315  0
ib_ucm                  9597  0
ib_uverbs              30216  2 rdma_ucm,ib_ucm
ib_umad                 8931  4
ib_cm                  30987  3 ib_ipoib,ib_ucm,rdma_cm
ib_addr                 5176  2 rdma_ucm,rdma_cm
ib_sa                  19056  5 mlx4_ib,ib_ipoib,rdma_ucm,rdma_cm,ib_cm
ib_mad                 32968  4 mlx4_ib,ib_umad,ib_cm,ib_sa
ib_core                59893 11 mlx4_ib,ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad
One should verify that the following kernel modules are loaded: ib_uverbs and the low-level driver of the HW installed in the machine.
Verify that userspace applications are working
Verify that RDMA devices are available
ibv_devices is a tool, included in the libibverbs-utils rpm, that shows the available RDMA devices in the local machine.
[root@localhost libibverbs]# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx4_0              000c29632d420400
One should verify that the number of available devices equals the number of expected devices in the local machine.
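The same check can also be done programmatically. Here is a minimal sketch (assuming the libibverbs development headers are installed; not one of the packaged utilities) that enumerates the local RDMA devices, roughly what ibv_devices does:

/* devices_sketch.c - list the local RDMA devices using libibverbs.
 * Build: gcc devices_sketch.c -o devices_sketch -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        int num_devices;
        struct ibv_device **dev_list = ibv_get_device_list(&num_devices);

        if (!dev_list) {
                perror("ibv_get_device_list");
                return 1;
        }

        printf("%d RDMA device(s) found\n", num_devices);
        for (int i = 0; i < num_devices; i++)
                printf("    %s\n", ibv_get_device_name(dev_list[i]));

        ibv_free_device_list(dev_list);
        return 0;
}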
Verify that RDMA devices can be accessed
ibv_devinfo is a tool, included in the libibverbs-utils rpm, that opens a device and queries its attributes; doing so verifies that the user and kernel parts of the RDMA stack can work together.
[root@localhost libibverbs]# ibv_devinfo -d mlx4_0
hca_id: mlx4_0
        transport:                      InfiniBand (0)
        fw_ver:                         1.2.005
        node_guid:                      000c:2963:2d42:0300
        sys_image_guid:                 000c:2963:2d42:0200
        vendor_id:                      0x02c9
        vendor_part_id:                 25418
        hw_ver:                         0xa
        phys_port_cnt:                  2
                port:   1
                        state:                  PORT_ACTIVE (4)
                        max_mtu:                4096 (5)
                        active_mtu:             4096 (5)
                        sm_lid:                 1
                        port_lid:               1
                        port_lmc:               0x00
                        link_layer:             InfiniBand

                port:   2
                        state:                  PORT_INIT (2)
                        max_mtu:                4096 (5)
                        active_mtu:             256 (1)
                        sm_lid:                 0
                        port_lid:               0
                        port_lmc:               0x00
                        link_layer:             InfiniBand
One should verify that at least one port is in the PORT_ACTIVE state, which means that the port is available for use.
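This check can also be done from code. The following is a minimal sketch (assuming the first device and port 1 are the ones of interest) that opens a device, queries port 1 and reports whether it is in the PORT_ACTIVE state:

/* port_state_sketch.c - open the first RDMA device and check port 1.
 * Build: gcc port_state_sketch.c -o port_state_sketch -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        struct ibv_device **dev_list = ibv_get_device_list(NULL);
        struct ibv_context *ctx;
        struct ibv_port_attr port_attr;

        if (!dev_list || !dev_list[0]) {
                fprintf(stderr, "No RDMA devices found\n");
                return 1;
        }

        ctx = ibv_open_device(dev_list[0]);
        if (!ctx) {
                fprintf(stderr, "Failed to open device\n");
                return 1;
        }

        if (ibv_query_port(ctx, 1, &port_attr)) {
                fprintf(stderr, "ibv_query_port failed\n");
                return 1;
        }

        printf("Port 1 state: %s\n",
               port_attr.state == IBV_PORT_ACTIVE ? "PORT_ACTIVE" : "not active");

        ibv_close_device(ctx);
        ibv_free_device_list(dev_list);
        return 0;
}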
Verify that traffic is working
Send traffic using ibv_*_pingpong
The ibv_*_pingpong tests, included in the libibverbs-utils rpm, send traffic over RDMA using the SEND opcode. They are relevant only to InfiniBand and RoCE.
It is highly recommended to execute those tools with an explicit device name and port number. They will work without any parameters, but in that case they will use the first detected RDMA device and port number 1.
Here is an execution example of the server side:
[root@localhost libibverbs]# ibv_rc_pingpong -g 0 -d mlx4_0 -i 1
  local address:  LID 0x0003, QPN 0xb5de9e, PSN 0x9d7046, GID fe80::c:2963:2d42:401
  remote address: LID 0x0003, QPN 0xb5de9f, PSN 0xfeec26, GID fe80::c:2963:2d42:401
8192000 bytes in 0.27 seconds = 239.96 Mbit/sec
1000 iters in 0.27 seconds = 273.11 usec/iter
Here is an execution example of the client side (the IP address is the IP address of the machine that the server is running on):
[root@localhost libibverbs]# ibv_rc_pingpong -g 0 -d mlx4_0 -i 2 192.168.2.106
  local address:  LID 0x0003, QPN 0xb5de9f, PSN 0xfeec26, GID fe80::c:2963:2d42:401
  remote address: LID 0x0003, QPN 0xb5de9e, PSN 0x9d7046, GID fe80::c:2963:2d42:401
8192000 bytes in 0.27 seconds = 245.91 Mbit/sec
1000 iters in 0.27 seconds = 266.50 usec/iter
One should execute the server side before the client side (otherwise, the client will fail to connect to the server).
Send traffic using rping
rping is a tool, included in the librdmacm-utils rpm, that sends RDMA traffic. rping is relevant for all RDMA-powered protocols (InfiniBand, RoCE and iWARP).
The address for both the client and server sides (the '-a' parameter) is the address that the server listens on. In InfiniBand, this address should be that of an IPoIB network interface. In RoCE and iWARP, it is the IP address of the network interface.
Here is an execution example of the server side:
[root@localhost libibverbs]# rping -s -a 192.168.11.1 -v
server ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
server ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
server ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
server ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
server ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
server ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
server ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
server ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
server ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
server ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
server ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
server ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB
server ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC
server ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD
server ping data: rdma-ping-14: OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDE
server ping data: rdma-ping-15: PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEF
server ping data: rdma-ping-16: QRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFG
server ping data: rdma-ping-17: RSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGH
server ping data: rdma-ping-18: STUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHI
server ping data: rdma-ping-19: TUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJ
Here is an execution example of the client side:
[root@localhost libibverbs]# rping -c -a 192.168.11.1 -v
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB
ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC
ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD
ping data: rdma-ping-14: OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDE
ping data: rdma-ping-15: PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEF
ping data: rdma-ping-16: QRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFG
ping data: rdma-ping-17: RSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGH
ping data: rdma-ping-18: STUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHI
ping data: rdma-ping-19: TUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJ
One should execute the server side before the client side (otherwise, the client will fail to connect to the server).
rping will run endlessly and continue printing the data to stdout until CTRL-C is pressed.
Comments
Hi Dotan,
I have a very fundamental doubt.
What actually improves latency in RDMA?
Is it kernel bypass? Is it the saved address translation (memory registration)? If yes, both of those exist in send/recv as well.
The only reason I can imagine is one DMA saved in RDMA, which otherwise needs to be done to retrieve the recv WQE in send/recv.
Please clarify, if possible
Hi Murthy.
Several things improve the latency of RDMA (compared to other technologies). I guess that the most important ones are:
* Kernel bypass (which may take tens to hundreds of nanoseconds at each side)
* The fact that memory buffers are always present in RAM (no page faults)
* The fact that the RDMA device handles the data path and not SW (i.e. the network stack)
When comparing RDMA Write vs. Send: with RDMA Write there isn't any consumption of a Receive Request (fewer PCI transactions),
and as soon as data is received, the device knows the address it should be written to (no delay until the Receive Request is fetched).
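To illustrate the difference, here is a rough sketch of posting the same buffer either as an RDMA Write or as a Send (the QP, the local S/G entry and the remote address/rkey are assumed to be prepared elsewhere):

#include <string.h>
#include <infiniband/verbs.h>

/* Posts either an RDMA Write or a Send of the same buffer; the only
 * difference is the opcode and, for the Write, the remote address/rkey. */
static int post_write_or_send(struct ibv_qp *qp, struct ibv_sge *sge,
                              uint64_t remote_addr, uint32_t rkey,
                              int use_rdma_write)
{
        struct ibv_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 1;
        wr.sg_list    = sge;
        wr.num_sge    = 1;
        wr.send_flags = IBV_SEND_SIGNALED;

        if (use_rdma_write) {
                /* The responder writes the data directly to remote_addr;
                 * no Receive Request is consumed on the remote side */
                wr.opcode              = IBV_WR_RDMA_WRITE;
                wr.wr.rdma.remote_addr = remote_addr;
                wr.wr.rdma.rkey        = rkey;
        } else {
                /* The responder must have posted a Receive Request and
                 * fetches it to learn where to place the data */
                wr.opcode = IBV_WR_SEND;
        }

        return ibv_post_send(qp, &wr, &bad_wr);
}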
I hope that this was clear enough
:)
Dotan
Hi Dotan,
Can you please shed some light on the technique/feature which improves latency in RDMA?
The only reason I can imagine is that one DMA is saved, which otherwise is needed to fetch the recv WQE in the case of send/recv.
I believe that I answered you in the previous comment...
Hi! Given that we want to transfer 100 pages (contiguous or not) through RDMA READ, which way is more efficient: 100 WRs with only one SGE each, or 10 WRs with 10 SGEs each? Thanks very much!
It is HW specific.
However, IMHO 10 WRs with 10 S/G entries will be more efficient than the other option,
since the overhead of checking the Send Request attributes that are not related to the S/G entries will be reduced.
For example: checking that the QP exists, checking whether the WQ is full, etc.
I would suggest writing a benchmark to be sure.
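For reference, here is a sketch of a single RDMA Read Work Request that uses a list of 10 S/G entries (the QP, MR, remote address and rkey are placeholders; note that each Read WR covers one contiguous remote region, while the local S/G list may scatter it):

#include <string.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Builds one RDMA Read WR that scatters a contiguous remote region into
 * 10 (possibly non-contiguous) local pages. */
static int post_read_10_pages(struct ibv_qp *qp, struct ibv_mr *mr,
                              void *local_pages[10],
                              uint64_t remote_addr, uint32_t rkey,
                              uint32_t page_size)
{
        struct ibv_sge sge[10];
        struct ibv_send_wr wr, *bad_wr;
        int i;

        for (i = 0; i < 10; i++) {
                sge[i].addr   = (uint64_t)(uintptr_t)local_pages[i];
                sge[i].length = page_size;
                sge[i].lkey   = mr->lkey;
        }

        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 1;
        wr.sg_list             = sge;
        wr.num_sge             = 10;   /* must not exceed the QP's max_send_sge */
        wr.opcode              = IBV_WR_RDMA_READ;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);
}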
Thanks
Dotan
Can RDMA be implemented inside the kernel, or as a kernel module?
I want it to do its job transparently.
Is there a kernel-level implementation that uses only kernel headers?
Hi.
Yes, RDMA can work at the kernel level.
IPoIB is an example of such a module.
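For reference, here is a minimal sketch of a kernel module that registers itself as a client of the RDMA core (this follows the older 3.x/4.x-era in-kernel API; struct ib_client and its callback signatures have changed in newer kernels):

#include <linux/module.h>
#include <rdma/ib_verbs.h>

/* Called for every RDMA device that is (or becomes) available */
static void demo_add_device(struct ib_device *device)
{
        pr_info("demo: found RDMA device %s\n", device->name);
}

static void demo_remove_device(struct ib_device *device)
{
        pr_info("demo: RDMA device %s removed\n", device->name);
}

static struct ib_client demo_client = {
        .name   = "demo_client",
        .add    = demo_add_device,
        .remove = demo_remove_device,
};

static int __init demo_init(void)
{
        return ib_register_client(&demo_client);
}

static void __exit demo_exit(void)
{
        ib_unregister_client(&demo_client);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");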
Thanks
Dotan
Hello again, my good sir!
Finally having my Mellanox Ex III/20Gbps cards (MT25208's) installed, I came across some good info and decided to switch to Debian 8 instead of SLES 11 SP4. The cards seem to be detected, yet I'm having more than a bit of bother.
I have two machines, HPV00 & HPV01, respectively. Both 4x PCIE cards are in 8x PCIE 1.0 slots.
When attempting to connect each card's respective port 0 to the other, I can only get them to link @ 2.5 Gbps (HPV00 Port 0 to HPV01 Port 0). When connecting HPV00 Port 0 to HPV01 Port 1, I get an ibstate rate of 10 Gb/sec (4X). Connecting HPV00 Port 0 to HPV00 Port 1 returns a linked rate of 20 Gb/sec (4X DDR)... per card specs.
I am unable to get IPoIB operational, thus unable to verify that traffic is working (as advised in this article).
I think I bungled up my port rates by not knowing how to use ibportstate properly. How can I ensure I've properly reset {port, node, etc.} GUIDs/LIDs back to their default states and/or how can I force 4X DDR on each port?
I am using an OpenSM 3.3.18 config (/etc/opensm/opensm.conf) from the Debian repos, not Mellanox OFED. Apologies: I should have called "Port 0" "Port 1", etc., per ibstat & ibstatus.
Linux hpv00 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux
Linux hpv01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux
HPV00 & HPV01 lsmod ib
http://pastebin.com/ELtfTE8x
HPV00 mthca Port 0 to HPV01 mthca Port 1
http://pastebin.com/0PgKkLpq
HPV00 & HPV01 mthca ibportstate history
http://pastebin.com/d8kZHaEk
Any advice is appreciated.
CORRECTION:
Regarding the "HPV00 mthca Port 0 (really "Port 1") to HPV01 mthca Port 1 (really "Port 2")" pastebin, there is some misinfo in it. Checking against the OpenSM status's "Loading Cached Option:guid = 0x0002c90200223ac9",
ll /var/log/opensm.0x0002c90200223ac9.log returns 542798 [7A1EB700] 0x02 -> SUBNET UP
Apologies for adding to the confusion. Please advise if there's any more info I can provide.
A better log for my comment stating my correction:
HPV00 log OpenSM bound Port GUID after fresh SM restart
http://pastebin.com/RT2Unqwd
Hi.
Sorry, I had a lot to handle and failed to answer until now.
Do you still have a problem?
What is the output of ibv_devinfo?
(you can send me this by mail)
Thanks
Dotan
Hello Dotan -
It's a pleasure to hear from you.
Since posting 2016-08-20, I switched my connected port on HPV01 mthca from Port 1 (GUID 0x0002c90200223719) to Port 2 (GUID 0x0002c9020022371a) as HPV01 Port 1 was only showing LinkUp of 2.5/SDR when connected to HPV00 mthca Port 1 (HPV00 GUID 0x0002c90200223ac9). I have only one cable.
I have successfully set up IPoIB via HPV00 Port 1 to HPV01 Port 2 ... though, as stated above, it's only connected @ 10Gbps/4X. Of course, I would prefer to be able to ensure all ports are running at 20Gbps/4X DDR.
As requested & for the sake of anyone stumbling across this thread, here's the ibv_devinfo && ibstatus && ibstat && iblinkinfo && ibportstate -D 0 1 && ibdiagnet -lw 4x -ls 5 -c 1000 && lspci -Qvvs && cat /sys/class/net/{ib1, ib0}/mode && uname -a for both HPV01 && HPV00.
HPV01 Port 2 to HPV00 Port1 - 4X - Jessie
http://pastebin.com/4Lkparcm
HPV00 Port 1 to HPV01 Port 2 - 4X - Jessie
http://pastebin.com/5AwqNUAB
Looking forward to your insights.
(Note: it seems I'm unable to reply to your 2016-09-16 response as max. thread depth seems reached.)
Hi.
I suggest ignoring the port attributes before the SM has configured the fabric;
I can see that this is the case since the logical port is INITIALIZING and not ACTIVE.
The SM will configure the ports to use maximum possible values.
Thanks
Dotan
Hi, I have downloaded the Linux distro and the corresponding libs as specified in the link below.
https://community.mellanox.com/docs/DOC-2184
I did exactly what the above link specifies.
I am able to do ibv_rc_pingpong on both client and server.
But when I try to do rping, it's not working. Any suggestions here will help me a lot.
Hi.
I can't answer if I don't know what the problem is;
rping works on any RDMA device (from all vendors):
* InfiniBand - if the IPoIB I/F is up and configured
* RoCE - if the I/F is configured
Thanks
Dotan
I do not get it. There is an inbox driver and a Mellanox OFED driver. The inbox driver is built and ready to use. When you install the Mellanox OFED driver, it uninstalls some kernel parts and then inserts itself with some extra stuff. Why would someone bother to do this? What are the advantages of using the Mellanox OFED driver over the inbox driver?
thnx
Hi.
A word of ethics: I'm currently a Mellanox Technologies employee.
Now, to the answer:
The inbox driver is a relatively old driver which is based on code which was accepted by the upstream kernel.
MLNX-OFED contains the most updated code with some features/enhancements that:
a) weren't (yet) submitted to the upstream kernel due to time limitations
b) were merged to the upstream kernel but not yet released in the inbox driver (by any Linux distribution)
c) were rejected by the community
The downside of this is that you change the kernel modules that you load,
with all the implications of this...
Thanks
Dotan
Hi Dotan,
One question about the IB perf tests (I couldn't find a more relevant rdmamojo page to ask this question).
First let me describe my use case:
So I'm planning to limit the bandwidth of InfiniBand temporarily (my final goal is to vary the bandwidth and see its impact on my application). The solution I came up with is to use the InfiniBand perf tests (e.g. ib_read_bw) in the background to consume part of the bandwidth. For example, by having an ib_read_bw that consumes 9GB/sec of my 10GB/sec network, I will have 1GB/sec left for my application.
Now my questions:
1- Is there any better, more standard way to limit (or throttle) the bandwidth?
2- Is there a way to prioritize the ib_read_bw packets over my application packets, so that I will be sure that 9GB/sec is dedicated to ib_read_bw, and my app will not steal that.
3- There is a flag in ib_read_bw (-w or --limit_bw) that seems to be perfect for me, but I can't seem to get it to work properly. What I do is:
on the server: ib_read_bw -w 5
on the client: ib_read_bw SERVER_IP -w 5
but the final report indicates that the bandwidth was not limited.
What did I do wrong?
Thank you
Hi.
AFAIK, the only way to limit the BW is to use the rate limit in the Address Vector
(the whole point of RDMA is best performance - not lowering it).
AFAIK, there isn't any tool that allows controlling the effective BW.
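As a sketch of where that rate limit lives (in libibverbs it is the static_rate field of the Address Vector), here is an example of capping a connected QP while moving it to RTR; all the other attributes are placeholders that normally come from the connection establishment:

#include <string.h>
#include <infiniband/verbs.h>

/* Caps the static rate of an RC QP while moving it to RTR. */
static int modify_to_rtr_with_rate(struct ibv_qp *qp, uint32_t dest_qp_num,
                                   uint16_t dlid, uint8_t port_num)
{
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state           = IBV_QPS_RTR;
        attr.path_mtu           = IBV_MTU_1024;
        attr.dest_qp_num        = dest_qp_num;
        attr.rq_psn             = 0;
        attr.max_dest_rd_atomic = 1;
        attr.min_rnr_timer      = 12;
        attr.ah_attr.dlid       = dlid;
        attr.ah_attr.port_num   = port_num;
        /* Ask the device to pace this connection at (roughly) 10 Gbit/sec */
        attr.ah_attr.static_rate = IBV_RATE_10_GBPS;

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                             IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                             IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
}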
Thanks
Dotan
I have a basic question - I think. I am interested in using RDMA to get data from a Mellanox card to my GPU. The potential wrinkle is that the data is sourced by a non-GPU server that is just spewing out a datastream.
The other question is whether I can verify RDMA using a single server that has 2 GPUs and 2 Mellanox cards. Do I need an external switch?
Thanks in advance.
Hi.
1) Let me see if I understand your question:
You have a computer (without a GPU) that has data and you want another computer to take this data
and write/use it with a GPU.
I don't see any problem with this - it will work;
the GPU isn't really a factor here...
2) I don't understand what the expected topology is.
You can use the following topology:
device 1 port 1 -> device 2 port 1
device 1 port 2 -> device 2 port 2
And you won't need any switch.
If you want full connectivity between all the ports, you'll need a switch
(since in the described topology, you can't send any message from port 1 to port 2 in any device)
Thanks
Dotan
Hi Dotan,
I hope this post finds you well! All the above tests are working for me except for 'ibv_rc_pingpong'. I am receiving the following error on the client side: Failed status transport retry counter exceeded (12) for wr_id 2. Please see the full output below.
Have you ever encountered such an issue? Any tips/advice would be much appreciated!
Thanks, in advance!
Best,
Hamed
Server:
ibv_rc_pingpong -g 4 -d mlx5_0
local address: LID 0x0000, QPN 0x00067c, PSN 0x8dccae, GID ::ffff:192.168.1.2
remote address: LID 0x0000, QPN 0x000716, PSN 0xbda7d2, GID ::ffff:192.168.1.3
Client:
ibv_rc_pingpong -g 4 -d mlx5_0 192.168.1.2
local address: LID 0x0000, QPN 0x000716, PSN 0xbda7d2, GID ::ffff:192.168.1.3
remote address: LID 0x0000, QPN 0x00067c, PSN 0x8dccae, GID ::ffff:192.168.1.2
Failed status transport retry counter exceeded (12) for wr_id 2
Hi.
Many reasons can cause this problem,
I don't have enough information here to understand what went wrong
(MTU too big? Network interface IPs aren't configured? SM wasn't executed - for IB?).
Thanks
Dotan