Working with IPoIB
In this post, I will explain how to work with IPoIB: loading/unloading the IPoIB module, configuring it and verifying that it works.
IPoIB kernel module control
Verifying that the IPoIB module is loaded
The IPoIB kernel module name in Linux is 'ib_ipoib'. The following command line verifies that it is loaded:
[root@localhost]# lsmod | grep ipoib
ib_ipoib               68315  0
ib_cm                  30987  3 ib_ipoib,ib_ucm,rdma_cm
ib_sa                  19056  4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
ib_core                59893  11 pib,ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad
ipv6                  261354  33 ib_ipoib,ib_addr,ip6t_REJECT,nf_conntrack_ipv6,nf_defrag_ipv6
The above output is an example of a machine in which the IPoIB module is loaded. One should pay attention to the 'ib_ipoib' at the beginning of the first line. If this module isn't loaded, one needs to load it.
Note: This output may look different in different operating systems or configurations (e.g., the names of the loaded kernel modules may differ).
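For scripts, the same check can be done without reading the 'lsmod' output by eye. The following is a minimal sketch, assuming a Linux system where /proc/modules lists the loaded modules with the module name in the first column; the function name 'module_loaded' and its optional second argument are made up for this example.

```shell
# Returns 0 if the named module appears in a /proc/modules-style listing.
# The optional second argument (path of the listing) is an assumption added
# for illustration; it defaults to /proc/modules.
module_loaded() {
    module_name=$1
    modules_file=${2:-/proc/modules}
    # The module name is the first whitespace-separated field of each line,
    # so a module that only appears in a "used by" list is not matched.
    awk -v m="$module_name" '$1 == m { found = 1 } END { exit !found }' "$modules_file"
}
```

For example, "module_loaded ib_ipoib || echo 'ib_ipoib is not loaded'" prints a message only when the module is missing.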
Loading the IPoIB module
One can load the IPoIB module as part of the RDMA service of the OS/system. To do so, IPoIB should be set to 'load' in the configuration file of the RDMA stack.
The following command line will load the IPoIB module manually:
[root@localhost]# modprobe ib_ipoib
Unloading the IPoIB module
One can unload the IPoIB module as part of the RDMA service of the OS/system.
The following command line will unload the IPoIB module manually:
[root@localhost]# modprobe -r ib_ipoib
IPoIB network interface control
Show available IPoIB network interfaces
When the IPoIB kernel module is loaded, for every port of every local InfiniBand device, a network interface will be created. The following command shows the network interfaces that are available in the machine:
[root@localhost]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:0C:29:63:2D:42
          inet addr:192.168.2.106  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe63:2d42/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:409 errors:0 dropped:0 overruns:0 frame:0
          TX packets:544 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:51996 (50.7 KiB)  TX bytes:175118 (171.0 KiB)
          Interrupt:18 Base address:0x2000

Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
ib0       Link encap:InfiniBand  HWaddr 80:36:24:52:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          BROADCAST MULTICAST  MTU:4092  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
ib1       Link encap:InfiniBand  HWaddr 80:36:24:53:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          BROADCAST MULTICAST  MTU:4092  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
The IPoIB network interfaces are the ones that have the prefix 'ib'.
As one can see, 'ifconfig' prints a warning that it cannot display the hardware (MAC) address of an IPoIB interface correctly: the address has 20 bytes, and only the first 8 bytes are shown. Fortunately, the 'ip' command can show the full MAC address of such a network interface. The following command shows the MAC address of the network interface 'ib0':
[root@localhost]# ip addr show ib0
3: ib0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN qlen 256
    link/infiniband 80:36:24:52:fe:80:00:00:00:00:00:00:00:0c:29:63:2d:42:03:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
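Relying on the 'ib' prefix works in most setups, but interface names can be changed. A minimal sketch of a name-independent check follows, assuming the usual sysfs layout in which /sys/class/net/[interface name]/type holds the hardware type of the interface and the value 32 stands for InfiniBand (ARPHRD_INFINIBAND). The function name 'list_ipoib_interfaces' and its optional directory argument are made up for this example.

```shell
# Prints the names of all network interfaces whose hardware type is
# InfiniBand (type 32). The optional argument (directory to scan) is an
# assumption added for illustration; it defaults to /sys/class/net.
list_ipoib_interfaces() {
    net_dir=${1:-/sys/class/net}
    for d in "$net_dir"/*; do
        # Skip entries that do not expose a readable "type" file.
        [ -r "$d/type" ] || continue
        if [ "$(cat "$d/type")" = "32" ]; then
            echo "${d##*/}"
        fi
    done
}
```

Calling "list_ipoib_interfaces" without arguments scans the interfaces of the local machine.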
Configuring IP address of IPoIB interface
Configuring IPoIB interface manually
By default, unless someone configures the IPoIB network interfaces manually or automatically, those interfaces don't have any configured IP address and are down. One can configure those network interfaces, like any other network interface, using 'ifconfig'. The following command line assigns an IP address and netmask to the IPoIB network interface 'ib0':
[root@localhost]# ifconfig ib0 11.12.1.1/16
This configuration takes effect immediately without restarting the machine or any service. However, it isn't persistent and will disappear on machine reboot.
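Since 'ifconfig' itself reports that it is obsolete and points to 'ip', here is a minimal sketch of the same configuration with 'ip', assuming the iproute2 package is installed and the commands run as root. The function name 'configure_ipoib_addr' is made up for this example. 'ip addr add' only assigns the address, so the interface is brought up with a separate command.

```shell
# Assigns an address (in CIDR notation) to an interface and brings it up.
# Like the ifconfig variant, this is not persistent across reboots.
configure_ipoib_addr() {
    iface=$1
    cidr=$2
    ip addr add "$cidr" dev "$iface" && ip link set dev "$iface" up
}

# Usage (as root): configure_ipoib_addr ib0 11.12.1.1/16
```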
Now that the IP address of that network interface is configured, it is ready to work and looks like this:
[root@localhost]# ifconfig ib0
Ifconfig uses the ioctl access method to get the full address information, which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed correctly.
Ifconfig is obsolete! For replacement check ip.
ib0       Link encap:InfiniBand  HWaddr 80:36:24:54:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:11.12.1.1  Bcast:11.12.255.255  Mask:255.255.0.0
          inet6 addr: fe80::20c:2963:2d42:301/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:5 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
Configuring IPoIB interface using configuration file
An IPoIB network interface can be configured using Linux network configuration files, like any other network interface. One should create a configuration file for every network interface and fill it with the needed information. Here is an example of such a configuration file, '/etc/sysconfig/network-scripts/ifcfg-ib0', which configures the network interface 'ib0':
DEVICE=ib0
BOOTPROTO=none
# This line ensures that the interface will be brought up during boot.
ONBOOT=yes
# ib0 - This is the main IP address that will be used for most outbound connections.
# The address, netmask, and gateway are all necessary.
IPADDR=11.12.1.1
NETMASK=255.255.0.0
GATEWAY=11.12.100.1
When this configuration file exists, every time the networking service creates this network interface, it will configure those attributes automatically.
Loading this configuration file after creating or changing it can be done by rebooting the machine or by restarting the networking service; the exact command depends on the distribution. In many Linux distributions, the following command line will restart the networking service:
[root@localhost]# service network restart
Controlling IPoIB mode
As explained in the previous post, IPoIB has two working modes:
- Datagram mode
- Connected mode
The working mode of every IPoIB network interface is independent: different interfaces, even on the same physical InfiniBand device, can work in different modes.
Changing the working mode of IPoIB interfaces using a configuration file
Some RDMA distributions, such as MLNX-OFED, support configuring the IPoIB working mode using the service configuration file. In those distributions, setting the parameter 'SET_IPOIB_CM' to 'yes' will configure all available IPoIB network interfaces to Connected mode. Otherwise, they will be loaded in Datagram mode.
Changing an IPoIB interface to work in Datagram mode manually
Changing the working mode of an IPoIB network interface to work in datagram mode can be done by writing the word 'datagram' to a control file in the sysfs of that interface. For example:
[root@localhost]# echo datagram > /sys/class/net/ib0/mode
Changing an IPoIB interface to work in Connected mode manually
Changing the working mode of an IPoIB network interface to work in connected mode can be done by writing the word 'connected' to a control file in the sysfs of that interface. For example:
[root@localhost]# echo connected > /sys/class/net/ib0/mode
Checking working mode of an IPoIB interface
Checking the working mode of an IPoIB network interface can be done by printing the content of the file that controls the working mode. For example:
[root@localhost]# cat /sys/class/net/ib0/mode
connected
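The three operations above (switching to datagram mode, switching to connected mode and checking the current mode) can be wrapped in two small helpers. This is only a sketch: the function names and the 'SYSFS_NET' variable are made up for this example, and writing to the real control file still requires root.

```shell
# SYSFS_NET is an assumption added for illustration: it allows pointing the
# helpers at another directory tree; it defaults to /sys/class/net.

# Prints the current working mode of an IPoIB interface.
ipoib_get_mode() {
    cat "${SYSFS_NET:-/sys/class/net}/$1/mode"
}

# Sets the working mode, rejecting anything other than the two known modes
# before touching the control file.
ipoib_set_mode() {
    iface=$1
    mode=$2
    case $mode in
        datagram|connected) ;;
        *) echo "unknown IPoIB mode: $mode" >&2; return 1 ;;
    esac
    echo "$mode" > "${SYSFS_NET:-/sys/class/net}/$iface/mode"
}
```

For example, "ipoib_set_mode ib0 connected && ipoib_get_mode ib0" switches 'ib0' to connected mode and prints the mode that is now in effect.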
Configuring the MTU of an IPoIB network interface
As with any other network interface, the MTU of an IPoIB network interface can be changed.
The maximum supported MTU value of an IPoIB network interface depends on its working mode:
- For datagram mode, the maximum MTU value is the IPoIB multicast MTU size minus the IPoIB encapsulation header (4 bytes):
  - For 2KB IB MTU: the maximum MTU can be 2044 bytes.
  - For 4KB IB MTU: the maximum MTU can be 4092 bytes.
- For connected mode, the maximum MTU can be 65520 bytes.
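The rules above boil down to simple arithmetic. The following sketch only restates them; the function name 'ipoib_max_mtu' is made up for this example, and its second argument is assumed to be the IB (multicast) MTU in bytes, which matters only in datagram mode.

```shell
# Prints the maximum IPoIB MTU for a working mode.
#   datagram:  IB MTU minus the 4-byte IPoIB encapsulation header
#   connected: 65520 bytes
ipoib_max_mtu() {
    mode=$1
    ib_mtu=${2:-0}
    case $mode in
        connected) echo 65520 ;;
        datagram)  echo $(( $ib_mtu - 4 )) ;;
        *) echo "unknown IPoIB mode: $mode" >&2; return 1 ;;
    esac
}
```

For example, "ipoib_max_mtu datagram 2048" prints 2044 and "ipoib_max_mtu datagram 4096" prints 4092, which are the two values listed above.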
Changing IPoIB interface MTU manually
The MTU of an IPoIB network interface can be changed, like that of any other network interface, using 'ifconfig'. The following command line changes the MTU of the network interface 'ib0' to 2000 bytes:
[root@localhost]# ifconfig ib0 mtu 2000
Changing IPoIB interface MTU with a configuration file
Some RDMA distributions, such as MLNX-OFED, support configuring the IPoIB MTU using the service configuration file. In most such distributions, setting the parameter 'IPOIB_MTU' sets the MTU size that is used when working in connected mode. Otherwise, the interface's MTU won't be changed and the interface will work with the default value.
Another option is to change the system configuration file of the IPoIB network interface, which makes this configuration persistent across machine reboots. Edit the interface configuration file /etc/sysconfig/network-scripts/ifcfg-[interface name] (different Linux distributions may keep this file in another location) and set the 'MTU' line to the MTU size in bytes. For example, the following line in the configuration file will configure the MTU to be 2000 bytes:
MTU="2000"
Partitioning in IPoIB (VLAN equivalent)
When the IPoIB driver is loaded, it creates, by default, one interface for each port of the available InfiniBand devices using the P_Key value at index 0 of the P_Key table in that port.
Configuring OpenSM to support partitions
When 'OpenSM' runs on the host, one can change its configuration file to support more partitions. The configuration file /etc/rdma/partitions.conf controls the partitions that will be configured in the subnet by OpenSM. Different versions of OpenSM may use a different default location for the partition file (the man page of the installed OpenSM will show that location). One can use the '-P' parameter to point explicitly to a specific path of this file.
Here is an example of such a configuration file, which configures the default P_Key (0xffff) and another P_Key (0x8001) in the fabric:
Default=0x7fff, ipoib: ALL=full ;
MyNet0=0x0001, ipoib: ALL=full ;
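The valid P_Key values in this configuration file are 0x0001-0x7fff. The most significant bit of a P_Key is the membership bit, which is set for full members; this is why partition 0x0001, configured with full membership, appears on the ports as P_Key 0x8001.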
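The relation between the value written in the partition file and the value seen on a port can be sketched as follows, assuming the InfiniBand convention that bit 15 of a P_Key marks full membership. The function name 'pkey_full_member' is made up for this example.

```shell
# Prints the P_Key that a full member of the given partition sees, by
# setting the membership bit (0x8000) on the 15-bit partition value.
pkey_full_member() {
    printf '0x%04x\n' $(( $1 | 0x8000 ))
}
```

For example, "pkey_full_member 0x0001" prints 0x8001 and "pkey_full_member 0x7fff" prints 0xffff.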
Now that the configuration file has been updated, one should restart OpenSM. This can be done using the following command line:
[root@localhost]# service opensmd restart
Note: OpenSM supports configuring specific ports with full membership, whereas other ports will be configured with partial membership. Explaining how to do so is out of the scope of this post.
Verifying the configured partitions on local port
The configured partitions for every InfiniBand device's port can be found in the sysfs. For example, the following command line prints the non-zero configured partitions in port 1 of the InfiniBand device 'mlx4_0':
[root@localhost]# cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/* | grep -v 0x0000
0xffff
0x8001
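When the index of every P_Key in the table is also of interest, the files can be read one by one. This is a minimal sketch, assuming that each file under the 'pkeys' directory is named after its table index and holds one value; the function name 'list_pkeys' is made up for this example.

```shell
# Prints "index value" for every non-zero entry of a P_Key table directory,
# for example /sys/class/infiniband/mlx4_0/ports/1/pkeys.
list_pkeys() {
    pkeys_dir=$1
    for f in "$pkeys_dir"/*; do
        [ -f "$f" ] || continue
        value=$(cat "$f")
        # Entries that hold 0x0000 are unused table slots.
        if [ "$value" != "0x0000" ]; then
            printf '%s %s\n' "${f##*/}" "$value"
        fi
    done
}
```

Note that the shell expands the file names in lexical order, so in a large table index 10 is printed before index 2.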
Creating a network interface with a P_Key
To create an interface with a different P_Key, write the desired P_Key value into the main interface's /sys/class/net/[interface name]/create_child file. For example, the following command will create a child interface using P_Key 0x8001 for the IPoIB network interface 'ib0':
[root@localhost]# echo 0x8001 > /sys/class/net/ib0/create_child
A new network interface named 'ib0.8001', with P_Key value 0x8001, will be created.
Note: An IPoIB network interface can be created using a P_Key value even if that P_Key value isn't configured in that port's P_Key table.
Removing a network interface with a P_Key
To remove a subinterface that was created with a specific P_Key, write this P_Key value into the main interface's /sys/class/net/[interface name]/delete_child file. For example:
[root@localhost]# echo 0x8001 > /sys/class/net/ib0/delete_child
Verifying that IPoIB is working
Since IPoIB provides a fully functional and working network interface, to verify that it is appropriately configured, one can use 'ping' to a remote IP address of another IPoIB network interface in the subnet and verify that there aren't any dropped packets:
[root@localhost]# ping 12.4.12.3
PING 12.4.12.3 56(84) bytes of data.
64 bytes from localhost (12.4.12.3): icmp_seq=1 ttl=64 time=0.051 ms
64 bytes from localhost (12.4.12.3): icmp_seq=2 ttl=64 time=0.055 ms
^C
--- localhost ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.051/0.053/0.055/0.002 ms
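In a script, the check for dropped packets can be automated by reading the packet loss figure from the summary line that 'ping' prints. This is a minimal sketch, assuming a summary line in the format shown above; the function name 'packet_loss_percent' is made up for this example.

```shell
# Reads ping output on stdin and prints the packet loss percentage taken
# from the "... N% packet loss ..." summary line.
packet_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Usage: send 3 packets through ib0 and print the loss percentage:
#   ping -c 3 -I ib0 12.4.12.3 | packet_loss_percent
```

The '-c' option makes 'ping' stop by itself after the given number of packets, and '-I' selects the interface to send from, which is useful when the machine also has an Ethernet route to the same address.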
Comments
Dotan thank you VERY MUCH for taking the time to write this.
Sincerely,
Stephen
:)
Thanks
Dotan
Hi Dotan,
can u clarify gen2 ib stack or gen-2 rdma
was there a gen1? what is it?
Hi.
Gen2 are the libibverbs and the RDMA core which currently exists in the Linux kernel
(all of the above are published in the OFED package).
Gen1 was based on VAPI (verbs implementation released by Mellanox Technologies),
(which was published in the IBGD package). This package supports really old HCAs.
You can refer to gen1 as ancient history
:)
Thanks
Dotan
Hi, What if I do not have sudo permission?
Thanks a lot!
Yang
Hi.
Unless system is configured to allow this,
you'll have trouble locking enough memory pages required for RDMA resources.
Thanks
Dotan
Hi Dotan.
If I unload the IPoIB module and run tcp applications on RDMA NICs, does it means that I use kernel tcp stack for communication?
Thanks.
Yes.
Thanks
Dotan
Hi Dotan,
it was nice blog but it raised my curiosity about the real use of IPoIB.
can you please throw some light on following:
1. Is there any real world application/service which uses IPoIB ?
2. Is it developed for legacy (non MPI)applications to be able to run over high speed application. if yes, any example of such legacy applications ?
3. Is it targeted for only Data center workload OR has any significance in HPC cluster/machine ?
Hi.
1. You can build a cluster with only IB adapters and use IP-based applications over IPoIB
Or, you can use IPoIB interfaces for data fabric (for example, MPI, etc.).
2. It is developed for *ANY* IP-based application (ping, FTP, ssh, whatever).
Any application which opens a socket can use IPoIB interfaces.
3. It is targeted at any user that has IP-based (i.e. legacy) applications and wants to get better performance
(compared to the server's 10g NIC).
Thanks
Dotan