Working with IPoIB


In this post, I will explain how to work with IPoIB: loading/unloading the IPoIB module, configuring it and verifying that it works.

IPoIB kernel module control

Verifying that the IPoIB module is loaded

The IPoIB kernel module name in Linux is 'ib_ipoib'. The following command line verifies that it is loaded:

[root@localhost]# lsmod | grep ipoib
ib_ipoib               68315  0
ib_cm                  30987  3 ib_ipoib,ib_ucm,rdma_cm
ib_sa                  19056  4 ib_ipoib,rdma_ucm,rdma_cm,ib_cm
ib_core                59893  11 pib,ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad
ipv6                  261354  33 ib_ipoib,ib_addr,ip6t_REJECT,nf_conntrack_ipv6,nf_defrag_ipv6

The above output is an example from a machine on which the IPoIB module is loaded. One should pay attention to the 'ib_ipoib' at the beginning of the first line. If this module isn't loaded, one needs to load it.

Note: This output may look different on other operating systems or configurations (e.g., the names of the loaded kernel modules may differ).
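The check above can also be scripted. The following is a minimal sketch, assuming the module list is available in /proc/modules format (module name in the first column, which is also true for 'lsmod' output). The function name 'is_module_loaded' and its optional second argument are illustrative and are not part of any RDMA package.

```shell
# Report whether a kernel module is loaded by scanning a module list whose
# first whitespace-separated field is the module name.
# $1: module name (matched exactly, unlike 'grep ipoib')
# $2: optional list file; an illustrative hook that defaults to /proc/modules
is_module_loaded() {
    module=$1
    list=${2:-/proc/modules}
    # Exit status 0 if the name was seen in column one, 1 otherwise.
    awk -v m="$module" '$1 == m { found = 1 } END { exit !found }' "$list"
}
```

For example, 'is_module_loaded ib_ipoib && echo loaded' prints 'loaded' only when a module with exactly that name is present, so modules that merely depend on 'ib_ipoib' don't produce a false match.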

Loading the IPoIB module

One can load the IPoIB module as part of the RDMA service of the OS/system. In the configuration file of the RDMA stack, the IPoIB entry should be set to 'load'.

The following command line will load the IPoIB module manually:

[root@localhost]# modprobe ib_ipoib

Unloading the IPoIB module

One can unload the IPoIB module as part of the RDMA service of the OS/system.

The following command line will unload the IPoIB module manually:

[root@localhost]# modprobe -r ib_ipoib

IPoIB network interface control

Show available IPoIB network interfaces

When the IPoIB kernel module is loaded, for every port of every local InfiniBand device, a network interface will be created. The following command shows the network interfaces that are available in the machine:

[root@localhost]# ifconfig -a
eth0      Link encap:Ethernet  HWaddr 00:0C:29:63:2D:42  
          inet addr:192.168.2.106  Bcast:192.168.2.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe63:2d42/64 Scope:Link              
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1              
          RX packets:409 errors:0 dropped:0 overruns:0 frame:0            
          TX packets:544 errors:0 dropped:0 overruns:0 carrier:0          
          collisions:0 txqueuelen:1000                                    
          RX bytes:51996 (50.7 KiB)  TX bytes:175118 (171.0 KiB)          
          Interrupt:18 Base address:0x2000                                

Ifconfig uses the ioctl access method to get the full address information,
which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed
correctly.                              
Ifconfig is obsolete! For replacement check ip.                                                                       
ib0       Link encap:InfiniBand  HWaddr 80:36:24:52:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          BROADCAST MULTICAST  MTU:4092  Metric:1                                                                     
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0                                                          
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0                                                        
          collisions:0 txqueuelen:256                                                                                 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)                                                                      

Ifconfig uses the ioctl access method to get the full address information,
which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed
correctly.                              
Ifconfig is obsolete! For replacement check ip.                                                                       
ib1       Link encap:InfiniBand  HWaddr 80:36:24:53:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          BROADCAST MULTICAST  MTU:4092  Metric:1                                                                     
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0                                                          
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0                                                        
          collisions:0 txqueuelen:256                                                                                 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

The IPoIB network interfaces are the ones that have the prefix 'ib'.

As one can see, 'ifconfig' prints a warning that it cannot print the full MAC address, because an IPoIB hardware address is 20 bytes long. Fortunately, the 'ip' command can show the full MAC address of such a network interface. The following command shows the MAC address of the network interface 'ib0':

[root@localhost]# ip addr show ib0
3: ib0: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN qlen 256
    link/infiniband 80:36:24:52:fe:80:00:00:00:00:00:00:00:0c:29:63:2d:42:03:01 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
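The 20 bytes printed by 'ip' are not an opaque value. As a hedged illustration, the sketch below assumes the layout described in RFC 4391: one reserved/flags byte, a 24-bit QP number and the 16-byte port GID. The function name 'decode_ipoib_hwaddr' and its output format are made up for this example.

```shell
# Split a 20-byte IPoIB hardware address (colon-separated hex, as printed by
# 'ip addr show') into the QPN (bytes 2-4) and the port GID (bytes 5-20).
# Assumed layout (RFC 4391): 1 reserved/flags byte, 3-byte QPN, 16-byte GID.
decode_ipoib_hwaddr() {
    hex=$(printf '%s' "$1" | tr -d ':' | tr 'A-F' 'a-f')
    if [ "${#hex}" -ne 40 ]; then
        echo "expected a 20-byte address, got: $1" >&2
        return 1
    fi
    qpn=$(printf '%s' "$hex" | cut -c3-8)
    # Group the 32 hex digits of the GID into 16-bit words separated by ':'.
    gid=$(printf '%s' "$hex" | cut -c9-40 | sed 's/..../&:/g; s/:$//')
    echo "qpn=0x$qpn gid=$gid"
}
```

Running it on the address of 'ib0' shown above prints 'qpn=0x362452 gid=fe80:0000:0000:0000:000c:2963:2d42:0301'.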

Configuring IP address of IPoIB interface

Configuring IPoIB interface manually

By default, unless someone configures the IPoIB network interfaces manually or automatically, those interfaces don't have any configured IP address and are down. One can configure those network interfaces, like any other network interface, using 'ifconfig'. The following command line assigns an IP address and netmask to the IPoIB network interface 'ib0':

[root@localhost]# ifconfig ib0 11.12.1.1/16

This configuration takes effect immediately without restarting the machine or any service. However, it isn't persistent and will disappear on machine reboot.
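Since 'ifconfig' itself reports that it is obsolete, the same manual configuration can be done with the 'ip' command. The following is a minimal sketch; the function name 'ipoib_assign_addr' and the DRY_RUN switch are illustrative additions and not part of iproute2. Unlike 'ifconfig', 'ip addr add' doesn't bring the interface up, so the link is brought up explicitly.

```shell
# Assign an address to an IPoIB interface with iproute2 instead of ifconfig.
# $1: interface name, $2: address in CIDR notation.
# DRY_RUN=1 (an illustrative switch) prints the commands instead of running
# them, which is handy for checking what would be executed.
ipoib_assign_addr() {
    dev=$1
    cidr=$2
    for cmd in "ip addr add $cidr dev $dev" "ip link set dev $dev up"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$cmd"
        else
            # Intentionally unquoted so the string is split into words.
            $cmd || return 1
        fi
    done
}
```

With DRY_RUN=1, 'ipoib_assign_addr ib0 11.12.1.1/16' prints the two 'ip' commands that match the 'ifconfig' example above. As with 'ifconfig', this configuration isn't persistent.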

Now that the IP address of that network interface is configured, it is ready to work and looks like this:

[root@localhost]# ifconfig ib0
Ifconfig uses the ioctl access method to get the full address information,
which limits hardware addresses to 8 bytes.
Because Infiniband address has 20 bytes, only the first 8 bytes are displayed
correctly.
Ifconfig is obsolete! For replacement check ip.
ib0       Link encap:InfiniBand  HWaddr 80:36:24:54:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:11.12.1.1  Bcast:11.12.255.255  Mask:255.255.0.0
          inet6 addr: fe80::20c:2963:2d42:301/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:5 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

Configuring IPoIB interface using configuration file

An IPoIB network interface can be configured using Linux network configuration files, like any other network interface. One should create a configuration file for every network interface and fill it with the needed information. Here is an example of such a configuration file, '/etc/sysconfig/network-scripts/ifcfg-ib0', which configures the network interface 'ib0':

# Configuration for ib0
DEVICE=ib0
BOOTPROTO=none
# This line ensures that the interface will be brought up during boot.
ONBOOT=yes
# ib0 - This is the main IP address that will be used for most outbound connections.
# The address, netmask, and gateway are all necessary.
IPADDR=11.12.1.1
NETMASK=255.255.0.0
GATEWAY=11.12.100.1

When this configuration file exists, every time the networking service creates this network interface, it will configure those attributes automatically.

Loading this configuration file after creating/changing it can be done by rebooting the machine or restarting the networking service, according to the distribution. In many Linux distributions, the following command line will restart the networking service:

[root@localhost]# service network restart
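When several IPoIB interfaces need such a file, generating it can be scripted. The sketch below is only an illustration: the function name 'write_ipoib_ifcfg' and its optional target-directory argument are hypothetical, and it writes only keys that appear in the example file above (without the gateway).

```shell
# Write a static ifcfg file for an IPoIB interface.
# $1: interface name, $2: IP address, $3: netmask
# $4: optional target directory; an illustrative hook that defaults to the
#     path used in the text
write_ipoib_ifcfg() {
    dev=$1
    ipaddr=$2
    netmask=$3
    dir=${4:-/etc/sysconfig/network-scripts}
    {
        printf '# Configuration for %s\n' "$dev"
        printf 'DEVICE=%s\n' "$dev"
        printf 'BOOTPROTO=none\n'
        printf 'ONBOOT=yes\n'
        printf 'IPADDR=%s\n' "$ipaddr"
        printf 'NETMASK=%s\n' "$netmask"
    } > "$dir/ifcfg-$dev"
}
```

For example, 'write_ipoib_ifcfg ib1 11.13.1.1 255.255.0.0' creates 'ifcfg-ib1' in the default directory. The networking service still has to be restarted for the file to take effect.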

Controlling IPoIB mode

As explained in the previous post, IPoIB has two working modes:

Datagram mode
Connected mode

The working mode of every IPoIB network interface is independent: different interfaces, even on the same physical InfiniBand device, can work in different modes.

Changing the IPoIB working mode using a configuration file

Some RDMA distributions, such as MLNX-OFED, support configuring the IPoIB working mode using the service configuration file. In those distributions, setting the parameter 'SET_IPOIB_CM' to 'yes' will configure all available IPoIB network interfaces to Connected mode. Otherwise, they will be loaded in Datagram mode.

Changing an IPoIB interface to work in Datagram mode manually

Changing the working mode of an IPoIB network interface to work in datagram mode can be done by writing the word 'datagram' to a control file in the sysfs of that interface. For example:

[root@localhost]# echo datagram > /sys/class/net/ib0/mode

Changing IPoIB interface to work in Connected mode manually

Changing the working mode of an IPoIB network interface to work in connected mode can be done by writing the word 'connected' to a control file in the sysfs of that interface. For example:

[root@localhost]# echo connected > /sys/class/net/ib0/mode

Checking working mode of an IPoIB interface

Checking the working mode of an IPoIB network interface can be done by printing the content of the sysfs file that controls the working mode. For example:

[root@localhost]# cat /sys/class/net/ib0/mode
connected
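Setting the mode and checking it can be combined into one step. The following is a minimal sketch; the function name 'set_ipoib_mode' and the SYSFS_NET variable are illustrative, and SYSFS_NET exists only so that the function can be tried against a scratch directory instead of the real sysfs.

```shell
# Switch an IPoIB interface to 'datagram' or 'connected' mode and confirm
# the change by reading the control file back.
# $1: interface name, $2: requested mode.
# SYSFS_NET is an illustrative override; it defaults to /sys/class/net.
set_ipoib_mode() {
    dev=$1
    mode=$2
    file=${SYSFS_NET:-/sys/class/net}/$dev/mode
    case $mode in
        datagram|connected) ;;
        *) echo "unknown mode: $mode" >&2; return 2 ;;
    esac
    [ -w "$file" ] || { echo "cannot write $file" >&2; return 1; }
    echo "$mode" > "$file" || return 1
    # Succeed only if the interface now reports the requested mode.
    [ "$(cat "$file")" = "$mode" ]
}
```

For example, 'set_ipoib_mode ib0 connected' returns a non-zero status if the mode file cannot be written or still reports another mode afterwards.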

Configuring the MTU of an IPoIB network interface

As with any other network interface, the MTU of an IPoIB network interface can be changed.
The maximum supported MTU value of an IPoIB network interface depends on its working mode.

For datagram mode, the maximum MTU value is the IPoIB multicast MTU size in use minus the IPoIB encapsulation header (4 bytes):
- For 2KB IB MTU: the maximum MTU can be 2044 bytes.
- For 4KB IB MTU: the maximum MTU can be 4092 bytes.

For connected mode, the maximum MTU can be 65520 bytes.
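The limits above boil down to simple arithmetic. The following sketch encodes them; the function name 'max_ipoib_mtu' is illustrative, and the IB MTU is passed in bytes (2048 for 2KB, 4096 for 4KB).

```shell
# Print the maximum IPoIB MTU for a working mode.
# $1: 'datagram' or 'connected'
# $2: IB MTU in bytes (needed for datagram mode only)
max_ipoib_mtu() {
    case $1 in
        datagram)
            [ -n "${2:-}" ] || { echo "datagram mode needs the IB MTU" >&2; return 1; }
            # Subtract the 4-byte IPoIB encapsulation header.
            echo $(( $2 - 4 ))
            ;;
        connected)
            echo 65520
            ;;
        *)
            echo "unknown mode: $1" >&2
            return 1
            ;;
    esac
}
```

For example, 'max_ipoib_mtu datagram 2048' prints 2044 and 'max_ipoib_mtu connected' prints 65520.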

Changing IPoIB interface MTU manually

The MTU of an IPoIB network interface can be configured, like that of any other network interface, using 'ifconfig'. The following command line changes the MTU of the network interface 'ib0' to 2000 bytes:

[root@localhost]# ifconfig ib0 mtu 2000

Changing IPoIB interface MTU with a configuration file

Some RDMA distributions, such as MLNX-OFED, support configuring the IPoIB MTU using the service configuration file. In most such distributions, the parameter 'IPOIB_MTU' sets the MTU size that is used when working in connected mode. If it isn't set, the interface's MTU won't be changed and the default value will be used.

Another option is to set the MTU in the interface's system configuration file, which makes the configuration persistent across machine reboots. Edit the interface configuration file /etc/sysconfig/network-scripts/ifcfg-[interface name] (different Linux distributions may keep this file elsewhere) and set the MTU line to the MTU size in bytes. For example, the following line in the configuration file configures the MTU to be 2000 bytes:

MTU="2000"

Partitioning in IPoIB (VLAN equivalent)

When the IPoIB driver is loaded, it creates, by default, one interface for each port of the available InfiniBand devices using the P_Key value at index 0 of the P_Key table in that port.

Configuring OpenSM to support partitions

When 'OpenSM' runs on the host, one can change its configuration file to support more partitions. The configuration file /etc/rdma/partitions.conf controls the partitions that will be configured in the subnet by OpenSM. Different versions of OpenSM may use a different default location for the partition file (the man page of the installed OpenSM shows that location). One can use the '-P' parameter to point explicitly to a specific path of this file.

Here is an example of such a configuration file which configures the default P_Key (0xffff) and another P_Key (0x8001) in the fabric:

Default=0x7fff: ALL=full ;
MyNet0=0x0001, ipoib: ALL=full ;

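The valid P_Key values in this configuration file are 0x0001-0x7fff. The most significant bit of a P_Key is the membership bit, which is set for full members; this is why the values 0x7fff and 0x0001 above, configured with full membership, appear as 0xffff and 0x8001 in the P_Key table.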
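The relation between the 15-bit value written in the partition file and the 16-bit P_Key seen on a port can be sketched as follows. The function name 'pkey_with_membership' is illustrative; 'full' and 'limited' follow the membership keywords of the OpenSM partition file, where limited is the partial membership.

```shell
# Convert a 15-bit partition number into the 16-bit P_Key value seen in a
# port's P_Key table: bit 15 is set for full members and clear for limited
# (partial) members.
# $1: partition number, e.g. 0x0001; $2: 'full' (default) or 'limited'
pkey_with_membership() {
    base=$(( $1 & 0x7fff ))
    case ${2:-full} in
        full)    printf '0x%04x\n' $(( base | 0x8000 )) ;;
        limited) printf '0x%04x\n' "$base" ;;
        *) echo "unknown membership: $2" >&2; return 1 ;;
    esac
}
```

For example, 'pkey_with_membership 0x0001' prints 0x8001 and 'pkey_with_membership 0x7fff' prints 0xffff, which are the two P_Key values of the example partition file.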

Now that the configuration file has been updated, one should restart OpenSM. This can be done using the following command line:

[root@localhost]# service opensmd restart

Note: OpenSM supports configuring specific ports with full membership, whereas other ports will be configured with partial membership. Explaining how to do so is out of the scope of this post.

Verifying the configured partitions on local port

The configured partitions for every InfiniBand device's port can be found in the sysfs. For example, the following command line prints the non-zero configured partitions in port 1 of the InfiniBand device 'mlx4_0':

[root@localhost]# cat /sys/class/infiniband/mlx4_0/ports/1/pkeys/*  | grep -v 0x0000
0xffff
0x8001
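The 'cat' command above drops the table index of every P_Key, although the index is the name of the file in that directory. The following is a small sketch that keeps it; the function name 'list_pkeys' and the SYSFS_IB variable are illustrative, and SYSFS_IB exists only so the function can be tried against a scratch directory.

```shell
# List the non-zero entries of a port's P_Key table as "<index> <P_Key>".
# $1: InfiniBand device name, $2: port number.
# SYSFS_IB is an illustrative override; it defaults to /sys/class/infiniband.
list_pkeys() {
    dir=${SYSFS_IB:-/sys/class/infiniband}/$1/ports/$2/pkeys
    [ -d "$dir" ] || { echo "no such port: $1/$2" >&2; return 1; }
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        value=$(cat "$f")
        # Skip unused table entries.
        [ "$value" = 0x0000 ] && continue
        # The file name is the index inside the P_Key table.
        echo "${f##*/} $value"
    done | sort -n
}
```

For example, 'list_pkeys mlx4_0 1' prints one line per configured partition, sorted by table index.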

Creating a network interface with a P_Key

To create an interface with a different P_Key, write the desired P_Key value into the main interface's /sys/class/net/[interface name]/create_child file. For example, the following command will create a child interface using P_Key 0x8001 for the IPoIB network interface 'ib0':

[root@localhost]# echo 0x8001 > /sys/class/net/ib0/create_child

A new network interface named 'ib0.8001', with the P_Key value 0x8001, will be created.

Note: An IPoIB network interface can be created using a P_Key value even if that P_Key value isn't configured in that port's P_Key table.

Removing a network interface with a P_Key

To remove a subinterface that was created with a specific P_Key, write this P_Key value into the main interface's /sys/class/net/[interface name]/delete_child file. For example:

[root@localhost]# echo 0x8001 > /sys/class/net/ib0/delete_child
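Creating and removing child interfaces differ only in the control file that is written, so both can be wrapped in one helper. The following is a minimal sketch; the function name 'ipoib_child', its 'add'/'del' actions and the SYSFS_NET variable are illustrative and not part of the IPoIB driver.

```shell
# Add or remove an IPoIB child interface for a P_Key by writing the P_Key to
# the parent's create_child / delete_child control file.
# $1: 'add' or 'del', $2: parent interface, $3: P_Key (e.g. 0x8001)
# SYSFS_NET is an illustrative override; it defaults to /sys/class/net.
ipoib_child() {
    action=$1
    parent=$2
    pkey=$3
    case $action in
        add) file=create_child ;;
        del) file=delete_child ;;
        *) echo "usage: ipoib_child add|del <parent> <pkey>" >&2; return 2 ;;
    esac
    path=${SYSFS_NET:-/sys/class/net}/$parent/$file
    [ -w "$path" ] || { echo "cannot write $path" >&2; return 1; }
    echo "$pkey" > "$path"
}
```

For example, 'ipoib_child add ib0 0x8001' is equivalent to the create_child command above, and 'ipoib_child del ib0 0x8001' removes that child interface again.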

Verifying that IPoIB is working

Since IPoIB provides a fully functional and working network interface, to verify that it is appropriately configured, one can use 'ping' to a remote IP address of another IPoIB network interface in the subnet and verify that there aren't any dropped packets:

[root@localhost]# ping 12.4.12.3
PING 12.4.12.3 56(84) bytes of data.
64 bytes from localhost (12.4.12.3): icmp_seq=1 ttl=64 time=0.051 ms
64 bytes from localhost (12.4.12.3): icmp_seq=2 ttl=64 time=0.055 ms
^C
--- localhost ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 999ms
rtt min/avg/max/mdev = 0.051/0.053/0.055/0.002 ms
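The "no dropped packets" check can be automated by parsing the summary line of 'ping'. The following sketch assumes the summary format shown above ("N% packet loss"); the function names 'ping_loss_percent' and 'ipoib_ping_ok' are illustrative.

```shell
# Extract the packet-loss percentage from ping's summary line read on stdin.
# Assumes a summary of the form "..., 0% packet loss, ...".
ping_loss_percent() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Ping a remote IPoIB address and succeed only if no packet was lost.
# $1: remote IP address, $2: number of echo requests (default 3)
ipoib_ping_ok() {
    loss=$(ping -c "${2:-3}" "$1" | ping_loss_percent)
    [ "$loss" = 0 ]
}
```

For example, 'ipoib_ping_ok 12.4.12.3 && echo OK' prints 'OK' only when all echo requests were answered.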

Written by: Dotan Barak on April 21, 2015. Last updated on March 10, 2023.

Comments

Stephen says: May 8, 2015

Dotan thank you VERY MUCH for taking the time to write this.

Sincerely,
Stephen

- Dotan Barak says: May 8, 2015
  
  :)
  
  Thanks
  Dotan
  
Pankaj says: September 15, 2015

Hi Dotan,
can u clarify gen2 ib stack or gen-2 rdma
was there a gen1? what is it?

- Dotan Barak says: September 21, 2015
  
  Hi.
  
  Gen2 are the libibverbs and the RDMA core which currently exists in the Linux kernel
  (all of the above are published in the OFED package).
  
  Gen1 was based on VAPI (verbs implementation released by Mellanox Technologies),
  (which was published in the IBGD package). This package support really old HCAs.
  
  You can refer to gen1 as ancient history
  :)
  
  Thanks
  Dotan
  
Yang Xia says: February 22, 2018

Hi, What if I do not have sudo permission?
Thanks a lot!
Yang

- Dotan Barak says: March 2, 2018
  
  Hi.
  
  Unless system is configured to allow this,
  you'll have a problem to lock enough memory pages required for RDMA resources.
  
  Thanks
  Dotan
  
qiuhaonan says: April 23, 2019

Hi Dotan.
If I unload the IPoIB module and run tcp applications on RDMA NICs, does it means that I use kernel tcp stack for communication?
Thanks.

- Dotan Barak says: April 23, 2019
  
  Yes.
  
  Thanks
  Dotan
  
Mahesh says: December 17, 2020

Hi Dotan,
it was nice blog but it raised my curiosity about the real use of IPoIB.
can you please throw some light on following:
1. Is there any real world application/service which uses IPoIB ?
2. Is it developed for legacy (non MPI)applications to be able to run over high speed application. if yes, any example of such legacy applications ?
3. Is it targeted for only Data center workload OR has any significance in HPC cluster/machine ?

- Dotan Barak says: February 28, 2021
  
  Hi.
  
  1. You can build a cluster with only IB adapters and use IP-based applications over IPoIB
  Or, you can use IPoIB interfaces for data fabric (for example, MPI, etc.).
  2. It is developed for *ANY* IP-based application (ping, FTP, ssh, whatever).
  Any application which opens a socket can use IPoIB interfaces.
  3. It is target to any user that has IP-based (i.e. legacy) and wants to get better performance
  (compare to the server's 10g NIC).
  
  Thanks
  Dotan
  