RDMA/RoCE Network Testing with MPI Tools

AI network infrastructure depends on the fabric delivering low latency and high bandwidth, so the RoCE fabric should be tested regularly to verify that ongoing traffic actually achieves both.

In this section, we describe how to test the RoCE network in an on-premises Kubernetes cluster.

Testing the RoCE network in Kubernetes involves the following steps:

  • Build a tool container image

We need the following tools:

– the RDMA perftest project, which provides the ib_{write,read,send}_{bw,lat} tools

– utilities from the Mellanox HCA driver package, such as show_gids

Both are shipped as distribution packages inside the Mellanox (MLNX_OFED) driver bundle.

$ cd ~/MLNX_OFED_LINUX-23.07-0.5.1.2-ubuntu20.04-x86_64/DEBS
$ ls -l | grep -E "openmpi|perftest|mpitest|iproute2|mlnx-tools"
-rw-r--r-- 1 clouduser clouduser   957268 Aug 24 17:02 mlnx-iproute2_6.3.0-1.2307050_amd64.deb
-rw-r--r-- 1 clouduser clouduser    69500 Aug 24 15:38 mlnx-tools_23.07-0.2307050_amd64.deb
-rw-r--r-- 1 clouduser clouduser   400172 Aug 24 16:42 mpitests_3.2.20-de56b6b.2307050_amd64.deb
-rw-r--r-- 1 clouduser clouduser 19481020 Aug 24 16:40 openmpi_4.1.5rc2-1.2307050_all.deb
-rw-r--r-- 1 clouduser clouduser   271036 Aug 24 15:54 perftest_23.07.0-0.25.g149fbd6.2307050_amd64.deb
$ dpkg-deb -c perftest_23.07.0-0.25.g149fbd6.2307050_amd64.deb | grep bin/ib_ 
-rwxr-xr-x root/root    248104 2015-03-29 08:58 ./usr/bin/ib_atomic_bw
-rwxr-xr-x root/root    239912 2015-03-29 08:58 ./usr/bin/ib_atomic_lat
-rwxr-xr-x root/root    248104 2015-03-29 08:58 ./usr/bin/ib_read_bw
-rwxr-xr-x root/root    239912 2015-03-29 08:58 ./usr/bin/ib_read_lat
-rwxr-xr-x root/root    252200 2015-03-29 08:58 ./usr/bin/ib_send_bw
-rwxr-xr-x root/root    248104 2015-03-29 08:58 ./usr/bin/ib_send_lat
-rwxr-xr-x root/root    248104 2015-03-29 08:58 ./usr/bin/ib_write_bw
-rwxr-xr-x root/root    239912 2015-03-29 08:58 ./usr/bin/ib_write_lat
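
Before building the image, stage the packages we need into a Docker build context directory. The directory name below is a hypothetical example; the Dockerfile shown later is placed in the same directory so that COPY . . picks everything up:

$ mkdir -p ~/rdma-tools-image && cd ~/rdma-tools-image
$ cp ~/MLNX_OFED_LINUX-23.07-0.5.1.2-ubuntu20.04-x86_64/DEBS/{mlnx-iproute2,mlnx-tools,mpitests,openmpi,perftest}_*.deb .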

We also need the NCCL tests compiled with MPI enabled, so we build them into the same image for our Kubernetes environment.
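
The Dockerfile below compiles the sources it finds under /workspace/nccl-tests. If the nccl-tests sources are not already shipped in the base image, one way to provide them is to clone the public NVIDIA nccl-tests repository into the build context (the directory name is the same hypothetical one used above):

$ cd ~/rdma-tools-image
$ git clone https://github.com/NVIDIA/nccl-tests.git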

FROM nvcr.io/nvidia/nemo:23.06
# The base image already provides the nccl-tests binaries and the mpirun command, but nccl-tests is not compiled with MPI enabled.
WORKDIR /workspace

# copy all tools
COPY . .

# ****** Step1: install tools from Mellanox packages
# install some dependency packages
RUN apt-get update && apt-get install libpci3 libmnl0 libelf1 pciutils openssh-server iputils-ping -y --no-install-recommends && \
apt-get clean && \
rm -rf /var/lib/apt/lists/*
# Install the tools from Mellanox packages
# install mlnx iproute2 with /opt/mellanox/iproute2/sbin/ip command
# install ib_read/write_lat/bw commands
# install mpi test(osu* commands)
RUN dpkg -i *.deb && rm -f *.deb
ENV PATH="${PATH}:/opt/mellanox/iproute2/sbin"

# ****** Step2: compile nccl-tests tools with MPI enabled
# build mpi enabled nccl test commands at: /workspace/nccl-tests/build
RUN cd nccl-tests && make MPI=1 MPI_HOME=/opt/hpcx/ompi/

# ****** Step3: set up sshd environment for mpirun
# mpi-operator mounts the .ssh folder from a Secret. For that to work, we need
# to disable UserKnownHostsFile to avoid write permissions.
# Disabling StrictModes avoids directory and files read permission checks.
RUN echo "    UserKnownHostsFile /dev/null" >> /etc/ssh/ssh_config && echo "    StrictHostKeyChecking no" >> /etc/ssh/ssh_config &&\
    sed -i 's/#\(StrictModes \).*/\1no/g' /etc/ssh/sshd_config
RUN mkdir /run/sshd 

In the Dockerfile above, the first and second steps install and compile the tools. The third step sets up the SSH environment that the mpirun command needs under the MPI Operator framework.

To build the image:

$ docker build . --tag gpu02:5000/ea-bignlp/ga-participants/nemofw-training-ray-rdma-tools:23.08.02-gy 
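Then push the image so the Kubernetes nodes can pull it (this assumes gpu02:5000 is a local registry reachable from every node):

$ docker push gpu02:5000/ea-bignlp/ga-participants/nemofw-training-ray-rdma-tools:23.08.02-gy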
  • Install the MPI Operator

We use the MPI Operator to set up the testing fabric. To deploy the operator:

$ kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.4.0/deploy/v2beta1/mpi-operator.yaml 
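Before continuing, it is worth checking that the operator pod is up; the upstream manifest deploys it into the mpi-operator namespace:

$ kubectl get pods -n mpi-operator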
  • Start a testing fabric

The testing fabric is an MPIJob, which consists of one launcher pod and several worker pods (a sketch of the manifest is shown after the pod listing below). To start the MPIJob:

$ kubectl apply -f mpi-antiaffinity.yaml
$ kubectl get pod -owide
NAME                           READY   STATUS    RESTARTS   AGE   IP              NODE    NOMINATED NODE   READINESS GATES
mpi-affi-test-launcher-kdx6f   1/1     Running   0          78m   172.25.69.178   gpu01   <none>           <none>
mpi-affi-test-worker-0         1/1     Running   0          78m   172.25.69.177   gpu01   <none>           <none>
mpi-affi-test-worker-1         1/1     Running   0          78m   172.25.69.179   gpu01   <none>           <none> 
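
For reference, here is a minimal sketch of what mpi-antiaffinity.yaml could look like. It is not the exact manifest used above: the RDMA device-plugin resource name, the launcher command, the anti-affinity label, and slotsPerWorker are assumptions that have to be adapted to your cluster.

$ cat > mpi-antiaffinity.yaml <<'EOF'
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: mpi-affi-test
spec:
  slotsPerWorker: 1                  # one MPI rank per worker; each rank drives 4 GPUs (-g 4)
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: gpu02:5000/ea-bignlp/ga-participants/nemofw-training-ray-rdma-tools:23.08.02-gy
            # keep the launcher alive so we can kubectl exec into it and start mpirun manually
            command: ["sleep", "infinity"]
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            app: mpi-affi-test-worker          # our own label, referenced by the anti-affinity rule
        spec:
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
              - weight: 100
                podAffinityTerm:
                  topologyKey: kubernetes.io/hostname
                  labelSelector:
                    matchLabels:
                      app: mpi-affi-test-worker
          containers:
          - name: worker
            image: gpu02:5000/ea-bignlp/ga-participants/nemofw-training-ray-rdma-tools:23.08.02-gy
            securityContext:
              capabilities:
                add: ["IPC_LOCK"]              # RDMA needs to pin (lock) memory
            resources:
              limits:
                nvidia.com/gpu: 4
                rdma/rdma_shared_device_a: 1   # hypothetical RDMA device-plugin resource name
EOF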
  • Run RoCE test

Run a write bandwidth test between two worker pods:

#*THIS IS SERVER SIDE*#

$ kubectl exec -ti mpi-affi-test-worker-0 -- bash
# ib_write_bw --report_gbits -x $(cat /etc/nccl.conf | cut -d"=" -f2)

************************************
* Waiting for client to connect... *
************************************
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 4
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x148b PSN 0xceed8e RKey 0x205900 VAddr 0x007f9c72884000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:02:02
 remote address: LID 0000 QPN 0x148d PSN 0x76d776 RKey 0x205a00 VAddr 0x007f37364ce000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:02:03
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 65536      5000             11531.98            11531.66                  0.184507
--------------------------------------------------------------------------------------- 
 #*THIS IS CLIENT SIDE*#

$ kubectl exec -ti mpi-affi-test-worker-1 -- bash
# ib_write_bw 192.168.2.2  --report_gbits -x $(cat /etc/nccl.conf | cut -d"=" -f2)
---------------------------------------------------------------------------------------
                    RDMA_Write BW Test
 Dual-port       : OFF          Device         : mlx5_0
 Number of qps   : 1            Transport type : IB
 Connection type : RC           Using SRQ      : OFF
 PCIe relax order: ON
 ibv_wr* API     : ON
 TX depth        : 128
 CQ Moderation   : 1
 Mtu             : 4096[B]
 Link type       : Ethernet
 GID index       : 8
 Max inline data : 0[B]
 rdma_cm QPs     : OFF
 Data ex. method : Ethernet
---------------------------------------------------------------------------------------
 local address: LID 0000 QPN 0x1493 PSN 0xcbf86a RKey 0x205900 VAddr 0x007f42e7e8a000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:02:03
 remote address: LID 0000 QPN 0x1494 PSN 0xce4fbe RKey 0x205a00 VAddr 0x007fdd0d213000
 GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:02:02
---------------------------------------------------------------------------------------
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
Conflicting CPU frequency values detected: 900.000000 != 3300.000000. CPU Frequency is not max.
 65536      5000             107.84             107.84             0.205690
---------------------------------------------------------------------------------------
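
The -x flag selects the RoCE GID index; here we simply reuse the index that was written into /etc/nccl.conf. To pick it manually, the show_gids utility from the mlnx-tools package lists the available GID indexes together with their IP addresses and RoCE versions, and the latency counterpart ib_write_lat works the same way as ib_write_bw (start it on one pod, then point the other pod at the server IP):

# show_gids
# ib_write_lat -x <gid-index>                 # server side
# ib_write_lat 192.168.2.2 -x <gid-index>     # client side, pointing at the server IP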
  • Run the MPI NCCL test

Finally, we run the NCCL all_reduce test with MPI: 2 processes, one per worker pod, each driving 4 GPUs (8 GPUs in total):

 $ kubectl exec -ti mpi-affi-test-launcher-kdx6f -- mpirun --allow-run-as-root --map-by node -np 2 -x PATH -x NCCL_DEBUG=WARN ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
Warning: Permanently added 'mpi-affi-test-worker-1.mpi-affi-test-worker.default.svc' (ED25519) to the list of known hosts.
Warning: Permanently added 'mpi-affi-test-worker-0.mpi-affi-test-worker.default.svc' (ED25519) to the list of known hosts.
# nThread 1 nGpus 4 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    481 on mpi-affi-test-worker-0 device  0 [0x9c] NVIDIA A40
#  Rank  1 Group  0 Pid    481 on mpi-affi-test-worker-0 device  1 [0x9d] NVIDIA A40
#  Rank  2 Group  0 Pid    481 on mpi-affi-test-worker-0 device  2 [0xa0] NVIDIA A40
#  Rank  3 Group  0 Pid    481 on mpi-affi-test-worker-0 device  3 [0xa4] NVIDIA A40
#  Rank  4 Group  0 Pid    479 on mpi-affi-test-worker-1 device  0 [0x35] NVIDIA A40
#  Rank  5 Group  0 Pid    479 on mpi-affi-test-worker-1 device  1 [0x36] NVIDIA A40
#  Rank  6 Group  0 Pid    479 on mpi-affi-test-worker-1 device  2 [0x39] NVIDIA A40
#  Rank  7 Group  0 Pid    479 on mpi-affi-test-worker-1 device  3 [0x3d] NVIDIA A40
NCCL version 2.18.3+cuda12.2
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    19.72    0.00    0.00      0    19.44    0.00    0.00      0
          16             4     float     sum      -1    20.03    0.00    0.00      0    19.46    0.00    0.00      0
          32             8     float     sum      -1    19.79    0.00    0.00      0    19.43    0.00    0.00      0
          64            16     float     sum      -1    20.28    0.00    0.01      0    20.24    0.00    0.01      0
         128            32     float     sum      -1    21.17    0.01    0.01      0    20.72    0.01    0.01      0
         256            64     float     sum      -1    20.84    0.01    0.02      0    20.87    0.01    0.02      0
         512           128     float     sum      -1    22.66    0.02    0.04      0    22.27    0.02    0.04      0
        1024           256     float     sum      -1    24.04    0.04    0.07      0    23.80    0.04    0.08      0
        2048           512     float     sum      -1    27.08    0.08    0.13      0    26.74    0.08    0.13      0
        4096          1024     float     sum      -1    29.44    0.14    0.24      0    29.08    0.14    0.25      0
        8192          2048     float     sum      -1    31.49    0.26    0.46      0    30.59    0.27    0.47      0
       16384          4096     float     sum      -1    40.77    0.40    0.70      0    40.94    0.40    0.70      0
       32768          8192     float     sum      -1    57.02    0.57    1.01      0    57.29    0.57    1.00      0
       65536         16384     float     sum      -1    87.10    0.75    1.32      0    88.36    0.74    1.30      0
      131072         32768     float     sum      -1    96.20    1.36    2.38      0    94.14    1.39    2.44      0
      262144         65536     float     sum      -1    119.1    2.20    3.85      0    117.5    2.23    3.90      0
      524288        131072     float     sum      -1    158.9    3.30    5.78      0    158.6    3.30    5.78      0
     1048576        262144     float     sum      -1    230.7    4.54    7.95      0    229.7    4.56    7.99      0
     2097152        524288     float     sum      -1    378.3    5.54    9.70      0    377.0    5.56    9.73      0
     4194304       1048576     float     sum      -1    699.7    5.99   10.49      0    696.3    6.02   10.54      0
     8388608       2097152     float     sum      -1   1283.8    6.53   11.43      0   1285.2    6.53   11.42      0
    16777216       4194304     float     sum      -1   2522.8    6.65   11.64      0   2525.0    6.64   11.63      0
    33554432       8388608     float     sum      -1   5094.7    6.59   11.53      0   5088.9    6.59   11.54      0
    67108864      16777216     float     sum      -1   9958.5    6.74   11.79      0   9943.2    6.75   11.81      0
   134217728      33554432     float     sum      -1    19668    6.82   11.94      0    19647    6.83   11.96      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 4.10502 
#
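
If NCCL does not pick the RoCE path on its own, the usual NCCL environment variables can be forwarded to the workers through mpirun -x. The values below are only illustrative; the HCA name, GID index, and interface name must match your own fabric:

$ kubectl exec -ti mpi-affi-test-launcher-kdx6f -- mpirun --allow-run-as-root --map-by node -np 2 \
    -x PATH -x NCCL_DEBUG=INFO \
    -x NCCL_IB_HCA=mlx5_0 -x NCCL_IB_GID_INDEX=4 -x NCCL_SOCKET_IFNAME=eth0 \
    ./nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 4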

For more details about the arguments used to run the test with MPI, have a look at the mpirun nccl test reference below.

References:

– NVIDIA deepops K8S RoCE performance testing

– NCCL distribution package download

– mpirun nccl test

– RDMA testing tools
