Exposing AMD GPU to Kubernates

Current Test System

At this guide I use this following spec :

AMD GPU Instinct MI210
ROCm 6.0

$ apt show rocm-libs -a
Package: rocm-libs
Version: 6.0.0.60000-91~20.04
 Ubuntu 22.04

 $ lsb_release -a
 Distributor ID: Ubuntu
 Description: Ubuntu 20.04.6 LTS
 Release: 20.04
 Codename: focal

Docker 25

$ docker -v
Docker version 25.0.1, build 29cf629

Kubernates, Single Node Rancher System

You can refer to this link for kubernates singlenode installation

https://blog.alphabravo.io/posts/2021/single-node-rke2-pt1/

root@amdserver:/home/psi-admin/k8s-device-plugin# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6+rke2r1", GitCommit:"d921bc6d1810da51177fbd0ed61dc811c5228097", GitTreeState:"clean", BuildDate:"2021-10-28T16:49:09Z", GoVersion:"go1.16.9b7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.10+rke2r1", GitCommit:"0fa26aea1d5c21516b0d96fea95a77d8d429912e", GitTreeState:"clean", BuildDate:"2024-01-17T21:34:35Z", GoVersion:"go1.20.13 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.27) exceeds the supported minor version skew of +/-1

Reference

http://www.bytefold.com/sharing-gpu-in-kubernetes/

https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/

Install kubernates plug-in for ROCM

Install the plug in using daemonset

git clone https://github.com/ROCm/k8s-device-plugin.git

Deploy the daemonset (it is a plug in installed on all amd gpu node)

cd k8s-device-plugin
kubectl create -f k8s-ds-amdgpu-dp.yaml

its Done,, just like that!

Verify the daemonset, its in kube-system namespaces :

root@amdserver:/home/psi-admin/k8s-device-plugin# kubectl get pods --all-namespaces | grep amdgpu
kube-system amdgpu-device-plugin-daemonset-vc2gm 1/1 Running 0 17h

Now lets try it !

cd examples/pod
kubectl apply -f alexnet-gpu.yaml

Verify the pods and check if it works

root@amdserver:/home/psi-admin/k8s-device-plugin/example/pod# kubectl get pods
NAME READY STATUS RESTARTS AGE
alexnet-tf-gpu-pod 1/1 Running 0 6s

Finally, Check the Result !

kubectl logs alexnet-tf-gpu-pod

It will run a simple benchmark test:

It works!

Leave a Reply

Your email address will not be published. Required fields are marked *