Exposing AMD GPU to Kubernetes
Current Test System
This guide uses the following setup:
AMD GPU Instinct MI210
ROCm 6.0
$ apt show rocm-libs -a
Package: rocm-libs
Version: 6.0.0.60000-91~20.04
Ubuntu 20.04
$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 20.04.6 LTS
Release: 20.04
Codename: focal
Docker 25
$ docker -v
Docker version 25.0.1, build 29cf629
Kubernetes, single-node Rancher (RKE2) system
For a single-node Kubernetes installation, you can refer to this link:
https://blog.alphabravo.io/posts/2021/single-node-rke2-pt1/
root@amdserver:/home/psi-admin/k8s-device-plugin# kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.6+rke2r1", GitCommit:"d921bc6d1810da51177fbd0ed61dc811c5228097", GitTreeState:"clean", BuildDate:"2021-10-28T16:49:09Z", GoVersion:"go1.16.9b7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.10+rke2r1", GitCommit:"0fa26aea1d5c21516b0d96fea95a77d8d429912e", GitTreeState:"clean", BuildDate:"2024-01-17T21:34:35Z", GoVersion:"go1.20.13 X:boringcrypto", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.21) and server (1.27) exceeds the supported minor version skew of +/-1
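The warning appears because the kubectl client (1.21) is six minor versions behind the server (1.27). RKE2 ships its own kubectl that matches the server version. The lines below are a minimal sketch of switching to it, assuming the default RKE2 install paths (/var/lib/rancher/rke2/bin/kubectl and /etc/rancher/rke2/rke2.yaml); adjust them if your installation differs.

```shell
# Assumption: default RKE2 install locations on the server node.
RKE2_KUBECTL=/var/lib/rancher/rke2/bin/kubectl

if [ -x "$RKE2_KUBECTL" ]; then
  # Point kubectl at the kubeconfig that RKE2 generates.
  export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
  # Client and server versions should now match, so the warning goes away.
  "$RKE2_KUBECTL" version
else
  echo "RKE2 kubectl not found at $RKE2_KUBECTL"
fi
```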
Reference
http://www.bytefold.com/sharing-gpu-in-kubernetes/
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
Install the Kubernetes plugin for ROCm
Install the plugin as a DaemonSet. Start by cloning the repository:
git clone https://github.com/ROCm/k8s-device-plugin.git
Deploy the DaemonSet, which runs the plugin on every AMD GPU node:
cd k8s-device-plugin
kubectl create -f k8s-ds-amdgpu-dp.yaml
That's it, just like that!
Verify the DaemonSet; it runs in the kube-system namespace:
root@amdserver:/home/psi-admin/k8s-device-plugin# kubectl get pods --all-namespaces | grep amdgpu
kube-system amdgpu-device-plugin-daemonset-vc2gm 1/1 Running 0 17h
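A running plugin pod does not by itself prove that the GPU was detected. The plugin advertises GPUs to the scheduler as the amd.com/gpu resource, so it is worth checking that the node lists a non-zero count. This is a minimal sketch; has_amd_gpu is a hypothetical helper name, and the pattern assumes the usual `kubectl describe nodes` layout, where the Capacity and Allocatable sections contain a line such as "amd.com/gpu: 1".

```shell
# Hypothetical helper: reads `kubectl describe nodes` text on stdin and
# succeeds only if it contains a non-zero amd.com/gpu count.
has_amd_gpu() {
  grep -Eq 'amd\.com/gpu:[[:space:]]+[1-9]'
}

# Run the check only where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  if kubectl describe nodes | has_amd_gpu; then
    echo "amd.com/gpu is advertised"
  else
    echo "no amd.com/gpu resource found; check the plugin pod logs"
  fi
fi
```

If the resource is missing, `kubectl logs -n kube-system amdgpu-device-plugin-daemonset-vc2gm` (using the pod name from your own cluster) is the first place to look.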
Now let's try it!
cd example/pod
kubectl apply -f alexnet-gpu.yaml
Verify that the pod is running:
root@amdserver:/home/psi-admin/k8s-device-plugin/example/pod# kubectl get pods
NAME READY STATUS RESTARTS AGE
alexnet-tf-gpu-pod 1/1 Running 0 6s
Finally, check the result:
kubectl logs alexnet-tf-gpu-pod
The pod runs a simple benchmark test.
It works!
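To use the GPU from your own workload, the pod has to request the amd.com/gpu resource in its resource limits. The following is a minimal sketch of such a manifest, not the example shipped in the repository. The file name gpu-test-pod.yaml, the pod name gpu-test-pod, and the ubuntu:20.04 image are all assumptions chosen for illustration.

```shell
# Write a hypothetical test manifest; all names and the image are placeholders.
cat > gpu-test-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod
spec:
  restartPolicy: Never
  containers:
  - name: gpu-test
    image: ubuntu:20.04
    # List the GPU device nodes the plugin is expected to expose to the container.
    command: ["sh", "-c", "ls -l /dev/kfd /dev/dri"]
    resources:
      limits:
        amd.com/gpu: 1   # request one GPU from the device plugin
EOF
```

After `kubectl apply -f gpu-test-pod.yaml`, the output of `kubectl logs gpu-test-pod` should list /dev/kfd and the entries under /dev/dri if the GPU was allocated to the container.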