Single Node Kubeflow cluster with Nvidia GPU support
Kubeflow
In Kubeflow 1.4 on a Minikube Kubernetes Node I described how to set up Kubeflow on Minikube. However, providing GPUs to Kubeflow is difficult when Minikube is used: Minikube supports GPUs only for selected drivers. The none driver is not recommended, see https://minikube.sigs.k8s.io/docs/drivers/none/, and the kvm2 driver adds another layer of virtualization that I would like to avoid.
Still, Minikube has the advantage that it provides network plugins and storage provisioners out of the box. Nevertheless, I took the long and difficult road and set up a plain Kubernetes cluster for Kubeflow, with Calico as network plugin and Rook/Ceph as storage provisioner. Such a setup scales easily to a multi-node cluster, since Kubernetes as well as Rook/Ceph are designed for that use case. Only due to our limited hardware do we restrict the setup to a single node. A multi-node cluster would offer redundancy, high availability, and scalability, characteristics desired in production environments.
Our Hardware Setup
Our setup is based on an ASUS ESC4000A-E10 with
- AMD Epyc CPU 7443P
- 128 GB RAM DDR4-3200
- 1x Nvidia RTX 3090 GPU
- 1 TB NVMe M.2 SSD system partition
- 3.84 TB NVMe 2.5" SSD data partition
Currently, we deploy only one GPU for machine learning tasks, although this server can hold up to four cards.
Our Software Setup
The most difficult part is finding a set of components that work together. So far I used:
- Ubuntu 20.04 LTS with HWE kernel; the HWE kernel improves TensorFlow benchmark performance compared to the default 5.4 kernel. The HWE kernel can easily be installed with:
sudo apt install --install-recommends linux-generic-hwe-20.04-edge
- Docker and Containerd
- I tried Podman and CRI-O, too, but the cert-manager-webhook deployment does not start; using Docker and containerd solves the problem.
- Kubernetes 1.21
- API changes in Kubernetes 1.22 prevent the use of more recent versions. Version 1.21 is still supported; Kubeflow is officially tested with 1.19.
- Rook and Ceph as storage provider
- Kubeflow uses persistent volume claims, therefore a storage provider is required that can serve them.
- The current version is v1.8.2.
- Nvidia GPU Operator to make GPUs available to notebooks
- Kustomize v3.2.0
- Since Kubeflow 1.3, kustomize manifests are used to deploy Kubeflow. However, only versions up to 3.2.0 are supported, as described at https://github.com/kubeflow/manifests#installation; newer versions do not work.
- Kubeflow 1.4
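Once all installation steps below are done, the chosen versions can be cross-checked. A minimal sketch (output formats differ slightly between releases):
uname -r                  # should show an HWE kernel, e.g. 5.11 or newer on Ubuntu 20.04
docker --version          # Docker 20.10.x
kubeadm version -o short  # v1.21.x
kubectl version --short   # client and server v1.21.x
kustomize version         # v3.2.0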
Software Installation
This section gives a brief summary of the commands used for installation and references to the related documentation.
Install docker and containerd
See https://docs.docker.com/engine/install/ubuntu/
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker $USER && newgrp docker
The current dockerd version is 20.10.12.
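A quick smoke test confirms that the current user can run containers without sudo (hello-world is Docker's standard test image):
docker run --rm hello-world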
Install Kubernetes
Kubeflow 1.4 is tested with Kubernetes 1.19; we use the newer release 1.21. More recent Kubernetes versions do not work due to API changes.
Since we use Ubuntu with systemd, configure Docker to use the systemd cgroup driver and the recommended storage driver, see https://kubernetes.io/docs/setup/production-environment/container-runtimes/#docker:
sudo mkdir -p /etc/docker
cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF
and restart docker
sudo systemctl enable docker
sudo systemctl daemon-reload
sudo systemctl restart docker
afterwards install Kubernetes.
Prepare the software repositories
sudo apt-get install -y apt-transport-https ca-certificates curl
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
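The available 1.21 patch releases can be listed with (the result changes as new patches are published):
apt-cache madison kubeadm | grep 1.21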
Install kubernetes command line tools
Install the latest patch release of 1.21, which is 1.21.8.
KVER=1.21.8-00
sudo apt-get install -y kubelet=$KVER kubeadm=$KVER kubectl=$KVER
sudo apt-mark hold kubelet kubeadm kubectl
The packages are put on hold to avoid unintended upgrades.
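The hold status can be verified with:
apt-mark showhold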
Init the Cluster and install the Network Plugin Calico
Note that the master taint is removed for our single-node cluster, so that pods are scheduled on the master node, too.
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --kubernetes-version="1.21.8"
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml
kubectl taint nodes --all node-role.kubernetes.io/master-
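Before continuing, it is worth checking that the node becomes Ready and the Calico pods come up (the calico-system namespace is created by the Tigera operator; pod names vary):
kubectl get nodes -o wide
kubectl get pods -n calico-system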
Install the storage provider
Kubeflow requires a storage provider. For our setup, we use Ceph deployed by Rook. Therefore, we provide two spare disks that are initialized by Ceph. Ensure that the disks do not have any filesystem, otherwise Ceph does not use them.
To wipe a filesystem see https://rook.github.io/docs/rook/latest/ceph-teardown.html.
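For example, a disk can be wiped as follows. The device name /dev/nvme1n1 is only an assumption for our hardware; double-check with lsblk before wiping anything:
lsblk                               # identify the spare data disk first
sudo wipefs --all /dev/nvme1n1      # assumption: replace with your data disk
sudo sgdisk --zap-all /dev/nvme1n1  # clear the partition table (sgdisk comes with the gdisk package)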
Typically, Rook/Ceph is used in a multi-node cluster for high availability; here, we deploy on a single node. Example yaml-files are already included in https://github.com/rook/rook.git. The single-node specific configurations are in cluster-test.yaml and storageclass-test.yaml, see below for details.
Install a single node Rook/Ceph cluster:
git clone --single-branch --branch master https://github.com/rook/rook.git
cd rook/deploy/examples
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
kubectl create -f cluster-test.yaml
and create the storage class used for the persistent volume claims:
cd csi/rbd
kubectl create -f storageclass-test.yaml
Make this class the default class:
kubectl patch storageclass rook-ceph-block -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
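kubectl should now list rook-ceph-block as the default class:
kubectl get storageclass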
Add GPU support
Different installation options exist, as described at https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html.
We select the Nvidia GPU operator, which handles the installation of drivers and additional required libraries.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 && chmod 700 get_helm.sh && ./get_helm.sh
helm repo add nvidia https://nvidia.github.io/gpu-operator && helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
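Before running a test workload, check that the operator pods are up and the GPU is advertised as an allocatable node resource. A small sketch, using the custom-columns query from the Kubernetes GPU scheduling documentation:
kubectl get pods -n gpu-operator
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"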
Test with
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
and see the logs
kubectl logs cuda-vectoradd
Install Kustomize v3.2
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
chmod +x kustomize_3.2.0_linux_amd64
sudo mv kustomize_3.2.0_linux_amd64 /usr/bin
sudo ln -s /usr/bin/kustomize_3.2.0_linux_amd64 /usr/bin/kustomize
Install Kubeflow 1.4
See https://github.com/kubeflow/manifests#installation
Preparation
Since Kubeflow 1.3, kustomize manifests are used to deploy Kubeflow.
Download the manifests, change into the directory, and check out the latest release:
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.4.1
For the next commands stay in this directory.
If desired, set a non-default password for the default user. First, create a password hash with Python:
sudo apt install python3-passlib python3-bcrypt
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
and set the hash option in the file ./common/dex/base/config-map.yaml to the generated password hash:
vi ./common/dex/base/config-map.yaml
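The relevant part of the Dex configuration in the 1.4 manifests looks roughly like this (shortened excerpt; verify against your checkout):
staticPasswords:
- email: user@example.com
  hash: <paste the generated bcrypt hash here>
  username: user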
Install
Install all components with a single command, see https://github.com/kubeflow/manifests#install-with-a-single-command:
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
Now we should have a running single-node Kubeflow cluster. Verify with kubectl get pods -A that all pods are in the state Running or Completed.
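A convenience sketch to list only the pods that are not yet done (Completed pods have the phase Succeeded):
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded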
Login via SSH port forwarding
So far, we do not make the web interface publicly available; instead, we use SSH port forwarding.
On the Kubeflow machine expose the port 8080:
kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80 --address=0.0.0.0
On the connecting client forward the local port 8080 to the remote port 8080:
ssh -L 8080:localhost:8080 <remote-user>@<kubeflow-machine>
Open a web browser on the client and go to localhost:8080.
ToDo
- Adding users, see https://youtu.be/AjNbcMGl8Y4
- LDAP integration, see https://cloudadvisors.net/2020/09/23/ldap-active-directory-with-kubeflow-within-tkg/
- Make the Kubeflow web interface publicly available, e.g. using MetalLB as described at https://www.kubeflow.org/docs/distributions/nutanix/install-kubeflow/