Single Node Kubeflow cluster with Nvidia GPU support
Kubeflow
In Kubeflow 1.4 on a Minikube Kubernetes Node I described how to set up Kubeflow on Minikube. However, providing GPUs to Kubeflow is difficult when Minikube is used: Minikube supports GPUs only for selected drivers. The none driver is not recommended, see https://minikube.sigs.k8s.io/docs/drivers/none/, and the kvm2 driver adds another layer of virtualization that I would like to avoid.
Still, Minikube has the advantage that it provides network plugins and storage provisioners out of the box. Nevertheless, I took the long and difficult road and set up a plain Kubernetes cluster for Kubeflow, with Calico as network plugin and Rook/Ceph as storage provisioner. Such a setup scales easily to a multi-node cluster, since Kubernetes as well as Rook/Ceph are designed for that use case. Only due to our limited hardware do we restrict the setup to a single node. A multi-node cluster would offer redundancy, high availability, and scalability, characteristics desired in production environments.
Our Hardware Setup
Our setup is based on an ASUS ESC4000A-E10 with
- AMD Epyc CPU 7443P
- 128 GB RAM DDR4-3200
- 1x Nvidia RTX 3090 GPU
- 1 TB NVMe M.2 SSD system partition
- 3.84 TB NVMe 2.5" SSD data partition
Currently, we deploy only one GPU for machine learning tasks, although this server can hold up to four cards.
Our Software Setup
The most difficult part is finding a set of components that work together. So far I used:
- Ubuntu 20.04 LTS with HWE kernel; the HWE kernel improves TensorFlow benchmark performance compared to the default 5.4 kernel. The HWE kernel can easily be installed with:
sudo apt install --install-recommends linux-generic-hwe-20.04-edge
- Docker and Containerd
- I tried Podman and CRI-O, too, but the cert-manager-webhook deployment does not start; using Docker and containerd solves the problem.
- Kubernetes 1.21
- API changes in Kubernetes 1.22 prevent the use of more recent versions. Version 1.21 is still supported; Kubeflow is officially tested with 1.19.
- Rook and Ceph as storage provider
- Kubeflow uses persistent volume claims, therefore a storage provider is required that can serve them.
- The current version is v1.8.2.
- Nvidia GPU Operator to make GPUs available to notebooks
- Kustomize v3.2.0
- Since Kubeflow 1.3, kustomize manifests are used to deploy Kubeflow. However, only versions up to 3.2.0 are supported, as described at https://github.com/kubeflow/manifests#installation; newer versions do not work.
- Kubeflow 1.4
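Once all installation steps below are done, the chosen versions can be cross-checked. A minimal sketch (output formats differ slightly between releases):
uname -r                  # should show an HWE kernel, e.g. 5.11 or newer on Ubuntu 20.04
docker --version          # Docker 20.10.x
kubeadm version -o short  # v1.21.x
kubectl version --short   # client and server v1.21.x
kustomize version         # v3.2.0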
Software Installation
This section gives a brief summary of the commands used for installation and references to the related documentation.
Install docker and containerd
See https://docs.docker.com/engine/install/ubuntu/
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker $USER && newgrp docker
The current dockerd version is 20.10.12.
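A quick smoke test confirms that the current user can run containers without sudo (hello-world is Docker's standard test image):
docker run --rm hello-world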
Install Kubernetes
Kubeflow 1.4 is tested with Kubernetes 1.19; we use the newer release 1.21. More recent Kubernetes versions do not work due to API changes.
Since we use Ubuntu with systemd, configure Docker to use the systemd cgroup driver and the recommended storage driver, see https://kubernetes.io/docs/setup/production-environment/container-runtimes/#docker:
sudo mkdir -p /etc/docker
cat <<EOF | sudo tee /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m"
  },
  "storage-driver": "overlay2"
}
EOF
and restart docker
sudo systemctl enable docker
sudo systemctl daemon-reload
sudo systemctl restart docker
afterwards install Kubernetes.
Prepare the software repositories
sudo apt-get install -y apt-transport-https ca-certificates curl
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update
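The available 1.21 patch releases can be listed with (the result changes as new patches are published):
apt-cache madison kubeadm | grep 1.21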
Install kubernetes command line tools
Install the latest patch release of 1.21, which is 1.21.8.
KVER=1.21.8-00
sudo apt-get install -y kubelet=$KVER kubeadm=$KVER kubectl=$KVER
sudo apt-mark hold kubelet kubeadm kubectl
The packages are put on hold to avoid unintended upgrades.
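The hold status can be verified with:
apt-mark showhold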
Init the Cluster and install the Network Plugin Calico
Note that the master taint is removed for our single-node cluster, so that pods are scheduled on the master node, too.
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --kubernetes-version="1.21.8"
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl create -f https://docs.projectcalico.org/manifests/tigera-operator.yaml
kubectl create -f https://docs.projectcalico.org/manifests/custom-resources.yaml
kubectl taint nodes --all node-role.kubernetes.io/master-
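Before continuing, it is worth checking that the node becomes Ready and the Calico pods come up (the calico-system namespace is created by the Tigera operator; pod names vary):
kubectl get nodes -o wide
kubectl get pods -n calico-system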
Install the storage provider
Kubeflow requires a storage provider. For our setup, we use Ceph deployed by Rook. Therefore, we provide two spare disks that are initialized by Ceph. Ensure that the disks do not have any filesystem, otherwise Ceph does not use them.
To wipe a filesystem see https://rook.github.io/docs/rook/latest/ceph-teardown.html.
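For example, a disk can be wiped as follows. The device name /dev/nvme1n1 is only an assumption for our hardware; double-check with lsblk before wiping anything:
lsblk                               # identify the spare data disk first
sudo wipefs --all /dev/nvme1n1      # assumption: replace with your data disk
sudo sgdisk --zap-all /dev/nvme1n1  # clear the partition table (sgdisk comes with the gdisk package)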
Typically, Rook/Ceph is used in a multi-node cluster for high availability; here, we deploy on a single node. Example yaml-files are already included in https://github.com/rook/rook.git. The single-node specific configurations are in cluster-test.yaml and storageclass-test.yaml, see below for details.
Install a single node Rook/Ceph cluster:
git clone --single-branch --branch master https://github.com/rook/rook.git
cd rook/deploy/examples
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
kubectl create -f cluster-test.yaml
and create the storage class used for the persistent volume claims:
cd csi/rbd
kubectl create -f storageclass-test.yaml
Make this class the default class:
kubectl patch storageclass rook-ceph-block -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
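kubectl should now list rook-ceph-block as the default class:
kubectl get storageclass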
Add GPU support
Different installation options exist, as described at https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html.
We select the Nvidia GPU operator, which handles the installation of drivers and additional required libraries.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 && chmod 700 get_helm.sh && ./get_helm.sh
helm repo add nvidia https://nvidia.github.io/gpu-operator && helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator
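Before running a test workload, check that the operator pods are up and the GPU is advertised as an allocatable node resource. A small sketch, using the custom-columns query from the Kubernetes GPU scheduling documentation:
kubectl get pods -n gpu-operator
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"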
Test with
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
and see the logs
kubectl logs cuda-vectoradd
Install Kustomize v3.2
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
chmod +x kustomize_3.2.0_linux_amd64
sudo mv kustomize_3.2.0_linux_amd64 /usr/bin
sudo ln -s /usr/bin/kustomize_3.2.0_linux_amd64 /usr/bin/kustomize
Install Kubeflow 1.4
See https://github.com/kubeflow/manifests#installation
Preparation
Since Kubeflow 1.3, kustomize manifests are used to deploy Kubeflow.
Download the manifests, change into the directory, and check out the latest release:
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.4.1
For the next commands stay in this directory.
If desired, set a non-default password for the default user. First, create a password hash with Python:
sudo apt install python3-passlib python3-bcrypt
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
and set the hash option in the file ./common/dex/base/config-map.yaml to the generated password hash:
vi ./common/dex/base/config-map.yaml
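The relevant part of the Dex configuration in the 1.4 manifests looks roughly like this (shortened excerpt; verify against your checkout):
staticPasswords:
- email: user@example.com
  hash: <paste the generated bcrypt hash here>
  username: user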
Install
Install all components with a single command, see https://github.com/kubeflow/manifests#install-with-a-single-command:
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
Now we should have a running single-node Kubeflow cluster. Verify with kubectl get pods -A that all pods are in the state Running or Completed.
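A convenience sketch to list only the pods that are not yet done (Completed pods have the phase Succeeded):
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded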
Login via SSH port forwarding
So far, we do not make the web interface publicly available; instead, we use SSH port forwarding.
On the Kubeflow machine expose the port 8080:
kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80 --address=0.0.0.0
On the connecting client forward the local port 8080 to the remote port 8080:
ssh -L 8080:localhost:8080 <remote-user>@<kubeflow-machine>
Open a web browser on the client and go to localhost:8080.
ToDo
- Adding users, see https://youtu.be/AjNbcMGl8Y4
- LDAP integration, see https://cloudadvisors.net/2020/09/23/ldap-active-directory-with-kubeflow-within-tkg/
- Make the Kubeflow web interface publicly available, e.g. using MetalLB as described at https://www.kubeflow.org/docs/distributions/nutanix/install-kubeflow/