Single Node Kubeflow 1.7 cluster with Nvidia GPU support


In Kubeflow 1.4 on a Minikube Kubernetes Node, I described a Kubeflow/Minikube setup. Still, making GPUs available to Kubeflow is difficult if Minikube is used: Minikube only supports GPUs for selected drivers, the none driver is not recommended, and the kvm2 driver adds another layer of virtualization that I would like to avoid.

Still, Minikube has the advantage that it provides network plugins and storage provisioners out of the box. Nevertheless, I went the long and difficult road of setting up a pure Kubernetes cluster for Kubeflow, which includes Calico as network plugin and Rook/Ceph as storage provisioner. Such a setup scales easily to a multi-node cluster, since Kubernetes as well as Rook/Ceph are designed for such use cases. Only due to our limited hardware do we restrict our setup to a single node. A multi-node cluster would offer redundancy, high availability, and scalability, characteristics desired in production environments.

Our Hardware Setup

Our setup is based on an ASUS ESC4000A-E10 with

  • 2x AMD EPYC 7413 (24 cores each)
  • 512 GB RAM DDR4-3200
  • 7x Nvidia A30 GPU
  • 2x 1.92 TB SSD/NVMe 2.5" system partition as software RAID 1
  • 2x 15.3 TB NVMe M.2 data partition as redundant Ceph partitions

Our Software Setup

The most difficult part is finding a set of components that work together. So far I used:

I installed Ubuntu 22.04 with:

  1. Containerd
    • In previous installations it was difficult to find a working container engine; for now, containerd seems to work well.
  2. Kubernetes 1.25
    • Kubeflow 1.7 is tested with Kubernetes 1.24/1.25.
  3. Rook and Ceph as storage provider
    • Kubeflow uses persistent volume claims; therefore, a storage provider is required that can serve them. We use Ceph as a single-node installation.
  4. Nvidia GPU Operator to make GPUs available to notebooks
  5. Kustomize v5.0.3
    • Since Kubeflow 1.3, Kustomize manifests are used to deploy Kubeflow.
  6. Kubeflow 1.7

Software Installation

This section gives a brief summary of the commands used for installation and references to the related documentation.


Create /etc/sysctl.d/kubeflow.conf and insert

fs.inotify.max_user_instances = 1280

which seems to solve problems caused by the default inotify instance limit.
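The new limit can be activated and checked without a reboot:

```shell
# Load all sysctl configuration files, including the new kubeflow.conf
sudo sysctl --system

# Verify the value is active (should print 1280)
sysctl fs.inotify.max_user_instances
```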

Following the Kubernetes container-runtime prerequisites, load the required kernel modules and enable bridging and forwarding:

cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

# sysctl params required by setup, params persist across reboots
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

# Apply sysctl params without reboot
sudo sysctl --system
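Whether the modules are loaded and the parameters are set can be verified as follows:

```shell
# Both modules should be listed
lsmod | grep -e overlay -e br_netfilter

# All three parameters should report 1
sysctl net.bridge.bridge-nf-call-iptables net.bridge.bridge-nf-call-ip6tables net.ipv4.ip_forward
```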

Install containerd


sudo apt-get update
sudo apt-get install ca-certificates curl gnupg lsb-release
curl -fsSL | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install containerd.io

Install Kubernetes

Kubeflow 1.7 is tested with Kubernetes 1.24/1.25; we use the newer release, 1.25.

Since we use Ubuntu with systemd, configure containerd to use the systemd cgroup driver:

sudo su
containerd config default > /etc/containerd/config.toml

Set SystemdCgroup = true in section [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options] in /etc/containerd/config.toml.

and restart containerd:

sudo systemctl restart containerd
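After the restart, confirm the cgroup driver setting in the active configuration:

```shell
# Should show: SystemdCgroup = true
containerd config dump | grep SystemdCgroup
```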

Afterwards, install Kubernetes.

Prepare the software repositories

sudo apt-get install -y apt-transport-https ca-certificates curl
curl -fsSL | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-archive-keyring.gpg
echo "deb [signed-by=/etc/apt/keyrings/kubernetes-archive-keyring.gpg] kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt update

Install the Kubernetes command-line tools

Install the latest patch release of 1.25, which is 1.25.11-00.

KVER=1.25.11-00
sudo apt-get install -y kubelet=$KVER kubeadm=$KVER kubectl=$KVER
sudo apt-mark hold kubelet kubeadm kubectl

Set the packages to hold to avoid upgrades.
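The hold status can be verified with:

```shell
# Lists packages excluded from upgrades; should include kubeadm, kubectl, kubelet
apt-mark showhold
```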

Init the Cluster and install the Network Plugin Calico

Note that the control-plane (master) taint is removed for our single-node cluster, so that pods are scheduled on the master node, too.

sudo kubeadm init --pod-network-cidr= --kubernetes-version="1.25"

mkdir -p $HOME/.kube 
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

kubectl create -f
kubectl create -f
kubectl taint nodes --all

We also increase the maximum number of pods from the default of 110 by adding --max-pods=243 to ExecStart in /etc/systemd/system/kubelet.service.d/10-kubeadm.conf, since Kubeflow schedules pods for every user who has logged in at least once.
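After editing the drop-in file, reload systemd and restart the kubelet so the new limit takes effect; the advertised capacity can then be checked (single-node setup assumed):

```shell
# Reload unit files and restart the kubelet after changing --max-pods
sudo systemctl daemon-reload
sudo systemctl restart kubelet

# The node should now advertise the increased pod capacity
kubectl get nodes -o jsonpath='{.items[0].status.capacity.pods}'
```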

Install the storage provider

Kubeflow requires a storage provider. For our setup, we use Ceph deployed by Rook and provide two spare disks that are initialized by Ceph. Ensure that the disks do not contain any filesystem; otherwise, Ceph will not use them.

To wipe a filesystem, see the Rook documentation on cleaning up disks.
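As a sketch, leftover filesystem signatures and partition tables can be removed with wipefs and sgdisk; /dev/nvme1n1 is a placeholder here, so double-check the device name of the spare disk before running this:

```shell
# CAUTION: irreversibly destroys all data on the given device
sudo wipefs --all /dev/nvme1n1
sudo sgdisk --zap-all /dev/nvme1n1
```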

Typically, Rook/Ceph is used in a multi-node cluster for high availability. Here, we make a single-node deployment. Example YAML files are already included in the Rook repository; the single-node-specific configurations are in cluster-test.yaml and storageclass-test.yaml. For details, see below.

Install a single node Rook/Ceph cluster:

git clone --single-branch --branch master
cd rook/deploy/examples
kubectl create -f crds.yaml -f common.yaml -f operator.yaml
kubectl create -f cluster-test.yaml

and create the storage class used for the persistent volume claims:

cd csi/rbd
kubectl create -f storageclass-test.yaml

Make this class the default class:

kubectl patch storageclass rook-ceph-block -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Check that the OSD pods rook-ceph-osd-... are running with

kubectl -n rook-ceph get pods
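Once the OSDs are running, provisioning can be smoke-tested with a throwaway claim (the name test-pvc is arbitrary):

```shell
# Create a small claim against the rook-ceph-block storage class
cat << EOF | kubectl create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: rook-ceph-block
  resources:
    requests:
      storage: 1Gi
EOF

# The claim should reach status Bound after a few seconds
kubectl get pvc test-pvc

# Clean up
kubectl delete pvc test-pvc
```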

Check the Rook/Ceph storage health status with the toolbox:

# Change back to deploy/examples in the rook folder
cd ../..
kubectl create -f toolbox.yaml
ROOK_CLUSTER_NAMESPACE=rook-ceph
TOOLS_POD=$(kubectl -n $ROOK_CLUSTER_NAMESPACE get pod -l "app=rook-ceph-tools" -o jsonpath='{.items[*].metadata.name}')
kubectl -n $ROOK_CLUSTER_NAMESPACE exec -it $TOOLS_POD -- ceph status

Add GPU support

Different ways exist to make GPUs available to Kubernetes.

We select the Nvidia GPU operator, which handles the installation of drivers and additional required libraries.


curl -fsSL -o && chmod 700 && ./
helm repo add nvidia && helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator

We had to restart the server so that the GPU operator initialized successfully.

Test with

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.2.1"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

and check the logs:

kubectl logs cuda-vectoradd
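Independently of the sample pod, the node should advertise the GPUs to the scheduler (7 in our setup); clean up the test pod afterwards:

```shell
# Number of GPUs the node reports to the scheduler
kubectl get nodes -o jsonpath='{.items[0].status.capacity.nvidia\.com/gpu}'

# Remove the test pod
kubectl delete pod cuda-vectoradd
```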

Install Kustomize v5.0.3

tar xvzf kustomize_v5.0.3_linux_amd64.tar.gz
sudo mv kustomize /usr/bin

Install Kubeflow 1.7



Since Kubeflow 1.3, Kustomize manifests are used to deploy Kubeflow.

Download the manifests, change into the directory, and check out the latest release:

git clone
cd manifests
git checkout v1.7.0

For the next commands stay in this directory.

If desired, set a non-default password for the default user. First, create a password hash with Python:

sudo apt install python3-passlib python3-bcrypt
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'

and set the hash option in the file ./common/dex/base/config-map.yaml to the generated password hash:

vi ./common/dex/base/config-map.yaml

Bugfix: the login gets stuck in an infinite loop due to an outdated image of authservice. To fix it, change the referenced image tag in common/oidc-authservice/base/kustomization.yaml from

newTag: e236439

to

newTag: 0c4ea9a


Install all components with one command:

while ! kustomize build example | awk '!/well-defined/' | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

Now we should have a running single-node Kubeflow cluster; verify with kubectl get pods -A that all pods are in the state Running or Completed.

Login via SSH port forwarding

So far, we do not make the web interface publicly available and use SSH port forwarding instead.

On the Kubeflow machine, expose port 8080:

kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80 --address=

On the connecting client forward the local port 8080 to the remote port 8080:

ssh -L 8080:localhost:8080 <remote-user>@<kubeflow-machine>

Open a web browser on the client and open localhost:8080.

For better usability, use a load balancer.

Setup the Loadbalancer with TLS

Adapted from the MetalLB and Istio documentation.

Prerequisite: create a valid TLS certificate beforehand.

kubectl apply -f

Specify an IPAddressPool in pool.yaml with a single address (x.x.x.x is the address to be used):

apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
  - x.x.x.x/32

Specify an advertisement in advert.yaml:

apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: example
  namespace: metallb-system

and apply both:

kubectl apply -f pool.yaml
kubectl apply -f advert.yaml
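Check that both resources were created:

```shell
# Both the pool and the advertisement should be listed
kubectl -n metallb-system get ipaddresspools.metallb.io,l2advertisements.metallb.io
```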

Have your certificate cert.pem and the private key file key.pem ready and add them as a secret:

kubectl create -n istio-system secret tls kubeflowcrt --key=key.pem --cert=cert.pem

Now adapt the Istio Kubeflow gateway with

kubectl -n kubeflow edit gateway kubeflow-gateway

set the spec section to

spec:
  selector:
    istio: ingressgateway
  servers:
  - hosts:
    - '*'
    port:
      name: http
      number: 80
      protocol: HTTP
    tls:
      httpsRedirect: true
  - hosts:
    - '*'
    port:
      name: https
      number: 443
      protocol: HTTPS
    tls:
      credentialName: kubeflowcrt
      mode: SIMPLE

Change the type of the istio-ingressgateway service to LoadBalancer and get the IP

kubectl -n istio-system  patch service istio-ingressgateway -p '{"spec": {"type": "LoadBalancer"}}'
kubectl -n istio-system get svc istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0]}'

This should be the IP you have configured for metallb.
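A quick reachability test can be run from any machine in the network (-k skips certificate verification, which is useful for self-signed certificates):

```shell
# Port 80 should redirect to HTTPS, port 443 should answer with the login redirect
curl -kI http://x.x.x.x
curl -kI https://x.x.x.x
```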

Add REDIRECT_URL to the data section of the oidc-authservice-parameters configmap, where x.x.x.x is your IP address (or your DNS name, if you have one):

kubectl -n istio-system edit configmap oidc-authservice-parameters

apiVersion: v1
kind: ConfigMap
data:
  OIDC_AUTH_URL: /dex/auth
  OIDC_PROVIDER: http://dex.auth.svc.cluster.local:5556/dex
  OIDC_SCOPES: profile email groups
  PORT: '"8080"'
  REDIRECT_URL: https://x.x.x.x/login/oidc

Also append https://x.x.x.x/login/oidc to the redirectURIs list in the dex configmap:

kubectl -n auth edit configmap dex

Roll out and restart the services:

kubectl -n istio-system rollout restart statefulset authservice
kubectl -n auth rollout restart deployment dex

Now Kubeflow should be accessible via https://x.x.x.x.

LDAP integration

Change the fields and the filter below according to your LDAP setup.

Get the current config:

kubectl get configmap dex -n auth -o jsonpath='{.data.config\.yaml}' > dex-config.yaml

Add the LDAP connector:

cat << EOF >> dex-config.yaml
connectors:
- type: ldap
  id: ldap
  name: LDAP
  config:
    host: <LDAP host>
    usernamePrompt: username
    userSearch:
      baseDN: dc=<domain>,dc=<>
      filter: (&(objectClass=posixAccount)(|(uid=<username>)))
      username: uid
      idAttr: uid
      emailAttr: mail
      nameAttr: givenName
EOF

Apply the config:

kubectl create configmap dex --from-file=config.yaml=dex-config.yaml -n auth --dry-run=client -oyaml | kubectl apply -f -

Restart Dex

kubectl rollout restart deployment dex -n auth
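Confirm that the connector is present in the live configmap:

```shell
# The LDAP connector entry should appear in the config
kubectl -n auth get configmap dex -o jsonpath='{.data.config\.yaml}' | grep -A 2 'type: ldap'
```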

For details, see the Dex LDAP connector documentation.

Disable unused ports in istio

Istio opens a few ports by default. If not required, you can disable ports other than 80 and 443 by modifying the service:

kubectl edit services istio-ingressgateway -n istio-system
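Before removing entries, it helps to list the currently exposed ports:

```shell
# Name and port of every entry in the ingress gateway service
kubectl -n istio-system get svc istio-ingressgateway \
  -o jsonpath='{range .spec.ports[*]}{.name}{"\t"}{.port}{"\n"}{end}'
```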