Installing Kubeflow 1.4 on a Minikube Kubernetes Node
Why Kubeflow
Kubeflow is a software framework for machine learning workflows, covering training, optimization, and serving.
It integrates various components for these tasks; one of the best-known is Jupyter Notebooks, but it also includes useful tools to automate feature selection, network topology optimization, and hyperparameter optimization.
Furthermore, it provides a web interface, multi-tenancy through namespaces, and scalability by using Kubernetes as the infrastructure framework.
It is well suited for large deployments and is available for various public clouds such as Amazon Web Services, Google Cloud, Azure, and others.
Here, we target a single-host deployment to make Kubeflow available in a small lab environment.
For previous versions, there was a description of an installation on Minikube, i.e. a single-host Kubernetes environment. This description was removed, and the installation process was heavily modified to use Kubernetes manifests. Local deployments using Arrikto MiniKF or Microk8s Kubeflow are described at https://www.kubeflow.org/docs/started/installing-kubeflow/. Still, all of these installations have some flaws I encountered during installation, which are also mentioned in the video tutorial https://youtu.be/C9Cl8EcqnfE.
Therefore, I share my installation process for a local machine here. The difficulties lie especially in selecting the correct Kubernetes version and a working container driver and container runtime.
This blog post follows the description given at https://github.com/kubeflow/manifests#installation, but fills the gaps by selecting concrete software packages.
Our Hardware Setup
Our setup is based on an ASUS ESC4000A-E10 with
- AMD Epyc CPU 7443P
- 128 GB RAM DDR4-3200
- 1x Nvidia RTX 3090 GPU
- 1 TB NVMe M.2 SSD system partition
- 3.84 TB NVMe 2.5" SSD data partition
Currently, we deploy only one GPU for machine learning tasks, although the server can host up to four cards. This description does not enable GPU usage in the setup; that may be added in the future.
Our Software Setup
The most difficult part is finding a set of components that work together. So far we use:
- Ubuntu 20.04 LTS with the HWE kernel 5.11; the HWE kernel improves TensorFlow benchmark performance compared to the default kernel 5.4 (a quick check is sketched after this list). The HWE kernel can easily be installed with:
sudo apt install --install-recommends linux-generic-hwe-20.04
- Docker and Containerd
- I tried Podman and CRI-O, too, but the cert-manager-webhook deployment does not start; using Docker and containerd solves the problem.
- Minikube 1.22
- most recent version
- Kustomize v3.2.0
- as described at https://github.com/kubeflow/manifests#installation, new versions do not work
- Kubernetes 1.21
- API changes in Kubernetes 1.22 prevent its use
- Kubeflow 1.4
- most recent version
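After installing the HWE kernel and rebooting, you can verify which kernel is running; it should report a 5.11 kernel:
uname -r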
Specific adaptations for our environment
To provide sufficient space for the containers, we set the data-root of docker to our data drive.
Change the docker.service file
sudo vi /usr/lib/systemd/system/docker.service
and add the --data-root dockerd option to the ExecStart systemd option:
...
ExecStart=/usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --data-root=/data/docker
...
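For the change to take effect, reload the systemd configuration and restart Docker. The active root directory can then be checked with docker info (the --format template below is a sketch; plain docker info also prints a "Docker Root Dir" line):
sudo systemctl daemon-reload
sudo systemctl restart docker
docker info --format '{{ .DockerRootDir }}'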
Install docker and containerd
See https://docs.docker.com/engine/install/ubuntu/
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg lsb-release
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
sudo usermod -aG docker $USER && newgrp docker
The current dockerd version is 20.10.12.
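To check that Docker works for the unprivileged user, a quick smoke test is, for example:
docker run --rm hello-world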
Install minikube
See https://minikube.sigs.k8s.io/docs/start/
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube_latest_amd64.deb
sudo dpkg -i minikube_latest_amd64.deb
Create a Minikube cluster with the latest Kubernetes version < 1.22
minikube start --kubernetes-version=1.21.8
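Kubeflow is resource-hungry, so depending on your machine you may want to assign CPUs, memory, and disk explicitly. The flags are standard Minikube options; the concrete values below are only an assumption for our hardware:
minikube start --kubernetes-version=1.21.8 --driver=docker --cpus=16 --memory=64g --disk-size=200g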
For convenience, set an alias for the kubectl command
alias kubectl="minikube kubectl --"
You may also create the file ~/.bash_aliases and add the previous line to make the alias permanent.
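A minimal sketch of making the alias permanent and verifying access to the cluster:
echo 'alias kubectl="minikube kubectl --"' >> ~/.bash_aliases
source ~/.bash_aliases
kubectl get nodes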
Install Kustomize
wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
chmod +x kustomize_3.2.0_linux_amd64
sudo mv kustomize_3.2.0_linux_amd64 /usr/bin
sudo ln -s /usr/bin/kustomize_3.2.0_linux_amd64 /usr/bin/kustomize
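Verify that the symlink resolves and the expected version is used; the output should report 3.2.0:
kustomize version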
Install Kubeflow 1.4
See https://github.com/kubeflow/manifests#installation
Download the manifests, change into the directory, and check out the latest release
git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.4.1
For the next commands stay in this directory.
If desired, set a non-default password for the default user. First, create a password hash with Python:
sudo apt install python3-passlib python3-bcrypt
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
and set the hash option in the file ./common/dex/base/config-map.yaml.
vi ./common/dex/base/config-map.yaml
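In the 1.4 manifests, the relevant part of the Dex configuration looks roughly like the following excerpt (the surrounding fields may differ per release); replace the hash value with the hash generated above:
staticPasswords:
- email: user@example.com
  hash: <your-bcrypt-hash>
  username: user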
Install
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
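The apply loop usually needs a few retries until all custom resource definitions are registered. Afterwards, it can take several minutes until all pods are up; their status can be watched, for example, with:
kubectl get pods -A
Wait until the pods in the cert-manager, istio-system, auth, knative-eventing, knative-serving, and kubeflow namespaces are Running before logging in.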
Login
So far, we do not make the web interface publicly available; instead, we use SSH port forwarding.
On the Kubeflow machine, expose port 8080:
kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80 --address=0.0.0.0
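Note that kubectl port-forward stops when the connection is interrupted; for a lab setup it can simply be restarted in a loop (a minimal sketch):
while true; do kubectl port-forward -n istio-system service/istio-ingressgateway 8080:80 --address=0.0.0.0; done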
On the connection client forward the local port 8080 to the remote port 8080:
ssh -L 8080:localhost:8080 <remote-user>@<kubeflow-machine>
Open a web browser on the client machine and go to localhost:8080.
ToDo
- Add GPU to K8s https://github.com/NVIDIA/k8s-device-plugin
- Adding users, see https://youtu.be/AjNbcMGl8Y4
- LDAP integration, see https://cloudadvisors.net/2020/09/23/ldap-active-directory-with-kubeflow-within-tkg/
- Make the Kubeflow web interface publicly available using e.g. MetalLB