One of the most interesting new features in Kubernetes is its support for Graphics Processing Units (GPUs). This makes it possible to obtain significant performance benefits when deploying certain types of applications (for example, machine learning applications like TensorFlow) in a Kubernetes cluster.
Support for scheduling GPU workloads in Kubernetes is still fairly new, so getting all the pieces working together usually requires a fair amount of trial and error. To ease this process, this blog post will walk you through setting up a Kubernetes GPU cluster using Google Container Engine (GKE) and then deploying a container image with GPU support to that cluster.
NOTE: This blog post will assume that you have the gcloud and kubectl command-line tools installed and configured for use with GKE. In case you don't, check out our Kubernetes tutorial for a detailed walkthrough.
Step 1: Start a GPU-enabled cluster
The first step is to launch a GPU-enabled Kubernetes cluster. Here are a few important things to remember:
- As of this writing, Google Compute Engine supports NVIDIA P100 and K80 GPUs only in certain zones, so remember to use a supported zone when spinning up the cluster. See the list of available zones and restrictions.
- As of this writing, GPU support in Kubernetes is an alpha feature, so your GKE cluster needs to be an "alpha cluster". Alpha clusters are different from regular GKE clusters in several respects: they are not covered by the GKE SLA, they cannot be upgraded, and they are automatically deleted after 30 days. Find out more about alpha clusters.
- GPUs are quota-restricted on Google Compute Engine, so ensure that you have adequate quota available before proceeding. Learn more about quotas.
- The NVIDIA GPU drivers need to be separately installed on each node of a Kubernetes cluster. Therefore, when launching the cluster, select an image that has the necessary build and/or packaging tools for driver installation. Find out more.
Here's an example command to start a GPU-enabled alpha cluster on GKE. This cluster will run in the us-east1-c zone and will be composed of three nodes, each running Ubuntu with one NVIDIA Tesla K80 GPU attached:
$ gcloud alpha container clusters create my-gpu-cluster \
    --enable-cloud-logging --enable-cloud-monitoring \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --zone us-east1-c --machine-type n1-standard-2 \
    --enable-kubernetes-alpha --image-type UBUNTU \
    --num-nodes 3
Step 2: Install the GPU drivers on each cluster node
Once the cluster has started, the next step is to log into each cluster node individually using SSH and install the NVIDIA CUDA libraries, which include the necessary NVIDIA GPU drivers. The Google Cloud Console offers browser-based SSH access to each node. Once logged in, run the commands below:
$ curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
$ sudo -s
$ dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
$ apt-get update && apt-get install cuda -y
Once the libraries have been installed on each host, check that the NVIDIA GPU has been detected with the nvidia-smi tool:
$ nvidia-smi
Step 3: Configure each kubelet to use the NVIDIA GPU
Next, configure each kubelet to use the NVIDIA GPU, as shown below:
$ NVIDIA_GPU_NAME=$(nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 | sed -e 's/ /-/g')
$ source /etc/default/kubelet
$ KUBELET_OPTS="$KUBELET_OPTS --node-labels='alpha.kubernetes.io/nvidia-gpu-name=$NVIDIA_GPU_NAME'"
$ echo "KUBELET_OPTS=$KUBELET_OPTS" > /etc/default/kubelet
$ systemctl restart kubelet.service
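The sed substitution above converts the GPU's product name into a value that is legal in a Kubernetes node label by replacing spaces with hyphens. You can see the transformation in isolation with a sample name (the name "Tesla K80" is assumed here for illustration, since nvidia-smi is only available on the GPU nodes themselves):

```shell
# Sample GPU name as reported by nvidia-smi (assumed for illustration)
GPU_NAME="Tesla K80"

# Replace spaces with hyphens so the value is legal in a node label
NVIDIA_GPU_NAME=$(echo "$GPU_NAME" | sed -e 's/ /-/g')

echo "$NVIDIA_GPU_NAME"
```

With this sample input, the label applied to the node would be alpha.kubernetes.io/nvidia-gpu-name=Tesla-K80.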
Step 4: Deploy and test a GPU-enabled container
At this point, you're ready to deploy your GPU-enabled container. This example will use a TensorFlow image with GPU support, and the following Kubernetes deployment file (based on this example by Frederic Tausch):
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: tensorflow-gpu
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: tensorflow-gpu
    spec:
      volumes:
      - hostPath:
          path: /usr/lib/nvidia-384/bin
        name: bin
      - hostPath:
          path: /usr/lib/nvidia-384
        name: lib
      - hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so.1
        name: libcuda-so-1
      - hostPath:
          path: /usr/lib/x86_64-linux-gnu/libcuda.so
        name: libcuda-so
      containers:
      - name: tensorflow
        image: tensorflow/tensorflow:latest-gpu
        ports:
        - containerPort: 8888
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - mountPath: /usr/local/nvidia/bin
          name: bin
        - mountPath: /usr/local/nvidia/lib
          name: lib
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so.1
          name: libcuda-so-1
        - mountPath: /usr/lib/x86_64-linux-gnu/libcuda.so
          name: libcuda-so
---
apiVersion: v1
kind: Service
metadata:
  name: tensorflow-gpu-service
  labels:
    app: tensorflow-gpu
spec:
  selector:
    app: tensorflow-gpu
  ports:
  - port: 8888
    protocol: TCP
    nodePort: 30061
  type: LoadBalancer
---
There are two important points to note about the deployment above:
- The NVIDIA libraries on the host are exposed to the Kubernetes pod using the hostPath directive. Remember to update this path if your drivers are installed to a different location.
- NVIDIA GPU resources used by the container are specified in the containers section using the special resource name alpha.kubernetes.io/nvidia-gpu.
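As an aside, the node label applied in Step 3 can also be used to pin pods to a specific GPU model. This is a hypothetical addition to the pod spec, not part of the deployment above; the label value Tesla-K80 is assumed from the K80 nodes created in Step 1:

```yaml
# Hypothetical fragment: schedule this pod only on nodes that were
# labelled with a K80 GPU by the kubelet configuration in Step 3
spec:
  nodeSelector:
    alpha.kubernetes.io/nvidia-gpu-name: Tesla-K80
```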
TIP: If you'd like to build your own custom GPU-enabled TensorFlow image, start with the Bitnami Docker TensorFlow Serving image and then follow these instructions to add GPU support to it.
Deploy the GPU-enabled TensorFlow container using the command below:
$ kubectl create -f deployment.yaml
You should now be able to see the running pod with kubectl get pods. Executing the nvidia-smi command within the pod should display the same output as running it directly on a cluster node, demonstrating that the pod is able to access the GPU:
$ kubectl exec -it <pod-name> -- nvidia-smi
As this example has illustrated, deploying an application with GPU support in a Kubernetes cluster is not as simple as a typical deployment. There are definitely a few additional hoops you have to jump through, but once you've done that, you have a scalable, flexible solution for all your GPU-accelerated workloads. Plus, remember that GPU support in Kubernetes is under active development, so expect things to get significantly easier in future!
Want to reach the next level in Kubernetes?