In this blog we would like to demonstrate how to use the new NVIDIA GPU Operator to deploy GPU-accelerated workloads on an OpenShift cluster.

The new GPU Operator enables OpenShift to schedule workloads that require GPUs as easily as it schedules CPU and memory for traditional, non-accelerated workloads: build a container with a GPU workload inside it, request the GPU resource when creating the pod, and OpenShift takes care of the rest. This makes deploying GPU workloads straightforward for users and administrators, since everything is managed at the cluster level rather than on the individual host machines. The GPU Operator for OpenShift helps simplify and accelerate compute-intensive ML/DL modeling tasks for data scientists, and it also helps run inference tasks across data centers, public clouds, and at the edge. Typical workloads that benefit from GPU acceleration include image and speech recognition, visual search, and several others.
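
To make the "request a GPU" idea concrete, here is a minimal sketch of a pod that asks the scheduler for one GPU. The pod name and image are purely illustrative and not part of the setup described in this post:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-sample                 # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-sample
    image: nvidia/samples:vectoradd-cuda10.2   # illustrative CUDA-enabled image
    resources:
      limits:
        nvidia.com/gpu: 1           # one GPU, exposed by the NVIDIA device plugin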

We assume that you have an OpenShift 4.x cluster deployed with some worker nodes that have GPU devices.

$ oc get no
NAME                           STATUS   ROLES    AGE     VERSION
ip-10-0-130-177.ec2.internal   Ready    worker   33m     v1.16.2
ip-10-0-132-41.ec2.internal    Ready    master   42m     v1.16.2
ip-10-0-156-85.ec2.internal    Ready    worker   33m     v1.16.2
ip-10-0-157-132.ec2.internal   Ready    master   42m     v1.16.2
ip-10-0-170-127.ec2.internal   Ready    worker   4m15s   v1.16.2
ip-10-0-174-93.ec2.internal    Ready    master   42m     v1.16.2

In order to expose what features and devices each node has to OpenShift we first need to deploy the Node Feature Discovery (NFD) Operator (see here for more detailed instructions).
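
Once the NFD Operator has been installed (through OperatorHub, for example), it is worth confirming that its pods are running before moving on. Assuming it was installed into the openshift-nfd namespace (the namespace may differ depending on how it was installed):

$ oc get pods -n openshift-nfd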

Once the NFD Operator is deployed, we can take a look at one of our nodes and compare its labels before and after the operator ran. Among the new labels describing the node's features, we see:

feature.node.kubernetes.io/pci-10de.present=true

This indicates that the node has at least one PCI device from vendor ID 0x10de, which is NVIDIA's PCI vendor ID. The labels created by the NFD Operator are what the GPU Operator uses to determine where to deploy the driver containers for the GPU(s).
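
To see these labels directly, we can describe the node and filter for the PCI feature labels. The node name below is one of the workers from the earlier listing; substitute your own GPU node:

$ oc describe node ip-10-0-170-127.ec2.internal | grep feature.node.kubernetes.io/pci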

However, before we can deploy the GPU Operator, we need to ensure that the appropriate RHEL entitlements have been created in the cluster (see here for more detailed instructions). Once the RHEL entitlements have been deployed to the cluster, we can proceed with the installation of the GPU Operator.

The GPU Operator is currently installed via a Helm chart, so make sure that you have Helm v3+ installed¹. Once Helm is installed, we can begin the GPU Operator installation.

     1. Add the Nvidia helm repo:

$ helm repo add nvidia https://nvidia.github.io/gpu-operator
"nvidia" has been added to your repositories

     2. Update the helm repo:

$ helm repo update
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "nvidia" chart repository
Update Complete. ⎈ Happy Helming!⎈

     3. Install the GPU Operator helm chart:

$ helm install --devel nvidia/gpu-operator --set platform.openshift=true,operator.defaultRuntime=crio,nfd.enabled=false --wait --generate-name

     4. Monitor deployment of GPU Operator:

$ oc get pods -n gpu-operator-resources -w

This command will watch the gpu-operator-resources namespace as the operator rolls out on the cluster. Once the installation is complete, you should see something like this in the gpu-operator-resources namespace:
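
The exact pod names, hashes, and ages will vary; the listing below is only an illustrative sketch of what a healthy rollout roughly looks like, based on the validation pods and daemonsets discussed next, not a capture from this cluster:

$ oc get pods -n gpu-operator-resources
NAME                                       READY   STATUS      RESTARTS   AGE
nvidia-container-toolkit-daemonset-xxxxx   1/1     Running     0          3m
nvidia-dcgm-exporter-xxxxx                 1/1     Running     0          3m
nvidia-device-plugin-daemonset-xxxxx       1/1     Running     0          3m
nvidia-device-plugin-validation            0/1     Completed   0          2m
nvidia-driver-daemonset-xxxxx              1/1     Running     0          3m
nvidia-driver-validation                   0/1     Completed   0          3m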

We can see that both the nvidia-driver-validation and the nvidia-device-plugin-validation pods have completed successfully, and that we have four daemonsets, each running as many pods as there are nodes labeled feature.node.kubernetes.io/pci-10de.present=true. Now we can inspect our GPU node once again.

Here we can see the latest changes to our node, whose Capacity, Allocatable, and Allocated resources sections now include a new resource called nvidia.com/gpu. Since our GPU node has only one GPU, that is reflected in the counts.
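
As an illustration, the relevant sections of oc describe node look roughly like the excerpt below (the node name is the example worker from earlier; the Allocated resources section also gains an nvidia.com/gpu row):

$ oc describe node ip-10-0-170-127.ec2.internal
...
Capacity:
  nvidia.com/gpu:  1
...
Allocatable:
  nvidia.com/gpu:  1
...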

Now that we have the NFD Operator, cluster entitlements, and the GPU Operator deployed we can assign workloads that will use the GPU resources.

Let’s begin by configuring Cluster Autoscaling for our GPU devices. This will allow us to create workloads that request GPU resources and then will automatically scale our GPU nodes up and down depending on the amount of requests pending for these devices.

The first step is to create a ClusterAutoscaler resource definition, for example:

$ cat 0001-clusterautoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10
  resourceLimits:
    maxNodesTotal: 24
    gpus:
      - type: nvidia.com/gpu
        min: 0
        max: 16
  scaleDown:
    enabled: true
    delayAfterAdd: 10m
    delayAfterDelete: 5m
    delayAfterFailure: 30s
    unneededTime: 10m
$ oc create -f 0001-clusterautoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created

Here we define the total node limit and the minimum and maximum number of nvidia.com/gpu resources that we expect the autoscaler to manage.

After we deploy the ClusterAutoscaler, we deploy the MachineAutoscaler resource that references the MachineSet that is used to scale the cluster:

$ cat 0002-machineautoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "gpu-worker-us-east-1a"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: gpu-worker-us-east-1a


The metadata name must be a unique MachineAutoscaler name, and the MachineSet name at the end of the file must match an existing MachineSet.

Looking at our cluster, we check what MachineSets are available:

$ oc get machinesets -n openshift-machine-api
NAME                                   DESIRED   CURRENT   READY   AVAILABLE   AGE
sj-022820-01-h4vrj-worker-us-east-1a   1         1         1       1           4h45m
sj-022820-01-h4vrj-worker-us-east-1b   1         1         1       1           4h45m
sj-022820-01-h4vrj-worker-us-east-1c   1         1         1       1           4h45m

In this example the third MachineSet sj-022820-01-h4vrj-worker-us-east-1c is the one that has GPU nodes.

$ oc get machineset sj-022820-01-h4vrj-worker-us-east-1c -n openshift-machine-api -o yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: sj-022820-01-h4vrj-worker-us-east-1c
  namespace: openshift-machine-api
...
spec:
  replicas: 1
...
    spec:
      providerSpec:
        value:
          instanceType: p3.2xlarge
          kind: AWSMachineProviderConfig
          placement:
            availabilityZone: us-east-1c
            region: us-east-1
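
If it is not obvious which MachineSet backs the GPU nodes, one quick check is to print the instance type of each MachineSet and look for a GPU instance type such as p3.2xlarge. The column expression below assumes the AWS provider spec layout shown above:

$ oc get machinesets -n openshift-machine-api \
    -o custom-columns=NAME:.metadata.name,INSTANCE:.spec.template.spec.providerSpec.value.instanceType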

We can create our MachineAutoscaler resource definition, which would look like this:

$ cat 0002-machineautoscaler.yaml
apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "sj-022820-01-h4vrj-worker-us-east-1c"
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1
  maxReplicas: 6
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: sj-022820-01-h4vrj-worker-us-east-1c
$ oc create -f 0002-machineautoscaler.yaml
machineautoscaler.autoscaling.openshift.io/sj-022820-01-h4vrj-worker-us-east-1c created

We can now start to deploy RAPIDS using shared storage between multiple instances. Begin by creating a new project:

$ oc new-project rapids

Assuming you have a StorageClass that provides ReadWriteMany functionality, such as OpenShift Container Storage with CephFS, we can create a PVC to attach to our RAPIDS instances (storageClassName is the name of that StorageClass):

$ cat 0003-pvc-for-ceph.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rapids-cephfs-pvc
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 25Gi
  storageClassName: example-storagecluster-cephfs


$ oc create -f 0003-pvc-for-ceph.yaml
persistentvolumeclaim/rapids-cephfs-pvc created


$ oc get pvc -n rapids
NAME                STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                    AGE
rapids-cephfs-pvc   Bound    pvc-a6ba1c38-6498-4b55-9565-d274fb8b003e   25Gi       RWX            example-storagecluster-cephfs   33s

Now that we have our shared storage deployed, we can finally deploy the RAPIDS template and create the new application inside our rapids namespace:

$ oc create -f 0004-rapids_template.yaml
template.template.openshift.io/rapids created
$ oc new-app rapids
--> Deploying template "rapids/rapids" to project rapids
     RAPIDS
     ---------
     Template for RAPIDS
     A RAPIDS pod has been created.
     * With parameters:
        * Number of GPUs=1
        * Rapids instance number=1
--> Creating resources ...
    service "rapids" created
    route.route.openshift.io "rapids" created
    pod "rapids" created
--> Success
    Access your application via route 'rapids-rapids.apps.sj-022820-01.perf-testing.devcluster.openshift.com'
    Run 'oc status' to view your app.
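
The contents of 0004-rapids_template.yaml are not reproduced here; the real template also creates the Service and Route seen above, which are omitted from this sketch. As a rough, hypothetical sketch of what such a template might contain (the image, mount path, and GPU_COUNT parameter are assumptions; only the INSTANCE parameter is confirmed by its use further below), the important parts are that the pod requests nvidia.com/gpu and mounts the shared PVC:

apiVersion: template.openshift.io/v1
kind: Template
metadata:
  name: rapids
parameters:
- name: INSTANCE              # instance number, used as -p INSTANCE=2 below
  value: "1"
- name: GPU_COUNT             # assumed parameter for "Number of GPUs"
  value: "1"
objects:
- apiVersion: v1
  kind: Pod
  metadata:
    name: rapids${INSTANCE}   # the real template names the first instance simply "rapids"
  spec:
    containers:
    - name: rapids
      image: rapidsai/rapidsai:latest          # illustrative RAPIDS image
      resources:
        limits:
          nvidia.com/gpu: "${GPU_COUNT}"       # each instance gets its own GPU
      volumeMounts:
      - mountPath: /rapids/notebooks/shared    # illustrative mount path
        name: shared-storage
    volumes:
    - name: shared-storage
      persistentVolumeClaim:
        claimName: rapids-cephfs-pvc           # the RWX PVC created earlier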

 

In a browser we can now load the route that the template created above: rapids-rapids.apps.sj-022820-01.perf-testing.devcluster.openshift.com

[Image: an example Jupyter notebook running in RAPIDS using GPUs on OpenShift]

We can also verify on the GPU node that RAPIDS is running and consuming the GPU resource.
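
One way to check (the node name below is the example worker from earlier; substitute the actual GPU node) is to look at the Allocated resources section of the node description, which should now show a pod requesting nvidia.com/gpu:

$ oc describe node ip-10-0-170-127.ec2.internal | grep -A 8 "Allocated resources"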

Given that more than one person wants to run Jupyter notebooks, let's create a second RAPIDS instance with its own dedicated GPU.

$ oc new-app rapids -p INSTANCE=2
--> Deploying template "rapids/rapids" to project rapids

    RAPIDS
    ---------
    Template for RAPIDS

     A RAPIDS pod has been created.

    * With parameters:
       * Number of GPUs=1
       * Rapids instance number=2

--> Creating resources ...
   service "rapids2" created
   route.route.openshift.io "rapids2" created
   pod "rapids2" created
--> Success
   Access your application via route 'rapids2-rapids.apps.sj-022820-01.perf-testing.devcluster.openshift.com'
   Run 'oc status' to view your app.

But we have just used the only GPU resource on our GPU node, so the new RAPIDS deployment (rapids2) is not schedulable due to insufficient GPU resources.

$ oc get pods -n rapids
NAME      READY   STATUS    RESTARTS   AGE
rapids    1/1     Running   0          30m
rapids2   0/1     Pending   0          2m44s

If we look at the event state of the rapids2 pod:

$ oc describe pod/rapids2 -n rapids
...
Events:
 Type     Reason            Age        From                Message
 ----     ------            ----       ----                -------
 Warning  FailedScheduling  <unknown>  default-scheduler   0/9 nodes are available: 9 Insufficient nvidia.com/gpu.
 Normal   TriggeredScaleUp  44s        cluster-autoscaler  pod triggered scale-up: [{openshift-machine-api/sj-022820-01-h4vrj-worker-us-east-1c 1->2 (max: 6)}]

We just need to wait for the ClusterAutoscaler and MachineAutoscaler to do their job and scale up the MachineSet, as the TriggeredScaleUp event above shows. Once the new node is created:

$ oc get no 
NAME                           STATUS   ROLES    AGE     VERSION
(old nodes)
...
ip-10-0-167-0.ec2.internal     Ready    worker   72s     v1.16.2

The new RAPIDS instance will deploy to the new node once it becomes Ready, with no user intervention.
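
To follow along, watching the pods in the rapids namespace works well; once the new node reports Ready, rapids2 should move from Pending to Running:

$ oc get pods -n rapids -w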

To summarize, the new NVIDIA GPU Operator simplifies the use of GPU resources in OpenShift clusters. In this blog we've demonstrated a multi-user RAPIDS development use case with NVIDIA GPUs. Additionally, we've used OpenShift Container Storage and the ClusterAutoscaler to automatically scale up our special resource nodes as applications request them.

As you have observed, the NVIDIA GPU Operator is already relatively easy to deploy using Helm, and work is ongoing to support deployments right from OperatorHub, simplifying this process even further.

For more information on the NVIDIA GPU Operator and OpenShift, please see the official NVIDIA documentation.

1 – Helm 3 is in Tech Preview in OpenShift 4.3, and will GA in OpenShift 4.4


About the author

Sebastian Jug, a Senior Performance and Scalability Engineer, has been working on OpenShift Performance at Red Hat since 2016. He is a software engineer and Red Hat Certified Engineer with experience in enabling Performance Sensitive Applications with devices such as GPUs and NICs. His focus is in automating, qualifying and tuning the performance of distributed systems. He has been a speaker at a number of industry conferences such as Kubecon and STAC Global.
