Note: The following procedure can also be used to deploy the NVIDIA GPU Operator, since it has the same prerequisites as the SRO. Docs are here.
The job of the Performance and Latency Sensitive Applications (PSAP) team at Red Hat is optimizing Red Hat OpenShift, the industry’s most comprehensive enterprise Kubernetes platform, to run compute-intensive enterprise workloads and HPC applications effectively and efficiently. As a team of Linux and performance enthusiasts who are always pushing the limits of what is possible with the latest and greatest upstream technologies, we are operating at the forefront of innovation with compelling proof-of-concept (POC) implementations and advanced deployment scenarios.
Driver containers are a novel way of including device-specific kernel modules (kmods) within an OCI container. Since these kmods depend closely on the kernel version (and kernel headers), they need to be (re)compiled on the target host. The Special Resource Operator (SRO for short) was designed for this purpose.
However, the SRO needs access to RHEL source code from the target host. This is fully automated in environments that can reach the internet, and therefore the RHEL source code, but disconnected environments require some additional configuration.
This blog post details the deployment of SRO/driver containers in disconnected (true disconnected and proxy) environments.
You must have access to the internet to obtain the data that populates the mirror repository. In this procedure, you will place the mirror registry on a bastion host that has access to both your network and the internet. If you do not have access to a bastion host, use the method that best fits your restrictions to bring the contents of the mirror registry into your restricted network. You must also have a Red Hat Enterprise Linux (RHEL) server on your network to use as the registry host. The registry host MUST be able to access the internet, or at least be allowed to reach the URLs mentioned throughout this guide.
The cluster must be properly configured and entitled as seen in:
Part 1 - Setting the Mirror Registry and OLM Catalog
Step 1: Create a Mirror Registry
Note: You must ensure that your registry hostname is registered in the DNS used by the cluster and that it resolves to the expected IP address. Otherwise, pulls will fail because the x509 certificate is issued for that hostname and not for any other name, such as a public name.
Step 2: Authenticate the Mirror Registry
[Bastion host/Local host]
Now, let’s allow our cluster to reference images from the mirror registry we just built.
[Optional] To authenticate your mirror registry, you need to configure additional trust stores for image registry access in your OCP cluster. You can create a ConfigMap in the openshift-config namespace and reference its name in additionalTrustedCA in the image.config.openshift.io resource. This provides additional CAs that should be trusted when contacting external registries.
The ConfigMap key is the hostname + port of a registry for which this CA is to be trusted, and the base64-encoded certificate is the value for each additional registry CA to trust.
You can configure additional CAs with the following procedure:
$ oc create configmap registry-config --from-file=<external_registry_address>=ca.crt -n openshift-config
$ oc edit image.config.openshift.io cluster
Note: If your <external_registry_address> contains a port such as ':5000', write it as '..5000' to avoid this error:
error: "xxxxxxxxxx::5000" is not a valid key name for a ConfigMap: a valid config key must consist of alphanumeric characters, '-', '_' or '.' (e.g. 'key.name', or 'KEY_NAME', or 'key-name', regex used for validation is '[-._a-zA-Z0-9]+')
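The key rewrite described above can be wrapped in a small helper. This is a minimal sketch, not part of the original procedure: the function names `registry_key` and `trust_registry_ca` are hypothetical, and `registry.example.com:5000` and `ca.crt` are placeholder values. The `oc patch` call is an alternative to editing the resource interactively with `oc edit`.

```shell
# Hypothetical helper: turn "host:port" into the "host..port" form
# that is accepted as a ConfigMap key.
registry_key() {
  printf '%s\n' "$1" | sed 's/:/../'
}

# Hypothetical wrapper: create the ConfigMap and point the cluster image
# configuration at it. $1 = registry address (host:port), $2 = CA file.
trust_registry_ca() {
  key="$(registry_key "$1")"
  oc create configmap registry-config \
    --from-file="${key}=$2" -n openshift-config
  oc patch image.config.openshift.io/cluster --type=merge \
    --patch '{"spec":{"additionalTrustedCA":{"name":"registry-config"}}}'
}
```

With a logged-in `oc` session, this could be invoked as `trust_registry_ca registry.example.com:5000 ca.crt`.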
Step 3: Building an Operator Catalog Image
Note: For now, we need to specify the architecture we want to mirror into the registry using the oc CLI. To do this in both steps, pass the flag --filter-by-os='linux/amd64':
oc adm catalog build --filter-by-os='linux/amd64' ….
oc adm catalog mirror --filter-by-os='linux/amd64' ….
This prevents a known error caused by the docker registry not supporting multi-architecture manifests.
[Optional] Mirror Images for Helm Deployment
After deploying the mirror image registry in step 2:
Mirror the images listed at: https://github.com/NVIDIA/gpu-operator/blob/master/bundle/manifests/gpu-operator.clusterserviceversion.yaml#L128
- name: gpu-operator-image
- name: dcgm-exporter-image
- name: container-toolkit-image
- name: driver-image
- name: device-plugin-image
- name: gpu-feature-discovery-image
- name: cuda-sample-image
- name: dcgm-init-container-image
Then follow this guide: https://docs.openshift.com/container-platform/4.6/openshift_images/image-configuration.html to configure the `registrySources` of OpenShift to pull those images from the mirror registry.
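One way to copy the listed images into the mirror is `oc image mirror`. The sketch below is illustrative only: the helper names `mirror_dest` and `mirror_image` are hypothetical, the mirror address is a placeholder, and the exact image references and tags must be taken from the ClusterServiceVersion linked above rather than from this example.

```shell
# Hypothetical helper: replace the source registry host of an image
# reference with the mirror registry address.
# $1 = source image reference, $2 = mirror registry (host:port)
mirror_dest() {
  printf '%s/%s\n' "$2" "${1#*/}"
}

# Hypothetical wrapper: copy one image into the mirror registry.
mirror_image() {
  oc image mirror "$1" "$(mirror_dest "$1" "$2")"
}
```

For example, `mirror_image <source_image> registry.example.com:5000` would be run once per image in the list, from a host that can reach both the source registry and the mirror.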
Part 2 - Setting the YUM Mirror and Driver Container
Note: Part 2 is only needed for SRO or the NVIDIA GPU Operator; the NFD operator does not need this step.
To set up a YUM mirror, we can either use Red Hat Satellite or create a custom mirror.
The packages we need to host in our mirror are:
These packages are needed to run the driver container, as can be seen at: https://gitlab.com/nvidia/container-images/driver/-/blob/master/rhel8/nvidia-driver .
Note: You can get the $HOST_ARCH and $GPU_NODE_KERNEL_VERSION from `oc describe node` on one of the nodes.
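As a sketch of how these values might be extracted, the helpers below parse the `Kernel Version:` line printed by `oc describe node`. The function names are hypothetical, and the assumption that the architecture is the last dot-separated field of the kernel version (for example `x86_64`) should be checked against your own nodes.

```shell
# Read `oc describe node` output on stdin and print the kernel version.
kernel_version() {
  awk -F': +' '/Kernel Version:/ { print $2; exit }'
}

# Print the trailing architecture suffix of a kernel version string.
# Assumption: the version ends in ".<arch>", as RHEL kernels usually do.
kernel_arch() {
  printf '%s\n' "${1##*.}"
}
```

Usage could look like `oc describe node <node_name> | kernel_version`, with the result passed to `kernel_arch`.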
With the YUM mirror in place, the next step is to add the repository configuration to the driver container:
1. First, we create a ConfigMap containing the repository configuration file (my_mirror.repo)
oc create configmap yum-repos-d --from-file /path/to/my_mirror.repo
2. Add the mirror repository to the operator buildConfig. For SRO this information must be added to: https://github.com/openshift-psap/special-resource-operator/blob/master/config/recipes/nvidia-gpu/manifests/1000-state-driver.yaml
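For step 1, the repository configuration file is a standard YUM `.repo` file. The following is a minimal sketch of generating one: the function name `write_repo_file`, the repository ID `my-mirror`, and the base URL are all placeholders, and `gpgcheck=0` is an assumption that only makes sense for a trusted internal mirror.

```shell
# Hypothetical helper: write a minimal YUM repository definition.
# $1 = output path, $2 = base URL of the mirror
write_repo_file() {
  cat > "$1" <<EOF
[my-mirror]
name=Local mirror for driver container builds
baseurl=$2
enabled=1
gpgcheck=0
EOF
}
```

After running, for instance, `write_repo_file /path/to/my_mirror.repo http://mirror.example.com/rhel8`, the resulting file is the one passed to `oc create configmap` in step 1.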
For NVIDIA GPU Operator versions before 1.4, follow the same instructions as SRO. For v1.4 and above (currently 1.5.2):
1. Create a ConfigMap with the custom repo list:
oc create configmap repo-config -n gpu-operator-resources --from-file /path/to/my_mirror.repo
2. Specify repoConfig in values.yaml (if deploying with Helm), or edit the driver.repoConfig entry in the ClusterPolicy CR
3. Deploy the operator via Helm
4. Verify that the ConfigMap is mounted successfully in the driver container
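Steps 2 and 4 can be sketched as follows. This assumes that `driver.repoConfig` takes the fields `configMapName` and `destinationDir`, which should be confirmed against the values.yaml of the operator version you deploy. The function names, and the choice of `/etc/yum.repos.d` as the destination, are illustrative.

```shell
# Hypothetical helper: write a values.yaml fragment for the driver repoConfig.
# $1 = output path, $2 = ConfigMap name, $3 = directory inside the container
write_repo_values() {
  cat > "$1" <<EOF
driver:
  repoConfig:
    configMapName: $2
    destinationDir: $3
EOF
}

# Hypothetical helper: list the repo files visible inside a driver pod.
# $1 = name of the driver container pod
check_repo_mount() {
  oc -n gpu-operator-resources exec "$1" -- ls /etc/yum.repos.d
}
```

A file produced with `write_repo_values values.yaml repo-config /etc/yum.repos.d` can be passed to your Helm install command with `-f`. Once the driver pod is running, `check_repo_mount <driver_pod_name>` should list `my_mirror.repo`.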
Now you are ready to deploy the SRO / GPU Operator to your disconnected OCP cluster.
We believe that Linux containers and container orchestration engines, most notably Kubernetes, are well positioned to power future software applications spanning multiple industries and verticals. Red Hat has embarked on a mission to enable some of the most critical workloads, like machine learning, deep learning, artificial intelligence, big data analytics, high-performance computing, and telecommunications, with Red Hat OpenShift. The PSAP team is supporting this mission across multiple footprints (public, private, and hybrid cloud), industries, and application types.
- Although the documentation does not always mention it, it is a good idea to start by deploying a medium-sized instance to host the registry.