This blog explains how to use local NVMe disks that are present on the vSphere hypervisors with OpenShift Container Storage (OCS). The disks are forwarded directly to the VMs to keep latency low. This feature is available as a Tech Preview in OpenShift Container Storage 4.3.

Another deployment option is to use the NVMe devices as a VMware datastore, which allows them to be shared with other VMs. That second option is not discussed here.

Environment

Tested on VMware vSphere 6.7 U3 with the latest patches installed as of February 26, 2020, and local NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) drives.

OpenShift 4.3.1 running on Red Hat Enterprise Linux CoreOS 43.81.202002032142.0

"nodeInfo": {
   "kernelVersion": "4.18.0-147.3.1.el8_1.x86_64",
   "osImage": "Red Hat Enterprise Linux CoreOS 43.81.202002032142.0 (Ootpa)",
   "containerRuntimeVersion": "cri-o://1.16.2-15.dev.rhaos4.3.gita83f883.el8",
   "kubeletVersion": "v1.16.2",
   "kubeProxyVersion": "v1.16.2",
   "operatingSystem": "linux",
   "architecture": "amd64"
},

quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3

Preparing the disks

The NVMe disks must not be used for anything else.

When checking the disks, they should appear as in Figure 1: attached, but “Not consumed”.

Figure 1: NVMe disk is attached, but not consumed

Click on the available NVMe drive and note its preferred path, shown under the path selection policy. In my example it reads:

Path Selection Policy Fixed (VMware) - Preferred Path (vmhba2:C0:T0:L0)

Now make sure the SSH service is started. In the host configuration screen, go to System → Services, find the “SSH” service in the list, and make sure it is in the Running state.

Figure 2: The SSH service needs to be running

Connect to the vSphere host via SSH. Use the root user and the password you set during the installation. Once connected, execute:

# lspci | grep NVMe
0000:af:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba1]
0000:b0:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba2]
0000:b1:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba3]
0000:b2:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba4]
0000:d8:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba5]
0000:d9:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba6]

Identify the adapter you noted earlier (in our case vmhba2) and note the PCI location, which is the first block on the line (in this case 0000:b0:00.0).
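
You can also filter the lspci output directly for the adapter name, since ESXi appends the vmkernel adapter in brackets:

# lspci | grep vmhba2
0000:b0:00.0 Non-Volatile memory controller: Intel Corporation NVMe Datacenter SSD [3DNAND] 1.6TB 2.5" U.2 (P4600) [vmhba2]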

Go back to the vCenter UI and, still in the host configuration, scroll to “Hardware” → “PCI Devices”. In that view, click “Configure Passthrough”. In the list you will find the disk you identified earlier by its PCI address.

Figure 3: Configure PCI passthrough for NVMe disk

Afterwards, you will see the new disk listed as Available (pending) for passthrough, and you will be prompted to restart the hypervisor. Reboot the hypervisor now.

Figure 4: NVMe has been added to passthrough devices, but the hypervisor has not been rebooted yet

Figure 5: NVMe is Available after Hypervisor reboot

After the hypervisor has been rebooted, the NVMe disk should be available, just like in Figure 5.

Adding the disks to the VM

Now that the NVMe disk is prepared, we have to add it to the VM. For this, the VM has to be powered down. Once the VM is off, open the VM settings and add the following three items:

  • Add an NVMe controller
    • This is optional, but should speed up storage requests in the VM
  • Add a PCI device
    • Note that the VM needs to be scheduled on the host where your PCI device is present
  • Add a Hard Disk (the default 16GB capacity is fine)
    • This will be used for the Ceph Monitor filesystem

Afterwards, your VM settings should look similar to Figure 6.

Figure 6: VM settings after NVMe has been added

Expand the New PCI device entry (as shown in Figure 6) and click the “Reserve all memory” button. Close the settings and power on the VM.

You can verify that the NVMe has been successfully added by running lsblk on the VM:

[core@compute-2 ~]$ lsblk
NAME                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   60G  0 disk
|-sda1                         8:1    0  384M  0 part /boot
|-sda2                         8:2    0  127M  0 part /boot/efi
|-sda3                         8:3    0    1M  0 part
`-sda4                         8:4    0 59.5G  0 part
 `-coreos-luks-root-nocrypt 253:0    0 59.5G  0 dm   /sysroot
sdb                            8:16   0   16G  0 disk
nvme0n1                      259:0    0  1.5T  0 disk
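
If you want to double-check that this block device really is the passed-through physical NVMe (and not just another virtual disk), you can also print the device model. This is only an optional sanity check; the model string reported will depend on your hardware (an Intel P4600 in our case):

[core@compute-2 ~]$ lsblk -d -o NAME,SIZE,MODEL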

Creating PVs with the local disks

To create PVs that we can eventually use with OCS, we will use the local-storage operator.

➜  website git:(master) ✗ oc get no
NAME              STATUS   ROLES    AGE   VERSION
compute-0         Ready    worker   44h   v1.16.2
compute-1         Ready    worker   44h   v1.16.2
compute-2         Ready    worker   44h   v1.16.2
compute-3         Ready    worker   44h   v1.16.2
compute-4         Ready    worker   44h   v1.16.2
compute-5         Ready    worker   44h   v1.16.2
control-plane-0   Ready    master   44h   v1.16.2
control-plane-1   Ready    master   44h   v1.16.2
control-plane-2   Ready    master   44h   v1.16.2

Apply the necessary labels to the nodes that will later be used by OCS

➜  website git:(master) ✗ oc label node compute-0 topology.rook.io/rack=rack0
node/compute-0 labeled
➜  website git:(master) ✗ oc label node compute-1 topology.rook.io/rack=rack1
node/compute-1 labeled
➜  website git:(master) ✗ oc label node compute-2 topology.rook.io/rack=rack2
node/compute-2 labeled
➜  website git:(master) ✗ oc label node compute-0 "cluster.ocs.openshift.io/openshift-storage="
node/compute-0 labeled
➜  website git:(master) ✗ oc label node compute-1 "cluster.ocs.openshift.io/openshift-storage="
node/compute-1 labeled
➜  website git:(master) ✗ oc label node compute-2 "cluster.ocs.openshift.io/openshift-storage="
node/compute-2 labeled

Verify that the node labels have been applied as expected

➜  website git:(master) ✗ oc get node -l topology.rook.io/rack
NAME        STATUS   ROLES    AGE   VERSION
compute-0   Ready    worker   44h   v1.16.2
compute-1   Ready    worker   44h   v1.16.2
compute-2   Ready    worker   44h   v1.16.2
➜  website git:(master) ✗ oc get node -l cluster.ocs.openshift.io/openshift-storage
NAME        STATUS   ROLES    AGE   VERSION
compute-0   Ready    worker   44h   v1.16.2
compute-1   Ready    worker   44h   v1.16.2
compute-2   Ready    worker   44h   v1.16.2

➜  website git:(master) ✗ oc new-project local-storage
Now using project "local-storage" on server [...]

Now go to the OpenShift web UI and install the “local-storage” operator from OperatorHub. Make sure to select the “local-storage” namespace as the install target.
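
If you prefer the CLI over the web UI, the same installation can be done by creating an OperatorGroup and a Subscription. The manifest below is only a sketch: the channel and catalog source names are assumptions for OpenShift 4.3 and may differ in your cluster, so verify them against what OperatorHub shows you.

cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: local-storage-operatorgroup
  namespace: local-storage
spec:
  targetNamespaces:
  - local-storage
---
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: local-storage-operator
  namespace: local-storage
spec:
  channel: "4.3"                      # assumed channel for OCP 4.3, verify in OperatorHub
  name: local-storage-operator
  source: redhat-operators            # assumed catalog source
  sourceNamespace: openshift-marketplace
EOF

Either way, the operator pod should come up in the local-storage namespace: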

➜  website git:(master) ✗ oc get po
NAME                                      READY   STATUS    RESTARTS   AGE
local-storage-operator-77f887bfd9-t9lx7   1/1     Running   0          4m43s

Verify which disk names are used on your machines

➜  website git:(master) ✗ for i in $(seq 4 6); do ssh core@10.70.56.9$i lsblk; done

Warning: Permanently added '10.70.56.94' (ECDSA) to the list of known hosts.

NAME                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   60G  0 disk
|-sda1                         8:1    0  384M  0 part /boot
|-sda2                         8:2    0  127M  0 part /boot/efi
|-sda3                         8:3    0    1M  0 part
`-sda4                         8:4    0 59.5G  0 part
 `-coreos-luks-root-nocrypt 253:0    0 59.5G  0 dm   /sysroot
sdb                            8:16   0   16G  0 disk
nvme1n1                      259:0    0  1.5T  0 disk
Warning: Permanently added '10.70.56.95' (ECDSA) to the list of known hosts.
NAME                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   60G  0 disk
|-sda1                         8:1    0  384M  0 part /boot
|-sda2                         8:2    0  127M  0 part /boot/efi
|-sda3                         8:3    0    1M  0 part
`-sda4                         8:4    0 59.5G  0 part
 `-coreos-luks-root-nocrypt 253:0    0 59.5G  0 dm   /sysroot
sdb                            8:16   0   16G  0 disk
nvme0n1                      259:0    0  1.5T  0 disk
Warning: Permanently added '10.70.56.96' (ECDSA) to the list of known hosts.
NAME                         MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                            8:0    0   60G  0 disk
|-sda1                         8:1    0  384M  0 part /boot
|-sda2                         8:2    0  127M  0 part /boot/efi
|-sda3                         8:3    0    1M  0 part
`-sda4                         8:4    0 59.5G  0 part
 `-coreos-luks-root-nocrypt 253:0    0 59.5G  0 dm   /sysroot
sdb                            8:16   0   16G  0 disk
nvme0n1                      259:0    0  1.5T  0 disk
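
If you do not want to SSH into the nodes directly, the same check can be done through a debug pod. A minimal sketch, assuming the three storage nodes from above:

# Run lsblk on each storage node through a debug pod
for node in compute-0 compute-1 compute-2; do
  oc debug node/$node -- chroot /host lsblk
done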

Now create the LocalVolume resources that will provision PVs for the local disks

NOTE: In our example the NVMe disks had two different device names, which is why both are listed in the local-block entry.

Make sure to adjust the devicePaths in both LocalVolume instances as necessary. The local-block LocalVolume should target your NVMe drives, and local-fs should target the 16GB hard disk.

➜  website git:(master) ✗ cat <<EOF | oc create -n local-storage -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
 name: local-block
 namespace: local-storage
spec:
 nodeSelector:
   nodeSelectorTerms:
   - matchExpressions:
       - key: cluster.ocs.openshift.io/openshift-storage
         operator: Exists
 storageClassDevices:
   - storageClassName: local-block
     volumeMode: Block
     devicePaths:
       - /dev/nvme0n1
       - /dev/nvme1n1
EOF
localvolume.local.storage.openshift.io/local-block created

➜  website git:(master) ✗ cat <<EOF | oc create -n local-storage -f -
apiVersion: local.storage.openshift.io/v1
kind: LocalVolume
metadata:
 name: local-fs
 namespace: local-storage
spec:
 nodeSelector:
   nodeSelectorTerms:
   - matchExpressions:
       - key: cluster.ocs.openshift.io/openshift-storage
         operator: Exists
 storageClassDevices:
   - storageClassName: local-fs
     fsType: xfs
     volumeMode: Filesystem
     devicePaths:
       - /dev/sdb
EOF
localvolume.local.storage.openshift.io/local-fs created

Now verify that the DaemonSets, PVs, and Pods exist and look similar to the output below

➜  website git:(master) ✗ oc get ds
NAME                            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
local-block-local-diskmaker     3         3         3       3            3           <none>          4m49s
local-block-local-provisioner   3         3         3       3            3           <none>          4m49s
local-fs-local-diskmaker        3         3         3       3            3           <none>          16s
local-fs-local-provisioner      3         3         3       3            3           <none>          16s
➜  website git:(master) ✗ oc get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM   STORAGECLASS   REASON   AGE
local-pv-3db3ff28   1490Gi     RWO            Delete           Available           local-block             3m28s
local-pv-53f7eacf   16Gi       RWO            Delete           Available           local-fs                6s
local-pv-6f7637fd   1490Gi     RWO            Delete           Available           local-block             4m1s
local-pv-88d91069   16Gi       RWO            Delete           Available           local-fs                6s
local-pv-c279a4a2   16Gi       RWO            Delete           Available           local-fs                6s
local-pv-cdfce476   1490Gi     RWO            Delete           Available           local-block             3m54s
➜  website git:(master) ✗ oc get po
NAME                                      READY   STATUS    RESTARTS   AGE
local-block-local-diskmaker-2fjjc         1/1     Running   0          5m19s
local-block-local-diskmaker-fm2xj         1/1     Running   0          5m19s
local-block-local-diskmaker-qj2t4         1/1     Running   0          5m20s
local-block-local-provisioner-k5mlj       1/1     Running   0          5m20s
local-block-local-provisioner-pvgm2       1/1     Running   0          5m20s
local-block-local-provisioner-t6bwp       1/1     Running   0          5m20s
local-fs-local-diskmaker-jxdbk            1/1     Running   0          47s
local-fs-local-diskmaker-rwmmv            1/1     Running   0          47s
local-fs-local-diskmaker-z4lh4            1/1     Running   0          47s
local-fs-local-provisioner-9w4jg          1/1     Running   0          47s
local-fs-local-provisioner-kkqxq          1/1     Running   0          47s
local-fs-local-provisioner-xsn6v          1/1     Running   0          47s
local-storage-operator-77f887bfd9-t9lx7   1/1     Running   0          11m
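
The local-storage operator also creates a StorageClass for each storageClassDevices entry. As a final check, both classes should be listed; they should use the kubernetes.io/no-provisioner provisioner:

oc get storageclass local-block local-fs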

Setting up OCS to use the local disks

Create the OCS namespace

➜  website git:(master) ✗ cat << EOF | oc create -f -
apiVersion: v1
kind: Namespace
metadata:
 labels:
   openshift.io/cluster-monitoring: "true"
 name: openshift-storage
spec: {}
EOF
namespace/openshift-storage created

Now go to the OpenShift web UI and install OCS through OperatorHub. Make sure to select the openshift-storage namespace as the install target.
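
Before creating the StorageCluster, you can confirm from the CLI that the operator installation finished; the OCS ClusterServiceVersion should eventually report the phase Succeeded (the exact CSV name depends on the installed version):

oc get csv -n openshift-storage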

After OCS is successfully installed, create the StorageCluster as shown below:

➜  website git:(master) ✗ cat << EOF | oc create -f -
   apiVersion: ocs.openshift.io/v1
   kind: StorageCluster
   metadata:
     namespace: openshift-storage
     name: ocs-storagecluster
   spec:
     manageNodes: false
     monPVCTemplate:
       spec:
         storageClassName: local-fs
         accessModes:
         - ReadWriteOnce
         resources:
           requests:
             storage: 10Gi
     resources:
       mon:
         requests: {}
         limits: {}
       mds:
         requests: {}
         limits: {}
       rgw:
         requests: {}
         limits: {}
       mgr:
         requests: {}
         limits: {}
       noobaa-core:
         requests: {}
         limits: {}
       noobaa-db:
         requests: {}
         limits: {}
     storageDeviceSets:
     - name: deviceset-a
       count: 3
       resources:
         requests: {}
         limits: {}
       placement: {}
       dataPVCTemplate:
         spec:
           storageClassName: local-block
           accessModes:
           - ReadWriteOnce
           volumeMode: Block
           resources:
             requests:
               storage: 500Gi
       portable: false
EOF

Now wait for the OCS cluster to initialize. You can watch the installation with:

watch oc get po -n openshift-storage
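
Once the pods have settled, you can also check the overall cluster status and the storage classes OCS creates for applications to consume. The class names are what OCS 4.x typically creates (for example ocs-storagecluster-ceph-rbd and ocs-storagecluster-cephfs); verify them in your own cluster:

oc get storagecluster -n openshift-storage
oc get storageclass | grep ocs-storagecluster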

Q&A

Why use VMDirectPath I/O and not RDM?

Directly attached block devices cannot be used for RDM, as stated in this VMware document.

How can I ensure the disks are clean before I use them with OCS?

You can run sudo sgdisk --zap-all /dev/nvmeXnX inside your VMs before using them. If you have already installed OCS and the OSD prepare jobs are failing, you can safely run this command and the prepare jobs will retry.
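
For example, to wipe the NVMe drive on all three storage nodes in one pass, you could reuse the SSH loop from earlier. This is only a sketch and it is destructive: double-check the device name on each node first (in our environment one node used nvme1n1 instead of nvme0n1):

# WARNING: destructive; wipes the partition table on the target device
for i in $(seq 4 6); do
  ssh core@10.70.56.9$i sudo sgdisk --zap-all /dev/nvme0n1
done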

Additional Resources

OpenShift Container Storage: openshift.com/storage

OpenShift | Storage YouTube Playlist

OpenShift Commons ‘All Things Data’ YouTube Playlist

Feedback

To find out more about OpenShift Container Storage or to take a test drive, visit https://www.openshift.com/products/container-storage/.

If you would like to learn more about what the OpenShift Container Storage team is up to or provide feedback on any of the new 4.3 features, take this brief 3-minute survey.