In a previous post on the Red Hat Developer's Blog, I wrote about the multiple layers of security available when deploying Red Hat Data Grid on Red Hat OpenShift. Another challenging problem I see customers face is performing a no-downtime upgrade of Red Hat Data Grid images (published on the Red Hat Container Catalog). That's what we're going to tackle in this post.

If you're new to it, Red Hat Data Grid is an in-memory, distributed, NoSQL datastore solution. With it, your applications can access, process, and analyze data at in-memory speed, and it is designed to deliver a superior user experience compared to traditional data stores like relational databases. In-memory data grids have a variety of use cases in today's environments, such as fast data access for low-latency apps, storing objects (NoSQL) in a datastore, achieving linear scalability with data distribution/partitioning, and keeping data highly available across geographies.

Red Hat Data Grid runs on OpenShift like any other application. However, Data Grid is a stateful application, and upgrading stateful applications can be challenging in the container world. The clustering capabilities of Data Grid add yet another layer of complexity.

Red Hat releases container images for a number of its products on its Container Catalog website. Red Hat provides a container health index for each image and updates that index as new vulnerabilities are found or new versions of the product are released. So an image with an A rating on the container health index today may not hold that rating six months down the line, as new issues surface in the field.

Why "Rolling" upgrades are not suitable for Red Hat Data Grid?

For clustered applications, a rolling upgrade generally does make sense. However, there is a critical detail in the templates Red Hat provides for deploying Data Grid. Here are the templates for Data Grid version 7. If you open any template file, you will see that the deployment strategy is to "recreate" the Data Grid pods, not to roll them: two pods from different major versions are not expected to work together in a single Data Grid cluster. The sections below outline how you can still upgrade Data Grid versions with no downtime.
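If you already have Data Grid deployed from one of these templates, you can confirm this on the resulting deployment configuration. The name datagrid-app below is just a placeholder for whatever your DeploymentConfig is called:

$ # Print the deployment strategy; the out-of-the-box templates set it to Recreate
$ oc get dc datagrid-app -o jsonpath='{.spec.strategy.type}'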

The objective here is to upgrade Red Hat Data Grid in place with no downtime. While we have experienced no-downtime upgrades with this method, it may not work in every environment or with every application. We strongly recommend practicing it in a development or test environment before attempting it in production.

Four Steps Designed to Upgrade Data Grid

The Operator Framework, which appears in OpenShift 4, should make the upgrade process easier. OperatorHub lists the operators available today for various products. A Data Grid Operator could provide the capability to install, upgrade, scale, monitor, back up, and restore the application.

However, until we have an Operator for Red Hat Data Grid, we need to find an alternative way to upgrade from version 6 to version 7. The Red Hat Data Grid Operator is still in development at this time.

So, let's assume we have deployed our Data Grid (which we'll call version 1, or v1) into our OpenShift cluster. Another application accesses the data grid via a route (it could just as well be a service URL; it does not matter). We now need to upgrade to version 2 (v2) without losing data.

If you would like to play with the upgrade process, have a look at “Orchestrating JBoss Data Grid Upgrades in Openshift/Kubernetes” on GitHub. There, you can find code snippets and commands to upgrade from one version to another. I recommend trying this process first in your dev/test environment before rolling it out in production.

Step 1 - Deploy version 2 and use "Remote Store"

We need to do the following while deploying v2:

  • Define all caches in v2 with a "remote store." A remote store stores and loads its data from a cache configured in another data grid.
  • Provide the details (service address and port) of v1 while configuring the remote store in v2.

We cannot deploy v2 in the same way we deployed v1. At this point in time, the templates do not expose a way to declare a remote cache store and some other required details. Therefore, we need an alternative way to deploy v2: we will deploy it using a custom configuration file. Here is what the process looks like:

  • Define a custom configuration file and name it standalone.xml
    • A sample file is located here.
    • See how "mycache" cache is defined here.

 

<distributed-cache name="mycache">
    <remote-store cache="mycache" socket-timeout="60000"
         tcp-no-delay="true" protocol-version="2.6" shared="true"
         hotrod-wrapping="true" purge="false" passivation="false">
        <remote-server outbound-socket-binding="remote-store-hotrod-server"/>
    </remote-store>
</distributed-cache>

    • Define the remote data grid server. Replace the Data Grid service URL with the service IP of v1 running in OpenShift (the commands after this list show one way to look it up).

 

<outbound-socket-binding name="remote-store-hotrod-server">
    <remote-destination host="<REPLACE SOURCE Data Grid SERVICE URL>" port="11333"/>
</outbound-socket-binding>

 

  • Define a config map whose data is the standalone.xml file (the "oc create configmap --from-file ..." construct; see the commands after this list).
  • Create a new template for deploying v2. This template includes instructions to use the config map and mount it at the /opt/datagrid/standalone/configuration/user location.
    • A sample template file is located here.
    • Search for config-volume and observe that the "datagrid-config" configmap is mounted at the above location.
    • We set the deployment strategy to "Rolling." This is a change from the out-of-the-box templates provided by Red Hat, which set the deployment strategy to "Recreate."
    • We also set the "minReadySeconds" parameter to 60, as shown in the sample template file above.
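For reference, here is a sketch of looking up the v1 service IP and creating the config map. The service name datagrid-v1 is an assumption (use whatever service your v1 deployment exposes); datagrid-config is the configmap name used in the sample template:

$ # Find the service IP of the v1 data grid to put into the remote-destination element
$ oc get svc datagrid-v1 -o jsonpath='{.spec.clusterIP}'

$ # Package the customized standalone.xml as a config map
$ oc create configmap datagrid-config --from-file=standalone.xml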

Once we have made these changes, we are ready to deploy v2. After a successful deployment, we need to remap the route (or the service selector) so that traffic starts flowing to v2's service IP (or to v2's pods via the selector). Once this change is complete, the next request to the data grid goes to v2, and v2 will load data from v1. The figure below describes this state.
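To actually switch the traffic, repointing the existing route at the v2 service is often enough. The following is only a sketch; the route name datagrid and the service name datagrid-v2 are assumptions, so substitute the names used in your project:

$ # Repoint the existing route from the v1 service to the v2 service
$ oc patch route datagrid -p '{"spec":{"to":{"name":"datagrid-v2"}}}'

If the client uses a service URL instead of a route, you can achieve the same effect by updating that service's selector to match the v2 pods.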

Step 2 - Syncing data

Once v2 is deployed, we need to open a remote shell into one of the v2 pods and use CLI commands to copy/sync all data from v1 to v2. Run these commands to perform the data sync:

$ oc rsh <v2 pod name>

sh-4.2$ /opt/datagrid/bin/cli.sh --connect controller=localhost:9990 -c "/subsystem=datagrid-infinispan/cache-container=clustered/distributed-cache=mycache:synchronize-data(migrator-name=hotrod)"

{"outcome" => "success"}

Run the above command for every cache defined in the data grid.
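If the data grid defines more than one cache, a small shell loop inside the pod can issue the same CLI call for each one. This is only a sketch; "mycache" comes from the sample configuration above, while "anothercache" is a hypothetical second cache name:

sh-4.2$ for CACHE in mycache anothercache; do \
            /opt/datagrid/bin/cli.sh --connect controller=localhost:9990 \
              -c "/subsystem=datagrid-infinispan/cache-container=clustered/distributed-cache=$CACHE:synchronize-data(migrator-name=hotrod)"; \
        done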

Step 3 - Rolling Upgrade

Now that the data is synced, it would seem natural to delete v1. However, we can't just delete it yet, because the cache configuration in v2 still uses a remote store and refers to the service IP of v1. If v1 is deleted and a request comes in for a key that does not exist in v2, v2 will try to load it from v1 and the request will fail.

We need to get rid of this dependency. A rolling upgrade of v2 with a new configuration (no version change) removes it. Make the following changes to the existing deployment configuration of v2:

  • Change the cache definition - Edit the configmap (which holds the cache configuration) and change the cache definition from

<distributed-cache name="mycache">
    <remote-store cache="mycache" socket-timeout="60000"
         tcp-no-delay="true" protocol-version="2.6" shared="true"
         hotrod-wrapping="true" purge="false" passivation="false">
        <remote-server outbound-socket-binding="remote-store-hotrod-server"/>
    </remote-store>
</distributed-cache>

            to

<distributed-cache name="mycache" mode="SYNC"/>

  • Change the remote-destination - Edit the configmap and change the remote destination host from

  <remote-destination host="172.30.232.114" port="11333"/>

            to

<remote-destination host="remote-host" port="11333"/>

Note that 172.30.232.114 above is the service IP of v1.

Roll out the changes to the v2 deployment config after completing the edits above. Since we defined our deployment strategy as "Rolling," a rolling update will start. Remember the "minReadySeconds" parameter we set in the previous step: we set it because we don't want the existing pod killed the moment a new pod comes up. Without that minimum wait, it is quite likely that a new pod will not finish replicating data (when it joins the cluster) before OpenShift kills the existing pod and continues the rolling upgrade, which can cause data loss.
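The rollout itself is only a couple of commands. This is a sketch, assuming the configmap is named datagrid-config and the v2 deployment config is named datagrid-v2:

$ # Update the cache definition and remote-destination in the mounted standalone.xml
$ oc edit configmap datagrid-config

$ # Trigger a new deployment of v2 and watch the rolling update complete
$ oc rollout latest dc/datagrid-v2
$ oc rollout status dc/datagrid-v2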

Eventually, when the rolling update completes, you will have new pods with local cache definitions that no longer refer to v1. We have successfully removed the dependency. The final state looks like the figure below:

Step 4 - Delete Version 1

This last step is pretty straightforward: just delete version 1. At this point, you have successfully migrated the data grid from one major version to another. The figure here represents the stages we went through.
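If the v1 objects were created from a template with a common application label, one labeled delete can clean everything up. A sketch, assuming the label application=datagrid-v1 (adjust to whatever label or resource names your v1 deployment actually carries):

$ # Remove the v1 deployment config, pods, service, and other labeled objects
$ oc delete all -l application=datagrid-v1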

If you would like to practice these recommendations, see “Orchestrating JBoss Data Grid Upgrades on Openshift/Kubernetes” on GitHub. There, you can find code snippets and commands to upgrade from one version to another. The examples in the GitHub repository use Data Grid version 7.2.

 

