This is a guest post by Jonathan Katz, Vice President of Platform Engineering at Crunchy Data & PostgreSQL Core Team Member.

Edge computing helps businesses become more proactive and dynamic by placing applications and processing power closer to the devices and sensors that create or consume data. This enables them to gather, analyze, and turn large data flows into actionable insights, faster. We see this in delivery services that produce vast arrays of sensor data and benefit from the rich analytics that result from aggregating and processing this data. These connected vehicles and the supporting fleet management systems stand to benefit from cloud-native data analytic services using AI-powered intelligent applications to gather insights and solve problems in real time.

One powerful application of the potential for cloud-native data analytic services built from PostgreSQL and OpenShift is spatial data analytics of connected vehicles. 

The raw data generated by fleets of moving objects fitted with sensors (aka “telematics”) is voluminous and fast-changing. This data contains many analytical insights that can be mined using systems that combine stream processing to extract data of interest with edge databases to collate and analyze that data.

Implementing databases and data analytics with cloud-native applications involves several steps and tools, often including data ingestion, preliminary data storage, data preparation, data storage for analytics, and data analysis. An open, adaptable architecture will help you execute this process more effectively. This architecture requires several key technologies. Container and Kubernetes platforms, such as Red Hat OpenShift, provide a consistent foundation for deploying databases, data analytics tools, and cloud-native applications across infrastructure, as well as self-service capabilities for developers and integrated compute acceleration.

Change Data Capture (CDC) captures row-level changes to database tables and passes corresponding change events to a data streaming bus. Applications can read these change event streams and access these change events in the order in which they occurred. Thus, CDC helps to bridge traditional data stores and new cloud-native event-driven architectures. 

Red Hat Integration includes the Red Hat Change Data Capture (Debezium) connectors for Red Hat AMQ Streams, which are deployed and managed using Kafka Connect. Debezium is a set of distributed services that captures row-level changes in databases so that applications can see and respond to those changes. Debezium is built upon the Apache Kafka project and uses Kafka to transport the changes from one system to another.

The most interesting aspect of Debezium is that at its core it uses CDC to capture the data and push it into Kafka. The source PostgreSQL database remains untouched in the sense that we do not have to add triggers or log tables. This is a huge advantage, as triggers and log tables degrade performance. In addition, PostgreSQL manages the changes in such a way that we do not lose them if we restart.

This makes the system much more flexible. If you want to add a new microservice, simply subscribe to the topic in Kafka that is pertinent to the service.

Let’s dive in and see how to actually make this work.

Deploying Red Hat AMQ Streams, Red Hat Integration, and Crunchy PostgreSQL on OpenShift

Red Hat OpenShift and the Kubernetes Operator pattern provide the ability to deploy this cloud-native data analytic solution using the automation and standardization afforded by containers and Kubernetes.  

The following section will provide steps for deploying Red Hat AMQ Streams, Red Hat Integration, and Crunchy PostgreSQL on OpenShift using the Red Hat Marketplace and Operators.

Set up the Crunchy Postgres Operator. We will create a new project called "pgo" that will serve as the home of both our PostgreSQL cluster and the Debezium components.

Create a new project that will be used for this exercise, which will include the Postgres Operator:

oc new-project pgo

You can then proceed to install the Postgres Operator. Be sure to install it into the "pgo" namespace.
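Before continuing, it is worth confirming that the Operator Pod is up and running. A quick check, assuming the default Deployment name of "postgres-operator":

oc get deployment postgres-operator -n pgo
oc get pods -n pgo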

(Prior to PostgreSQL Operator version 4.6.2, on OpenShift you will need to edit the "pgo.yaml" entry in the "pgo-config" ConfigMap, change "DisableFSGroup" to "true", and restart the PostgreSQL Operator Pod.)
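If you are on one of those earlier versions, the change can be made directly with "oc". The commands below are a sketch; they assume the Operator Pod carries the default "name=postgres-operator" label so that deleting it lets the Deployment recreate it:

# open the ConfigMap and set DisableFSGroup to "true" in the pgo.yaml entry
oc -n pgo edit configmap pgo-config

# restart the Operator by deleting its Pod; the Deployment recreates it
oc -n pgo delete pod --selector=name=postgres-operator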

Once the Postgres Operator is up and running and "DisableFSGroup" is set correctly, go ahead and create a PostgreSQL cluster named "hippo" with some explicit passwords set. We will return and use this cluster later:

pgo create cluster hippo -n pgo --password-superuser=datalake --password=datalake
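Provisioning takes a few moments. You can watch the cluster come up and confirm it accepts connections with the pgo client, for example:

pgo show cluster hippo -n pgo
pgo test hippo -n pgo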

As mentioned previously, Debezium uses Kafka to transport changes between systems, so we will need to set up Kafka. This can be done using the Strimzi Kafka Operator. First, download the Strimzi Kafka Operator to your local environment:

export STRIMZI_VERSION=0.18.0
git clone --depth 1 -b $STRIMZI_VERSION https://github.com/strimzi/strimzi-kafka-operator
cd strimzi-kafka-operator

To install the Strimzi Kafka Operator, you will need to make some substitutions in the "install/cluster-operator" directory. Find all values that say:

"namespace: myproject"
and replace them with:
"namespace: pgo"

Then run the following commands to set up the Strimzi Kafka Operator:

oc create -f install/cluster-operator && oc create -f examples/templates/cluster-operator

Once this operator is deployed, run the following command to set up a Kafka broker:

oc process strimzi-ephemeral -p CLUSTER_NAME=broker -p ZOOKEEPER_NODE_COUNT=1 -p KAFKA_NODE_COUNT=1 -p KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=1 -p KAFKA_TRANSACTION_STATE_LOG_REPLICATION_FACTOR=1 | oc apply -f -
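ZooKeeper and Kafka can take a minute or two to start. You can watch the Pods for the "broker" cluster come up using the standard Strimzi labels:

oc get pods -n pgo --selector=strimzi.io/cluster=broker -w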

To run Debezium in a container, we need to build an image that contains the connector that we require, in this case the PostgreSQL connector. Download the Debezium PostgreSQL connector plugin to your local environment:

export CONNECTOR_VERSION="1.4.2.Final"
curl "https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/${CONNECTOR_VERSION}/debezium-connector-postgres-${CONNECTOR_VERSION}-plugin.tar.gz" | tar xvz

Next, create a Dockerfile that includes the Debezium connector as a Kafka plugin:

FROM strimzi/kafka:0.18.0-kafka-2.5.0
USER root:root
RUN mkdir -p /opt/kafka/plugins/debezium
COPY ./debezium-connector-postgres/ /opt/kafka/plugins/debezium/
USER 1001

Build and push the image to your local repository. For example, if you are running CodeReady Containers, this would look similar to:

podman build . -t default-route-openshift-image-registry.apps-crc.testing/pgo/connect-debezium
podman push default-route-openshift-image-registry.apps-crc.testing/pgo/connect-debezium

Enable local image lookup so that the image can be referenced from the image stream directly:

oc set image-lookup connect-debezium
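If the build and push succeeded, the tag should now be visible in the "pgo" project's image stream:

oc get imagestream connect-debezium -n pgo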

Now, we need to set up a secret that contains the credentials for Debezium to access the Postgres cluster. We can do this with the command below:

cat <<EOF > debezium-postgres-credentials.properties
postgres_username: postgres
postgres_password: datalake
EOF
oc create secret generic postgres-credentials \
 --from-file=debezium-postgres-credentials.properties

Next, create a Kafka Connect custom resource. This will allow Debezium to transfer data between your PostgreSQL cluster and Kafka:

apiVersion: kafka.strimzi.io/v1beta1
kind: KafkaConnect
metadata:
  name: connect-debezium
  annotations:
    # use-connector-resources configures this KafkaConnect
    # to use KafkaConnector resources to avoid
    # needing to call the Connect REST API directly
    strimzi.io/use-connector-resources: "true"
spec:
  image: pgo/connect-debezium
  replicas: 1
  bootstrapServers: broker-kafka-bootstrap:9093
  tls:
    trustedCertificates:
      - secretName: broker-cluster-ca-cert
        certificate: ca.crt
  config:
    config.storage.replication.factor: 1
    offset.storage.replication.factor: 1
    status.storage.replication.factor: 1
    config.providers: file
    config.providers.file.class: org.apache.kafka.common.config.provider.FileConfigProvider
  externalConfiguration:
    volumes:
      - name: connector-config
        secret:
          secretName: postgres-credentials
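Save the manifest to a file, for example "kafka-connect.yaml", and apply it, then wait for the Kafka Connect Pod to become ready (the selector below assumes the usual Strimzi cluster label):

oc apply -f kafka-connect.yaml
oc get pods -n pgo --selector=strimzi.io/cluster=connect-debezium -w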

Let's set up some data. Connect to the "hippo" database and create some entries:

oc exec -it $(oc get pods --selector=pg-cluster=hippo,role=master,pgo-pg-database -o name) -- psql hippo

CREATE TABLE users (
 id int GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
 username text NOT NULL,
 email text NOT NULL
);
INSERT INTO users (username, email) VALUES ('hippo', 'hippo@example.com'), ('zebra', 'zebra@example.com');

Finally, we need to set up the Debezium connector, named "inventory-connector" below, which will actually facilitate the extraction of data from the PostgreSQL database into Kafka:

oc exec -i -c kafka broker-kafka-0 -- curl -X POST \
   -H "Accept:application/json" \
   -H "Content-Type:application/json" \
   http://connect-debezium-connect-api:8083/connectors -d @- <<EOF
{
   "name": "inventory-connector",
   "config": {
     "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
     "plugin.name": "pgoutput",
     "database.hostname": "hippo",
     "database.port": "5432",
     "database.user": "postgres",
     "database.password": "datalake",
     "database.dbname" : "hippo",
     "database.server.name": "hippo",
     "table.whitelist": "public.users"
   }
}
EOF
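Kafka Connect exposes a REST API on the same service, which we can use to confirm that the connector registered and is running:

oc exec -i -c kafka broker-kafka-0 -- curl -s http://connect-debezium-connect-api:8083/connectors

oc exec -i -c kafka broker-kafka-0 -- curl -s http://connect-debezium-connect-api:8083/connectors/inventory-connector/status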

We can verify that the data is streaming into Kafka by using the "kafka-console-consumer.sh", which acts as our sink. For example:

oc exec -it -c kafka broker-kafka-0 -- /opt/kafka/bin/kafka-console-consumer.sh \
   --bootstrap-server localhost:9092 \
   --from-beginning \
   --property print.key=true \
   --topic hippo.public.users
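Each message on the "hippo.public.users" topic is a Debezium change event carrying the row's "before" and "after" state along with an "op" field that indicates the operation. With the consumer still running, you can insert another row from a second terminal and watch the corresponding event arrive; the values below are just an example:

oc exec -it $(oc get pods --selector=pg-cluster=hippo,role=master,pgo-pg-database -o name) -- \
   psql hippo -c "INSERT INTO users (username, email) VALUES ('giraffe', 'giraffe@example.com');"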

Build Your Data Analytics Pipeline on OpenShift With the Red Hat Marketplace

Red Hat AMQ Streams, Red Hat Integration, and Crunchy PostgreSQL for OpenShift are available in the Red Hat Marketplace, an open cloud marketplace that enables enterprise companies to easily discover and access certified software for container-based environments across the hybrid cloud. Through the marketplace, customers can take advantage of responsive support, streamlined billing, and simplified governance with a single dashboard that provides visibility and insights across clouds.

