Introduction

OpenShift comes by default with a static Grafana dashboard, which will present cluster metrics to cluster administrators. It is not possible to customize this Grafana instance.

However, many customers would like to create their own dashboards, monitoring, and alerting while making the most of what OpenShift already provides, without installing a separate monitoring stack.

So how can you create your own queries? How can you visualize them on custom dashboards, without the need to install Prometheus or AlertManager a second time?

The solution is simple: since OpenShift 4.5 (as Technology Preview) and OpenShift 4.6 (as GA), the default monitoring stack has been extended to support monitoring of user-defined projects. This additional configuration lets you monitor your own projects.

In this article, we will see how to deploy the Grafana Operator and what issues can occur when connecting Grafana to OpenShift monitoring.

Overview

As a developer in OpenShift, you can create an application that exposes custom metrics at the endpoint /metrics. Here is an example from the official OpenShift documentation:

# HELP http_requests_total Count of all HTTP requests
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 4
http_requests_total{code="404",method="get"} 2
# HELP version Version information about this binary
# TYPE version gauge
version{version="v0.1.0"} 1
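
As a rough illustration of how an application could produce such output, here is a minimal Python sketch that renders a counter in this text format. The function name render_counter and its arguments are hypothetical and not part of any client library; a real application would normally use a Prometheus client library instead.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the counter value.
    This is an illustrative helper, not a client-library API.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "http_requests_total",
    "Count of all HTTP requests",
    {
        (("code", "200"), ("method", "get")): 4,
        (("code", "404"), ("method", "get")): 2,
    },
)
print(text)
```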

 

This metric can then be viewed inside OpenShift in the developer view under the menu Monitoring. If you go to "Monitoring > Metrics" and select "Custom Query" from the drop-down, you can enter, for example, the following PromQL query:

sum(rate(http_requests_total[2m]))

The following graph will be the result:

Figure 1. Custom Query
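
To get a feeling for what sum(rate(http_requests_total[2m])) computes, the following Python sketch approximates it by hand. It is a simplification: the sample values are made up, and the helper simple_rate ignores counter resets and the extrapolation that Prometheus applies at the window boundaries.

```python
def simple_rate(samples):
    """Per-second increase between the first and last sample of a window.

    samples is a list of (timestamp_seconds, counter_value) pairs.
    Simplification: no counter-reset handling and no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical samples of two http_requests_total series over two minutes.
series = {
    'code="200"': [(0, 4), (60, 10), (120, 16)],
    'code="404"': [(0, 2), (60, 2), (120, 5)],
}

# sum(...) adds up the per-series rates into a single value.
total = sum(simple_rate(s) for s in series.values())
print(total)
```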

This is great! But what happens if a customer would like to see their very own super-fancy Grafana dashboard? You cannot change the cluster dashboard. However, you can install your own Grafana instance, and one way to do so is by using the community Grafana Operator.

Architecture

The following image depicts a brief overview of the user-defined workload monitoring, Grafana Operator, and an example application:

  1. openshift-monitoring: This is the default cluster monitoring, which is always installed along with the cluster. It provides Prometheus and Thanos Querier and (not in the picture) a Grafana dashboard, which shows cluster metrics.
  2. openshift-user-workload-monitoring: This is responsible for customer workload monitoring. It deploys its own instance of Prometheus, which is queried by the Thanos Querier. This instance scrapes custom metrics based on so-called ServiceMonitor objects, which are defined in no. 4.
  3. Grafana Operator namespace: This is the namespace where the Grafana Operator is deployed. It holds several custom resources, like GrafanaDashboard and GrafanaDataSource. Grafana uses the Thanos Querier to get and visualize the appropriate metrics.
  4. An example namespace (ns1): This provides an example application and a ServiceMonitor object. The ServiceMonitor is used to scrape the metrics.

Before We Begin

The following should be prepared before you get started:

1: OpenShift 4.6 (4.5 is also possible. However, user-defined workload monitoring is only available as TechPreview there.)
2: Enabled user-defined workload monitoring.

A: Create and apply the following manifest for the cluster monitoring:

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true

B: Create and apply the following manifest for the user-defined monitoring:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |

3: Create an example project with user-defined workload monitoring. In this use case, we create the namespace “ns1” with an application that provides example metrics: 

apiVersion: v1
kind: Namespace
metadata:
  name: ns1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prometheus-example-app
  name: prometheus-example-app
  namespace: ns1
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-example-app
  template:
    metadata:
      labels:
        app: prometheus-example-app
    spec:
      containers:
      - image: quay.io/brancz/prometheus-example-app:v0.2.0
        imagePullPolicy: IfNotPresent
        name: prometheus-example-app
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: prometheus-example-app
  name: prometheus-example-app
  namespace: ns1
spec:
  ports:
  - port: 8080
    protocol: TCP
    targetPort: 8080
    name: web
  selector:
    app: prometheus-example-app
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: prometheus-example-monitor
  name: prometheus-example-monitor
  namespace: ns1
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
  selector:
    matchLabels:
      app: prometheus-example-app
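
A detail that is easy to miss: the ServiceMonitor's spec.selector.matchLabels is compared against the labels of the Service (app: prometheus-example-app), not against the ServiceMonitor's own label (k8s-app: prometheus-example-monitor). The following Python sketch models that matching rule in a simplified way; the helper name matches is made up for illustration.

```python
def matches(match_labels, labels):
    """True if every key/value pair in match_labels is present in labels."""
    return all(labels.get(k) == v for k, v in match_labels.items())


# Labels copied from the manifests above.
service_labels = {"app": "prometheus-example-app"}
monitor_selector = {"app": "prometheus-example-app"}
monitor_own_labels = {"k8s-app": "prometheus-example-monitor"}

# The selector picks up the Service ...
print(matches(monitor_selector, service_labels))
# ... and has nothing to do with the ServiceMonitor's own labels.
print(matches(monitor_selector, monitor_own_labels))
```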

 

NOTE: Further information can be found in the official documentation at https://docs.openshift.com/container-platform/4.6/monitoring/enabling-monitoring-for-user-defined-projects.html.

Deploy Custom Grafana Operator

As for any community operator, the following must be considered:

 

Community Operators are operators that have not been vetted or verified by Red Hat. Community Operators should be used with caution because their stability is unknown. Red Hat provides no support for Community Operators.

 

The community Grafana Operator must be deployed to its own namespace, for example grafana. Create this namespace first (oc new-project grafana) and search and install the Grafana Operator from the OperatorHub. 

You can use the default values; just be sure to select the desired namespace.

After a few minutes, the operator should be available:

Figure 2. Installed Community Grafana Operator

Setup Grafana Operator

Before we can use Grafana to draw beautiful graphs, it must be configured: we need to create an instance of Grafana. Ideally, it leverages OpenShift OAuth, so that user accounts do not have to be created manually inside Grafana.

OAuth requires some objects, which must be created before the actual Grafana instance. The following YAMLs are taken from the operator documentation. 

Create the following inside the Grafana namespace:

  1. a session secret for the proxy (change the password!)
  2. a cluster role grafana-proxy
  3. a cluster role binding for the role
  4. a config map injecting trusted CA bundles

apiVersion: v1
data:
  session_secret: Y2hhbmdlIG1lCg==
kind: Secret
metadata:
  name: grafana-k8s-proxy
type: Opaque
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: grafana-proxy
rules:
  - apiGroups:
      - authentication.k8s.io
    resources:
      - tokenreviews
    verbs:
      - create
  - apiGroups:
      - authorization.k8s.io
    resources:
      - subjectaccessreviews
    verbs:
      - create
---
apiVersion: authorization.openshift.io/v1
kind: ClusterRoleBinding
metadata:
  name: grafana-proxy
roleRef:
  name: grafana-proxy
subjects:
  - kind: ServiceAccount
    name: grafana-serviceaccount
    namespace: grafana
userNames:
  - system:serviceaccount:grafana:grafana-serviceaccount <1>
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
  name: ocp-injected-certs


Note: Be sure to use your selected namespace. In this example, the namespace “grafana” is used.
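
The session_secret value in the Secret above is only a placeholder: it is the base64 encoding of the text "change me". One possible way to generate a replacement is sketched below in Python. The helper name new_session_secret and the length of 32 characters are assumptions for illustration; check the proxy documentation for any length requirement.

```python
import base64
import secrets


def new_session_secret(num_chars=32):
    """Return a random ASCII string, base64-encoded for a Secret's data field.

    The length of 32 characters is an assumption, not a documented requirement.
    """
    raw = secrets.token_urlsafe(num_chars)[:num_chars].encode("ascii")
    return base64.b64encode(raw).decode("ascii")


# The placeholder from the manifest decodes to a well-known value.
print(base64.b64decode("Y2hhbmdlIG1lCg=="))

# A fresh value to paste into data.session_secret.
print(new_session_secret())
```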

Now you can create the following instance under: 

"Installed Operators > Grafana Operator > Grafana > Create Grafana > YAML View" (or, as an alternative, via the CLI)

apiVersion: integreatly.org/v1alpha1
kind: Grafana
metadata:
  name: grafana-oauth
  namespace: grafana
spec:
  config: <1>
    auth:
      disable_login_form: false
      disable_signout_menu: true
    auth.anonymous:
      enabled: false
    auth.basic:
      enabled: true
    log:
      level: warn
      mode: console
    security: <2>
      admin_password: secret
      admin_user: root
  secrets:
    - grafana-k8s-tls
    - grafana-k8s-proxy
  client:
    preferService: true
  dataStorage: <3>
    accessModes:
      - ReadWriteOnce
    class: managed-nfs-storage
    size: 10Gi
  containers: <4>
    - args:
        - '-provider=openshift'
        - '-pass-basic-auth=false'
        - '-https-address=:9091'
        - '-http-address='
        - '-email-domain=*'
        - '-upstream=http://localhost:3000'
        - '-tls-cert=/etc/tls/private/tls.crt'
        - '-tls-key=/etc/tls/private/tls.key'
        - >-
          -client-secret-file=/var/run/secrets/kubernetes.io/serviceaccount/token
        - '-cookie-secret-file=/etc/proxy/secrets/session_secret'
        - '-openshift-service-account=grafana-serviceaccount'
        - '-openshift-ca=/etc/pki/tls/cert.pem'
        - '-openshift-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
        - '-openshift-ca=/etc/grafana-configmaps/ocp-injected-certs/ca-bundle.crt'
        - '-skip-auth-regex=^/metrics'
        - >-
          -openshift-sar={"namespace": "grafana", "resource": "services",
          "verb": "get"} <5>
      image: 'quay.io/openshift/origin-oauth-proxy:4.6'
      name: grafana-proxy
      ports:
        - containerPort: 9091
          name: grafana-proxy
      resources: {}
      volumeMounts:
        - mountPath: /etc/tls/private
          name: secret-grafana-k8s-tls
          readOnly: false
        - mountPath: /etc/proxy/secrets
          name: secret-grafana-k8s-proxy
          readOnly: false
  ingress:
    enabled: true
    targetPort: grafana-proxy
    termination: reencrypt
  service:
    annotations:
      service.alpha.openshift.io/serving-cert-secret-name: grafana-k8s-tls
    ports:
      - name: grafana-proxy
        port: 9091
        protocol: TCP
        targetPort: grafana-proxy
  serviceAccount:
    annotations:
      serviceaccounts.openshift.io/oauth-redirectreference.primary: >-
        {"kind":"OAuthRedirectReference","apiVersion":"v1","reference":{"kind":"Route","name":"grafana-route"}}
  configMaps:
    - ocp-injected-certs
  dashboardLabelSelector:
    - matchExpressions:
        - key: app
          operator: In
          values:
            - grafana

 

<1> Some default settings, which can be modified if required.

<2> A default administrative user.

<3> A data store that uses a persistent volume. Other options would be to use ephemeral storage or another database. This might be especially important if you would like HA for your Grafana.

<4> Container arguments for the OAuth proxy. The openshift-sar line is especially important, since it determines who may log in via OAuth.

<5> Be sure to use your selected namespace.
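
Since the namespace appears inside a JSON string in the -openshift-sar argument, it is easy to mistype when adapting the manifest. As a small convenience, the argument could be generated like this; the helper name openshift_sar_arg is hypothetical, and the defaults simply mirror the values used in the manifest above.

```python
import json


def openshift_sar_arg(namespace, resource="services", verb="get"):
    """Build the -openshift-sar proxy argument for the given namespace."""
    review = {"namespace": namespace, "resource": resource, "verb": verb}
    return "-openshift-sar=" + json.dumps(review)


print(openshift_sar_arg("grafana"))
```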

 

After a few moments, the operator picks up the change and creates a Grafana pod.

Adding a Data Source

The next step is to connect your custom Grafana to Prometheus, or rather to the Thanos Querier. To do so, you add a role to the Grafana service account and create a GrafanaDataSource custom resource.

For now, we will work with the cluster role cluster-monitoring-view. The problem that can result from this is discussed later.

1: Add the role to the Grafana serviceaccount:

oc adm policy add-cluster-role-to-user cluster-monitoring-view -z grafana-serviceaccount

2: Retrieve the token of the service account:

export GRAFANA_NAMESPACE=grafana 

export BEARER_TOKEN=$(oc sa get-token grafana-serviceaccount -n $GRAFANA_NAMESPACE) 

3: Prepare the following yaml file as grafana-datasource.yaml:

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDataSource
metadata:
  name: prometheus-grafanadatasource
  namespace: grafana
spec:
  datasources:
    - access: proxy
      editable: true
      isDefault: true
      jsonData:
        httpHeaderName1: Authorization
        timeInterval: 5s
        tlsSkipVerify: true
      name: Prometheus
      secureJsonData:
        httpHeaderValue1: 'Bearer ${BEARER_TOKEN}'
      type: prometheus
      url: 'https://thanos-querier.openshift-monitoring.svc.cluster.local:9091' <1>
  name: prometheus-grafanadatasource.yaml

<1> The default Thanos Querier URL. This might cause problems (see below).

4: Create the grafana datasource:

envsubst < grafana-datasource.yaml | oc -n $GRAFANA_NAMESPACE apply -f - 
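
The envsubst call replaces the ${BEARER_TOKEN} placeholder in the file with the value of the exported variable before the manifest reaches oc. The Python sketch below imitates that substitution for a single line. It is an illustration only: the token value is made up, and unlike envsubst, which replaces unset variables with an empty string, safe_substitute leaves unknown placeholders untouched.

```python
from string import Template


def render(manifest_text, env):
    """Replace ${NAME} placeholders with values from env (illustrative only)."""
    return Template(manifest_text).safe_substitute(env)


manifest_line = "httpHeaderValue1: 'Bearer ${BEARER_TOKEN}'"
rendered = render(manifest_line, {"BEARER_TOKEN": "example-token"})
print(rendered)
```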

 

The operator will now restart the Grafana pod to apply the changes, which should not take more than a few seconds. Grafana can be used now. Dashboards can be created, but let’s run some tests with PromQL queries instead.

Let’s Test

Log in to your Grafana using OAuth as a cluster administrator.

 

You could also use a non-cluster administrator, if the user is able to GET the services of the Grafana namespace. The reason is the following line in the Grafana custom resource: -openshift-sar={"namespace": "grafana", "resource": "services","verb": "get"}, which defines that OAuth will work for everybody who can get the service. This can be changed according to your needs, but for this test, it is good enough.


Then use the credentials for the admin account, which have been defined while creating the Grafana instance.

You will be logged in now, and since there are no Dashboards, let’s go to Explore to enter some custom PromQL queries. For this instance, we will use our example from above:

sum(rate(http_requests_total[2m]))

Figure 3. First Query

This is looking good.  

Let’s give it another try and group by namespace:

sum(rate(http_requests_total[2m])) by (namespace)

Figure 4. Second Query - showing internal namespace

What is this? I see a namespace that is actually meant for the cluster (openshift-monitoring).

Let’s try another query using a different metric:

sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate) by (namespace)

Figure 5. Third Query - shows even more namespaces

OK, so we have access to all namespaces on the cluster.

Why Do I See All Namespaces?

What does this mean? Well, it means that we have access to all namespaces of the cluster. We see everything. This makes sense since we assigned the cluster role "cluster-monitoring-view" to the serviceaccount of Grafana. 

But what if we want to show only objects from a specific namespace, for example to give developers the ability to create their own dashboards without having view access to the whole cluster?

The first test might be to remove the cluster-monitoring-view privileges from the Grafana serviceaccount. 

This will lead to an error on Grafana itself since it cannot access the Thanos Querier, which we configured with: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091

How does the OpenShift web UI actually handle this when you are a developer and run one of the above queries?

Let’s try that:

Figure 6. Query using the OpenShift UI

It works! It shows the namespace of the developer and only this namespace. When you inspect the actual network traffic, you will see that OpenShift automatically adds the URL parameter namespace=ns1 to the request URL:

https://your-cluster/api/prometheus-tenancy/api/v1/query?namespace=ns1&query=sum%28node_namespace_pod_container%3Acontainer_cpu_usage_seconds_total%3Asum_rate%29+by+%28namespace%29
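
To make the structure of this request easier to see, the URL can be rebuilt from its parts. The Python sketch below does so with the standard library; the helper name tenancy_query_url is made up, and https://your-cluster is the same placeholder host used above.

```python
from urllib.parse import urlencode


def tenancy_query_url(base, namespace, promql):
    """Build the console's tenancy query URL for one namespace."""
    params = urlencode({"namespace": namespace, "query": promql})
    return f"{base}/api/prometheus-tenancy/api/v1/query?{params}"


promql = (
    "sum(node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate)"
    " by (namespace)"
)
url = tenancy_query_url("https://your-cluster", "ns1", promql)
print(url)
```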

This is good information. Let's try this using the Grafana Data Source.

 

It is currently not possible to perform this configuration using the GrafanaDataSource custom resource. Instead, it must be done directly in the data source configuration inside Grafana. There is an open ticket at: https://github.com/integr8ly/grafana-operator/issues/309.


Log in to Grafana as administrator and switch to "Configuration > Data Source > Prometheus". At the very bottom, add namespace=ns1 to the Custom query parameters:


Figure 7. Configure Grafana Data Source

 

At this point, the Grafana serviceaccount still has cluster-monitoring-view privileges.

As you can see in the following image, this configuration did not help:

Figure 8. Query after Data Source has manually been modified

Thanos Querier Versus Thanos Querier

To summarize, in the OpenShift UI everything works, but when using the Grafana dashboard, we see all namespaces from the cluster. Let’s try to find out how OpenShift does this.

When we check the Thanos services we will see three ports:

ports:
    - name: web
      protocol: TCP
      port: 9091
      targetPort: web
    - name: tenancy
      protocol: TCP
      port: 9092
      targetPort: tenancy
    - name: tenancy-rules
      protocol: TCP
      port: 9093
      targetPort: tenancy-rules

 

Currently, we configured port 9091, but there is another one, which is called tenancy. Maybe this is what we need? Let’s try it:

1: Change the GrafanaDataSource custom resource to use port 9092 (instead of 9091). This will restart the pod and remove the custom query parameter we configured earlier.

2: Remove the cluster-role:

oc adm policy remove-cluster-role-from-user cluster-monitoring-view -z grafana-serviceaccount

3: The serviceaccount of Grafana must be able to view the project we want to show in the dashboards. Therefore, allow the Grafana serviceaccount to view the project ns1:

oc adm policy add-role-to-user view system:serviceaccount:grafana:grafana-serviceaccount -n ns1

4: Log into Grafana as administrator and manually change the Data Source and add namespace=ns1 to the setting Custom query parameters.

5: Rerun the query. You will now see only one namespace:

Figure 9. Query with Thanos Querier on port 9092

What Happened?

So what actually happened here? We have two important ports for our Thanos Querier: 9091 and 9092.

When we check the Deployment of the Thanos Querier for these ports, we will see:

For the port 9091 it looks like the following:

spec:
[...]
      containers:
[...]
        - resources:
[...]
          ports:
            - name: web
              containerPort: 9091
              protocol: TCP
[...]
          args:
[...]
            - '-openshift-sar={"resource": "namespaces", "verb": "get"}'

 

There is an OAuth setting that indicates that you need the privilege to GET "namespaces" objects.

The only cluster role that has this privilege, which is also mentioned by the official OpenShift documentation, is cluster-monitoring-view:

- apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: cluster-monitoring-view
  rules:
  - apiGroups:
    - ""
    resources:
    - namespaces
    verbs:
    - get

 

As we have seen above, this will show you all namespaces available on the cluster.
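
The check behind the -openshift-sar argument can be pictured as a lookup in the rules of the roles bound to the caller. The Python sketch below is a deliberately simplified model of that idea and not the actual authorizer: it ignores wildcards, role bindings, and namespacing, and the helper name allows is made up.

```python
def allows(rules, api_group, resource, verb):
    """True if any rule grants the verb on the resource (no wildcard support)."""
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in rules
    )


# Rules copied from the cluster-monitoring-view excerpt above.
cluster_monitoring_view = [
    {"apiGroups": [""], "resources": ["namespaces"], "verbs": ["get"]},
]

# What port 9091 asks for: GET on namespaces.
print(allows(cluster_monitoring_view, "", "namespaces", "get"))
# A service account without any such rule fails the same check.
print(allows([], "", "namespaces", "get"))
```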

When you check port 9092, there is no such OAuth configuration. This port is actually served by the kube-rbac-proxy container. It does not require OAuth, but instead requires the namespace URL parameter.

Details can be found at: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/user-workload-monitoring.md

In short, the whole setup looks like this:

Figure 10. Thanos interconnecting containers

Port 9091 goes directly to Thanos, but requires that you have the cluster-monitoring-view role. Port 9092 does not require this role; instead, you must send the URL parameter namespace=.
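
For illustration, a request against port 9092 could be assembled as in the following Python sketch. It only builds the request object and does not send it. The path /api/v1/query, the use of the service account token as a bearer token, and the helper name build_tenancy_request are assumptions based on the data source configuration shown earlier; the token value is made up.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Same service as in the data source, but on the tenancy port.
THANOS_TENANCY = "https://thanos-querier.openshift-monitoring.svc.cluster.local:9092"


def build_tenancy_request(namespace, promql, token):
    """Prepare (but do not send) a namespace-scoped query request."""
    params = urlencode({"namespace": namespace, "query": promql})
    req = Request(f"{THANOS_TENANCY}/api/v1/query?{params}")
    req.add_header("Authorization", f"Bearer {token}")
    return req


req = build_tenancy_request(
    "ns1", "sum(rate(http_requests_total[2m])) by (namespace)", "example-token"
)
print(req.full_url)
```

Without the namespace parameter, the query has no tenant to be scoped to, which is why the data source needs the custom query parameter.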

Summary

While both options are valid, remember this about the Grafana Operator:

  • Currently, the URL parameter can only be set in Grafana directly. The operator will ignore it. A ticket in the project addresses this, but it is not yet implemented: https://github.com/integr8ly/grafana-operator/issues/309
  • The URL parameter setting will be gone when the Grafana pod is restarted, which might lead to problems.
  • While the Grafana serviceaccount does not require cluster permissions, it does require permission to view the appropriate namespace.
  • All of the above also means that you would need to create a new data source for every project you want to monitor. I was not able to find a way to send multiple namespaces in the URL parameter.

Is it useful to use the Grafana Operator at all? Probably yes, since operators are the future, and operators are actively developed. Nevertheless, it is always possible to deploy Grafana manually.