Today we're proud to announce the open sourcing of our monitoring scripts. The OpenShift Online Operations Team has published the OpenShift-Zabbix repository, which contains the scripts we use to monitor an OpenShift installation.

We use these scripts to monitor the OpenShift Online environment with Zabbix. They should also give OpenShift Enterprise and OpenShift Origin users a good starting point for monitoring their own OpenShift deployments.

Repository Structure

The OpenShift-Zabbix repository is structured as a standard Puppet module. We don't expect every consumer to use Puppet in their infrastructure; Puppet is simply the tool we use in the OpenShift Online environment, and the module format also serves as documentation for how the scripts are intended to be deployed. Users can consume the Puppet manifests as-is, or use them as a guide to integrate the scripts into their own configuration management infrastructure.

The check scripts themselves reside in the files/checks/ directory, and their supporting library files live in files/lib/. These files are expected to be deployed to a bin/ and lib/ directory, respectively (e.g. /usr/local/bin and /usr/local/lib).
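For illustration, a minimal Puppet sketch of that deployment might look like the following. The module name openshift_zabbix is a hypothetical placeholder; adjust the name and paths to match your own checkout.

    # Hypothetical sketch: copy the check scripts and their libraries from the
    # module's files/ directory into the conventional locations noted above.
    file { '/usr/local/bin':
      ensure  => directory,
      recurse => true,
      mode    => '0755',
      source  => 'puppet:///modules/openshift_zabbix/checks',
    }

    file { '/usr/local/lib':
      ensure  => directory,
      recurse => true,
      source  => 'puppet:///modules/openshift_zabbix/lib',
    }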

Also included in the repository is the files/xml/ directory, which contains XML template files that Zabbix uses to create items, triggers, and graphs. Importing these templates makes it quick to configure Zabbix to monitor the data points the check scripts report.

Finally, the manifests/ directory contains the configuration documentation in the form of Puppet code. The primary details found here are the package requirements for the scripts and the cron jobs that ultimately execute the check scripts to push data into the Zabbix server.
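As a rough sketch of that pattern (the package name below is a placeholder, not the module's actual requirement list):

    # Hypothetical example of the manifests/ pattern: install a dependency the
    # checks need, then execute one of the check scripts from cron.
    package { 'some-check-dependency':
      ensure => installed,
    }

    cron { 'check-accept-node':
      ensure  => present,
      command => '/usr/local/bin/check-accept-node',
      user    => 'root',
      minute  => '*/5',
    }

The manifests in the repository are the authoritative reference for both the real package list and the schedules.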

Design Decisions

The decision to use cron to execute the monitoring checks is a result of operational experiences gained by the OpenShift Online Operations Team. We found that Zabbix agent checks tend to have difficulty running at scale. As the number of items grows, the zabbix-agent and the zabbix-server's poller processes can struggle to collect data in a timely manner.

Our solution is to run all of our checks using the zabbix_sender command, which reads a file containing item data and pushes that data up to the Zabbix server. Some of our checks measure things with inherent and unpredictable latency, such as a ping through MCollective from the Broker to the Nodes. That made cron a reasonable choice, given the trade-offs between keeping checks fast and monitoring inherently variable components.
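To make that mechanism concrete, here is a hedged sketch of the cron-plus-zabbix_sender pattern spelled out by hand. The script name, schedule, server address, and temp file path are illustrative only, and the checks in the repository may well wrap these steps for you.

    # Hypothetical wiring: the check writes whitespace-delimited
    # "<host> <item key> <value>" lines to a temp file, then zabbix_sender
    # (-z names the Zabbix server, -i the input file) pushes them up.
    cron { 'check-mc-ping':
      command => '/usr/local/bin/check-mc-ping > /var/tmp/mc-ping.items && zabbix_sender -z zabbix.example.com -i /var/tmp/mc-ping.items',
      user    => 'root',
      minute  => '*/10',
    }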

Current Checks

The initial release includes five check scripts. Here is a brief description of what each one does.

  • check-accept-node
    • Runs the 'oo-accept-node' command, attempts some automated fixes to known bugs/problems, and then returns the command status.
  • check-activemq-stats
    • Collects statistics within the JVM running ActiveMQ for tracking common performance data.
  • check-district-capacity
    • Runs the 'oo-stats' command to collect data about capacity utilization of an OpenShift district. Reports available uids and gears within a district.
  • check-mc-ping
    • Runs 'mco ping', collates the results, and identifies nodes that are not responding in a timely fashion.
  • check-user-action-log
    • Scans /var/log/openshift/user_action.log, parsing event data from the log to provide insight into the health of the broker and general user experience interacting with the service.

Get started!

The OpenShift-Zabbix repository should contain everything you need to start monitoring your OpenShift deployment; the README.md documents how to get up and running. If you're interested in contributing and collaborating with us, fork the code and send us a pull request; more information is in the COLLABORATING.md file.

For us, this is a starting point for open sourcing more of our monitoring work over time. We are interested in providing examples and ideas about how to make the most of your OpenShift installation. In addition, we would like to encourage all operations teams running OpenShift to engage in a conversation about how we keep our infrastructure running. We are excited to help make running OpenShift in a production environment an even easier and better experience.

Next Steps

  • Try these scripts out in your environment and let us know what you think in the comments
  • Don't have a running OpenShift environment to try this out? Install OpenShift Origin with one shell command and you'll be able to do it in no time.
  • Watch the video above to learn monitoring best practices for OpenShift