Last year Tom Wilkie and I talked at KubeCon Barcelona about the future of Observability. One of our predictions was that correlation across signals would become more and more common. In this post I will demonstrate how we are leveraging those correlations every day within the production systems that the Red Hat Observability group operates, and how that helps us identify the right problems faster.
Our Stack
We use Prometheus, an open source monitoring for alerting on time-series. Our infrastructure runs on OpenShift, Red Hat’s distribution of Kubernetes. Kubernetes (and thus OpenShift) provides us with a simple API to read plaintext logs from our containers. To visualize and analyze latencies between and within individual parts of the infrastructure we use an open source distributed tracing tool Jaeger. And last but not least each of our processes exposes a pprof (the built-in profiles in the go runtime, but also available for rust, python, nodejs and java) endpoint for profiling, and we use Conprof to continuously collect profiles over time.
Debugging Time!
We recently rolled out a new version of Thanos in our infrastructure, and Prometheus quickly alerted us, about high latency.
Prometheus graph showing a steady increase in latency after a deployment.
We took advantage of the tooling we had at hand, to debug the problem, rather than rolling back to the previous version. This helped us pinpoint the exact problem and resolved the situation. This kind of situation could happen on any project, even in house product systems, where similar workflows could be applied.
We checked our slow query logs, of our component that first accepts the requests for queries, which immediately revealed a trace ID to us. Any query with latency higher than 5 seconds is written to the slow query log.
ts=2020-06-04T10:59:57.6170538Z component=query-reader msg="request finished" trace-id=23957fb8181c10d4
Log line from a slow query log, showing a trace ID from a slow query.
With this Trace ID, we could immediately pull up a trace in Jaeger of a problematic query.
Jaeger’s Trace detail page with detailed information expanded about the process, which created the span.
This was interesting because we could clearly tell that the overwhelming majority of the query was spent in the query engine, and even more importantly, we knew exactly at what time, and which process was crunching away.
We in addition to metrics, logs, and tracing continuously profile all processes we run with Conprof. With the information available from the trace, we could immediately query Conprof.
Conprof’s query interface showing many samples over time being collected.
And pull up a CPU profile from exactly that point in time, and lo and behold, we knew where the majority of CPU time was spent.
Graph representation of a CPU profile taken from Thanos, showing a clear dominant path.
Discovering this allowed us to cut ⅓ in query latency in Thanos.
Correlations Give Superpowers!
Only the combination of tools and correlating the information made this possible. The entire troubleshooting process took 2 minutes to find, for something that used to take up to days of searching for the needle in the haystack. While it is a bit of upfront effort to instrument applications with metrics, logging, and tracing, as this post hopefully shows, this effort pays off exponentially.
Ever since we have discovered this workflow it has been a powerful tool and has helped us uncover a number of performance improvements much more quickly than we used to.
Using these tools in this way almost felt like a superpower, it felt like cheating, because it was so easy. There is still a lot of work that can be done to integrate these signals more tightly, improve the user experience and automate the correlation aspects more, but what Tom and I predicted in the Keynote, is slowly becoming reality!
About the author
More like this
Browse by channel
Automation
The latest on IT automation that spans tech, teams, and environments
Artificial intelligence
Explore the platforms and partners building a faster path for AI
Open hybrid cloud
Explore how we build a more flexible future with hybrid cloud
Security
Explore how we reduce risks across environments and technologies
Edge computing
Updates on the solutions that simplify infrastructure at the edge
Infrastructure
Stay up to date on the world’s leading enterprise Linux platform
Applications
The latest on our solutions to the toughest application challenges
Original shows
Entertaining stories from the makers and leaders in enterprise tech
Products
- Red Hat Enterprise Linux
- Red Hat OpenShift
- Red Hat Ansible Automation Platform
- Cloud services
- See all products
Tools
- Training and certification
- My account
- Developer resources
- Customer support
- Red Hat value calculator
- Red Hat Ecosystem Catalog
- Find a partner
Try, buy, & sell
Communicate
About Red Hat
We’re the world’s leading provider of enterprise open source solutions—including Linux, cloud, container, and Kubernetes. We deliver hardened solutions that make it easier for enterprises to work across platforms and environments, from the core datacenter to the network edge.
Select a language
Red Hat legal and privacy links
- About Red Hat
- Jobs
- Events
- Locations
- Contact Red Hat
- Red Hat Blog
- Diversity, equity, and inclusion
- Cool Stuff Store
- Red Hat Summit