In the early morning (Eastern Daylight Time) of May 19, quay.io went down. This affected both quay.io customers and the open source projects that use quay.io as a build and distribution platform for their software. Red Hat is firmly committed to serving both of these audiences well with this service.
Red Hat’s site reliability engineering (SRE) team quickly engaged and worked diligently to stabilize the Quay service. During this time, clients were unable to push any new images and only sporadically could pull their images. Something was causing the quay.io database to lock up after scaling the service to full capacity.
During an incident like this the first question to ask is, “what’s changed?” We noticed that just prior to the incident the OpenShift Dedicated cluster on which quay.io runs had started an upgrade to version 4.3.19. Because quay.io runs on Red Hat OpenShift Dedicated (OSD), regular upgrades were a routine occurrence that had never presented any issues. In fact, in the preceding six months, several OSD upgrades were applied to Quay clusters without disruption to service. While we attempted to restore the service, a parallel effort began to spin up a new OSD cluster on the prior version so we could re-deploy if necessary.
Root cause analysis
The primary symptom of the outage was a storm of tens of thousands of database connections, effectively locking our MySQL instance. Because of this, it was difficult to diagnose the issue. We set a cap on the maximum client connections to allow our SRE team to evaluate the issue. We didn’t see any unusual database traffic; in fact, most of the queries were reads with a handful of attempted writes.
We also spent time trying to find a pattern in the database traffic that might cause such a connection storm. We did not find any patterns like this in the logs. While we waited for the new 4.3.18 OSD cluster to be completed, we continued to attempt to start up the quay.io pods. Each time we reached full capacity, the database would lock up. This meant having to restart our RDS instance as well as all of the quay.io pods.
By late afternoon, we had stabilized the service in read-only mode and disabled as many non-essential features as possible (e.g. namespace garbage collection) to reduce database load. While the lockups had stopped, we were still looking for the root cause. Our new OSD cluster was ready and we migrated the service, re-enabled write traffic, and continued to monitor.
Quay.io continued to run without incident on the new OSD cluster, so we continued to review the database logs but couldn’t find any correlation to explain the lockups. The OpenShift engineers worked with us to try to identify any changes in Red Hat OpenShift 4.3.19 that might be causing Quay problems. We did not find an issue and could not reproduce the issue in a lab environment.
On May 28, just before noon EDT, quay.io went down again with the same symptom: a locked database. All hands were on deck again, and we immediately focused on service restoration. This time, our original procedure of restarting RDS and bouncing the quay.io pods didn’t seem to help: another torrent of database connections was bringing us down. But why?
Quay is written in Python and each pod runs as a single monolithic container. Within the container runtime, there are many concurrent tasks going on at any one time. We use the gevent library under gunicorn to handle web requests. When a request comes into Quay (either through our own API or the Docker API), a gevent worker is assigned the request and that worker usually needs to talk to the database. After the first outage, we discovered that Quay was using the default database connection settings within our gevent workers. This meant that based on the large number of Quay pods and the thousands of incoming requests per second, it was theoretically possible that the number of database connections could overwhelm our MySQL instance. We knew from our monitoring that on average, Quay sustains about 5,000 requests per second and sees roughly that number of database connections on a constant basis. 5,000 connections were well within the limits of our RDS instance, but tens of thousands were not. Something was causing the connection count to suddenly spike but we couldn’t see a correlation against our incoming request traffic.
We remained focused on identifying and solving the issue, not simply restarting everything without a long term fix. We made a change to the Quay codebase to limit the database connections per gevent worker. The change made it a parameter in our config so we could easily change it without a new container image. To make sure we knew how many connections we could handle, we ran some tests in our staging environment with different values to see how it affected our load testing scripts. Eventually, we hit on 10,000 connections as the maximum before Quay began returning 502 errors to users. We redeployed this new version without delay and watched our graph of database connections. We knew that in the past we usually had a database lockup after about 20 minutes. After 30 minutes without incident, we were encouraged and an hour later we were convinced. We restored write traffic on the site and began our postmortem activities.
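The shape of that fix can be sketched in plain Python. This is not Quay’s actual code (Quay’s real connection handling lives in its own codebase and uses gevent workers); it is a minimal, illustrative stand-in showing how a config-driven cap bounds the number of simultaneous database connections a single worker can hold. The class and parameter names are hypothetical.

```python
import threading
from contextlib import contextmanager

class BoundedConnectionPool:
    """Illustrative pool: never holds more than max_connections at once."""

    def __init__(self, connect, max_connections):
        self._connect = connect          # factory that opens one connection
        self._sem = threading.BoundedSemaphore(max_connections)
        self._idle = []                  # connections available for reuse
        self._lock = threading.Lock()

    @contextmanager
    def connection(self):
        self._sem.acquire()              # blocks once the cap is reached
        try:
            with self._lock:
                conn = self._idle.pop() if self._idle else self._connect()
            yield conn
            with self._lock:
                self._idle.append(conn)  # return for reuse, don't reopen
        finally:
            self._sem.release()

# A cap read from config (value here is illustrative), so it can be
# tuned without building a new container image:
MAX_DB_CONNECTIONS_PER_WORKER = 10

pool = BoundedConnectionPool(lambda: object(), MAX_DB_CONNECTIONS_PER_WORKER)
with pool.connection() as conn:
    pass  # run queries against `conn` here
```

With a cap like this, a surge of incoming requests queues up waiting for a connection instead of stampeding the database, which is why the worst case became slow responses (502s past the tested limit) rather than a full lockup.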
While we knew that we had avoided the issue causing the lockups, we still didn’t have a root cause. We confirmed that the issue was not related to any changes in OpenShift 4.3.19, since the same issue occurred on version 4.3.18, which had previously run Quay without problems. Something else was clearly going on.
Quay.io had run with the default database connection settings for nearly six years without ever showing this behavior. What had changed? Certainly, traffic on quay.io had grown steadily over that period, but it felt like we were hitting a threshold that was triggering the connection storms. We continued to review the database logs from the second outage and saw no patterns or obvious connections.
In the meantime, our SRE team was improving their observability into Quay’s requests and overall health. New metrics were deployed as well as new dashboards that showed which parts of Quay customers were using.
Quay.io ran without incident until June 9. In the morning (EDT), we again observed a significant increase in database connections. We didn’t suffer any downtime (our new database connection parameter meant we couldn’t exceed MySQL’s connection capacity), but for about 30 minutes, quay.io was slow for many users. We quickly gathered as much data as we could from the added monitoring. Suddenly, a pattern emerged.
Just prior to the connection spike, there was a large number of requests coming into our App Registry API. App Registry is a lesser-known feature of quay.io which allows things like Helm charts and containers with rich metadata to be stored. While most quay.io customers don’t use this feature, Red Hat OpenShift is a large user. The OperatorHub within OpenShift uses App Registry to host all of the Operators. These Operators form the foundation of OpenShift’s workload ecosystem and the partner-focused technology “Day 2” operating model.
Every OpenShift 4 cluster uses Operators from the embedded OperatorHub to serve a catalog of available Operators to install and provide updates to already installed Operators. As OpenShift 4 adoption has increased, so has the number of clusters globally. Each one of those clusters needs to download Operator content to run the embedded OperatorHub, using the App Registry inside quay.io as a backend. The correlation we missed was that while OpenShift usage had been steadily increasing, this was putting more and more pressure on one of the less commonly used features of quay.io.
We analyzed the App Registry request traffic and traced it into the App Registry code. We immediately found several code paths that resulted in non-optimal queries against our database. Under light load they caused no trouble, but under heavy load we started to see how they could cause problems. Two App Registry endpoints were struggling under the heavier load: one that listed all packages in a repository, and one that returned all blobs for a package.
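The exact App Registry queries aren’t shown here, but the generic shape such endpoints often fall into is the classic “N+1” pattern: one query per package instead of a single join. The sketch below illustrates that pattern against an in-memory SQLite stand-in for the real MySQL schema; the table layout and data are invented for illustration.

```python
import sqlite3

# Invented stand-in schema: packages, each with some blobs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE package (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE blob (id INTEGER PRIMARY KEY, package_id INTEGER, digest TEXT);
    INSERT INTO package VALUES (1, 'etcd-operator'), (2, 'prometheus-operator');
    INSERT INTO blob VALUES (1, 1, 'sha256:aa'), (2, 1, 'sha256:bb'),
                            (3, 2, 'sha256:cc');
""")

def blobs_per_package_slow():
    # N+1 shape: one extra query per package. Harmless under light
    # load, expensive when thousands of clusters hit it at once.
    result = {}
    for pkg_id, name in conn.execute("SELECT id, name FROM package"):
        digests = [d for (d,) in conn.execute(
            "SELECT digest FROM blob WHERE package_id = ? ORDER BY id",
            (pkg_id,))]
        result[name] = digests
    return result

def blobs_per_package_fast():
    # One JOIN: the whole answer in a single round trip to the database.
    result = {}
    rows = conn.execute("""
        SELECT package.name, blob.digest
        FROM package JOIN blob ON blob.package_id = package.id
        ORDER BY blob.id
    """)
    for name, digest in rows:
        result.setdefault(name, []).append(digest)
    return result

assert blobs_per_package_slow() == blobs_per_package_fast()
```

Both functions return the same answer; the difference is the number of database round trips, which is exactly what multiplies under heavy concurrent load.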
For the next week, we made optimizations both in and around the App Registry code. We refactored obviously inefficient SQL queries, removed an out-of-process “tar” command that had been run on every blob retrieval, and added caching wherever possible. We ran extensive performance tests comparing App Registry performance before and after the changes. API calls that previously took half a minute now took milliseconds. The following week, we deployed the changes into production, and quay.io has been stable since. We have observed several sharp traffic spikes on the App Registry endpoints since then, but the improvements have prevented database lockups.
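The caching piece can be illustrated with a minimal time-based (TTL) cache. Quay’s actual caching layer is different; this hedged sketch only shows the idea of taking repeated reads, such as blob metadata lookups, off the database. All names here are hypothetical.

```python
import time
from functools import wraps

def ttl_cache(seconds):
    """Cache a function's results for `seconds`; illustrative only."""
    def decorator(fn):
        store = {}
        @wraps(fn)
        def wrapper(*args):
            now = time.monotonic()
            hit = store.get(args)
            if hit is not None and now - hit[1] < seconds:
                return hit[0]            # serve from cache, skip the DB
            value = fn(*args)
            store[args] = (value, now)
            return value
        return wrapper
    return decorator

calls = []                               # records how often we hit the "DB"

@ttl_cache(seconds=60)
def fetch_blob_metadata(digest):
    calls.append(digest)                 # stands in for a database query
    return {"digest": digest, "size": 1234}

fetch_blob_metadata("sha256:aa")
fetch_blob_metadata("sha256:aa")         # second call is a cache hit
```

For content-addressed data like blobs, where a digest never changes what it points to, even a short TTL removes the bulk of repeated identical queries during a traffic spike.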
What Did We Learn?
While no provider wants downtime, we believe that the recent outages have helped to make quay.io a better service. There are a few key learnings that are worth sharing:
- You can never have enough data about who and what are using your service. Since Quay “just worked”, we never needed to spend much time analyzing our traffic shape to handle the load. This created a false sense of security that the service would scale indefinitely.
- When the service is down, restoration is your top priority. Because Quay continued to experience database lockups during the first outage, our standard procedures weren't having the intended effect of service restoration. This led us to spend more time performing analysis and gathering data in the hope of finding the root cause, instead of putting every effort into getting our clients up and running again.
- Understand the impact of every one of your service’s features. App Registry was seldom used by our customers, so it wasn’t a major priority for our team. When a product has seldom-used features, bugs don’t get filed and developers stop looking at the code. It’s easy to assume that this puts no burden on the team, until suddenly it is part of a major incident.
The quest for a stable service is never finished and we have to continuously improve what we do. The amount of traffic on quay.io continues to grow and we know we need to earn our clients’ trust on a daily basis. To that end, we’re working on:
- Deploying read replica databases to help the service handle read traffic in the event that our main RDS instance has issues.
- Upgrading quay.io’s RDS instance. The current version is not itself the issue, but it was a red herring we spent time exploring during the outage; staying current will eliminate one more variable in a future investigation.
- Adding more caching across the cluster. We continue to identify areas where caching can help reduce load against our database.
- Adding a web application firewall to give us better visibility into exactly who is hitting quay.io and why.
- Red Hat OpenShift clusters, starting with the next release, will transition away from App Registry in favor of Operator Catalogs based on container images hosted on quay.io.
- The long-term replacement for App Registry is support for the Open Container Initiative (OCI) artifact specification, which is currently being implemented as a native feature of Quay and will be made available as soon as the specification is final.
All of the above are part of Red Hat’s continued investment in quay.io as we transition from a small ‘startup-style’ team to a mature SRE managed platform. We know that many clients depend on quay.io on a daily basis (including Red Hat!) and want to be as transparent as possible about these recent outages and what we’re doing to continuously get better over time.