Red Hat is the only company to offer self-managed and fully-managed Kubernetes on all major public clouds, which enables our customers to bring their applications to a secure, open hybrid cloud regardless of geographic or staffing restrictions.

Red Hat OpenShift Dedicated is a fully-managed OpenShift service on AWS and Google Cloud, operated and supported by Red Hat, backed by a 99.9% SLA and 24x7 Red Hat Premium Support. OpenShift Dedicated is managed by Red Hat Site Reliability Engineers (SRE), who have years of security and operational experience working with OpenShift in development and production. 
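To put the 99.9% figure in perspective, here is a rough sketch of how much downtime such a target leaves room for. It assumes availability is measured simply as uptime divided by total time over a fixed window. The actual SLA terms are not described in this post, and the function name and the 30-day and 365-day windows are illustrative choices, not Red Hat definitions.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: float) -> float:
    """Minutes of downtime an availability target permits over a period.

    Assumes availability = uptime / total time, which is a simplification.
    """
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Illustrative windows: a 30-day month and a 365-day year.
monthly = allowed_downtime_minutes(99.9, 30)
yearly = allowed_downtime_minutes(99.9, 365)

print(f"{monthly:.1f} minutes per 30-day month")
print(f"{yearly / 60:.2f} hours per 365-day year")
```

Under those assumptions the budget works out to roughly 43 minutes a month, which is why the basics described below have to run smoothly around the clock.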

The Red Hat Site Reliability Engineering (SRE) team’s goal is to build and scale OpenShift as a managed-service offering across multi- and hybrid-cloud platforms and enable a consistent experience for customers. Running any managed-service offering, especially one across multiple public cloud providers, is serious business requiring technical chops, business acumen, and a sense of accountability and urgency to drive change. To truly succeed, we have to understand and immerse ourselves in the following key areas:

  1. Each cloud vendor’s offerings, roadmaps, schedules, technology stacks, and costs  
  2. The Kubernetes community and Red Hat OpenShift product direction 
  3. Each customer’s use cases, including challenges they intend to overcome and desired future state 
  4. Operational expertise required to run a product as a service
  5. The impact of architectural changes to compliance and product security 

 

Accomplishing all of this in a well-packaged and cohesively managed offering like OpenShift Dedicated requires us, at a minimum, to be brilliant at the basics, and that includes securing, patching, monitoring, upgrading, and maintaining our entire fleet 24x7 so our customers can focus on their core value proposition. To be clear, each of the above is a distinct career track, so trying to do them all will and should feel like working in a fast-paced, bleeding-edge technical environment. In other words, it's like drinking from a perennial firehose, and the awesome people who thrive in this environment are those who make up the Red Hat SRE team.

 

Let's take a look behind the curtain to understand what it takes to bring our hosted services like OpenShift Dedicated to life and what is required to be an awesome SRE at Red Hat.

It Takes a Team

Realizing our hyperscale ambitions requires a village of awesome engineers who work, communicate, and interact with dignity, trust, and integrity. Only our combined effort will ensure that we are collectively a lot stronger, smarter, and a force to be reckoned with. The key is how you build and sustain teams and harness that collective energy towards something valuable like building a managed offering. This is a journey that we, as a team, are in together, and, as with any journey, we have to watch for pitfalls. One common challenge that many teams succumb to is the lack of psychological safety when the work environment becomes highly volatile, uncertain, complex, and ambiguous. This materializes in many different ways. Some teams, for example, get caught up in hero worship, a vicious cycle where teams become technically indebted to the knowledge and skills of a few individuals, or where a few individuals dominate conversations and force their technology preferences on others even when those preferences may not be the best choice. SRE teams are especially susceptible because they deal with a variety of challenges like competing priorities and hair-on-fire escalations. In many ways SREs are the leading indicators of the health of the overall organization. As SREs at Red Hat, we overcome these challenges with good hiring practices, organizational structure, internal processes, and most importantly, a strong culture of building sustainable teams.

The Observability Mindset

As SREs we endeavor to build systems and services that are:

  • Observable 
  • Reliable 
  • Scalable 
  • Secure 
  • Performant
  • And most importantly, boring  

Why boring? Because we believe that strong foundations enable innovation and business value to be realized, and, like basic utilities such as water and electricity, a strong foundation should just be humming along in the background. Feeding and caring for strong foundations requires data, lots and lots of it. We strive to make data-driven decisions by ensuring that our systems generate enough “digital exhaust” to help us proactively detect and self-heal defects, reduce mean-time-to-resolution (MTTR), and help prioritize fixes over features so our foundation continues to remain strong.
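As a small illustration of turning that digital exhaust into a number, here is a minimal sketch of an MTTR calculation. It assumes MTTR is the average time between when an incident is detected and when it is resolved. The function name and the sample incidents are made up for this example and do not come from any real Red Hat tooling or data.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - detected) over a list of incidents.

    Each incident is a (detected, resolved) pair of datetimes.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    durations = [resolved - detected for detected, resolved in incidents]
    # sum() needs a timedelta start value; the default start of 0 would fail.
    return sum(durations, timedelta()) / len(durations)


# Hypothetical incidents lasting 30, 60, and 90 minutes.
sample_incidents = [
    (datetime(2020, 1, 6, 9, 0), datetime(2020, 1, 6, 9, 30)),
    (datetime(2020, 1, 14, 22, 15), datetime(2020, 1, 14, 23, 15)),
    (datetime(2020, 1, 27, 3, 0), datetime(2020, 1, 27, 4, 30)),
]

mttr = mean_time_to_resolution(sample_incidents)
print(f"MTTR: {mttr}")
```

Tracking a figure like this over time is one way to tell whether fixes are actually making the foundation stronger, though a real pipeline would pull timestamps from alerting and incident systems rather than a hand-written list.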

Step one towards achieving an observability mindset is to ensure that everyone on the team builds and deploys code with quality in mind. That means they understand good code hygiene: why code is changing, whether unit and functional tests are in place to validate the use case, and how static code analysis tools detect anomalies. Once standards are established and engineers understand what they are trying to improve, why, and how to measure success or failure, communication channels open up to foster the key ingredient in observability: human interaction.

Continuous Learning Via Dogfooding

Engineering a product or an application is an area that has matured over the years and is fairly well understood. Running a product as a service, at scale, across multiple cloud providers with a guaranteed SLA is a relatively new space and, in my opinion, a lot more nuanced because of its disparate moving parts. What better way to improve the offering than dogfooding it ourselves? In addition to CI clusters, we deploy our own business-critical production applications like Red Hat Insights and cloud.redhat.com on OpenShift Dedicated. This pushes us to have a deeper understanding of our own offering, helps us identify bugs well before our customers do, and most importantly, encourages a learning mindset. A nice bonus is that it builds muscles in empathy and humility to acknowledge our own gaps, which lets us give and receive feedback without prejudice. This in many ways is vital for the success of any team and especially an SRE team like ours because we are both customer zero, where we are the first to build, deploy, and use our own products for enterprise-grade use cases, and first responders, where we interact with customers, partners, the community, and internal engineering teams to share our findings.

The Last Mile Challenge

Bringing people, processes, and technology together to enable a cohesive offering is what I call the Last Mile Challenge. This is where the most finger-pointing happens. This is where communication breaks down. This is really where things start to unravel. For the last mile to be successful, you need a team with a strong sense of ownership, accountability, and leadership along with a desire for continuous improvement to plough through challenges. This is exactly what we strive for every day as SREs at Red Hat, and if this sounds interesting, look up our job opportunities because we are hiring.

 

Keep Calm and SRE!