Red Hat is the only company to offer self-managed and fully-managed Kubernetes on all major public clouds, which enables our customers to bring their applications to a secure, open hybrid cloud regardless of geographic or staffing restrictions.
Red Hat OpenShift Dedicated is a fully-managed OpenShift service on AWS and Google Cloud, operated and supported by Red Hat, backed by a 99.9% SLA and 24x7 Red Hat Premium Support. OpenShift Dedicated is managed by Red Hat Site Reliability Engineers (SREs), who have years of security and operational experience working with OpenShift in development and production.
In the early days we managed OpenShift Dedicated like a traditional Operations team and suffered the same challenges every Ops team faces:
- One-off automation scripts
- Product development teams tossing code over the fence for Ops to run
- Lack of management commitment to invest in Ops
- Siloed knowledge
- Applying duct tape and baling wire on key infrastructure because failure was scorned
- Lack of a cohesive vision around hosted product offerings
- Fighting to keep the lights on instead of pushing new features
The team was a small, co-located group of engineers who built the initial managed offering from the ground up. Our goal of delivering a fully managed offering with a guaranteed SLA meant that tools for monitoring, managing, and lifecycling clusters had to be re-imagined, because Kubernetes was still a new technology in the market. Thanks to the team’s efforts, OpenShift Dedicated version 3 today stands on many key infrastructure components built by this team. However, our focused efforts also meant knowledge was siloed, and we knew it wasn’t sustainable or healthy long-term. As OpenShift Dedicated adoption continued to accelerate, changing our approach, be it our team processes, hiring practices, culture, or technical skills, became a necessity.
Adopting the SRE Methodology
About five years ago we realized that we could not scale unless we completely changed our approach, so we decided to go all-in on the Site Reliability Engineering (SRE) methodology because it was a proven model for our hyperscale ambitions. We learned that two things were critical for success:
- Executive support, because we were taking a big risk by changing what looked like a well-oiled machine from the outside.
- Organizational change management to ensure SREs had a seat at the table along with the product teams, and weren’t just an afterthought. This also helped ensure everyone had clear OKRs with quantifiable outcomes thanks to Service Level Indicators (SLI), Service Level Objectives (SLO), and error budgets, and didn’t revert to old habits.
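For readers new to the term, an error budget is simply the amount of unreliability an objective permits: one minus the target, applied over some time window. The sketch below is a minimal illustration, not our actual tooling. It assumes a 30-day window and reuses the 99.9% figure from the SLA as the objective, and the function names are hypothetical.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime an availability target allows over a window.

    `slo` is a fraction such as 0.999. Both the 30-day window used below
    and the function name are illustrative assumptions.
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return window * (1.0 - slo)


def budget_remaining(slo: float, window: timedelta, downtime: timedelta) -> timedelta:
    """Return the unspent budget; a negative value means it is exhausted."""
    return error_budget(slo, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)  # assumed measurement window
    budget = error_budget(0.999, window)
    # A 99.9% target over 30 days allows about 43.2 minutes of downtime.
    print(f"budget: {budget.total_seconds() / 60:.1f} minutes")
    left = budget_remaining(0.999, window, timedelta(minutes=10))
    print(f"left after a 10-minute outage: {left.total_seconds() / 60:.1f} minutes")
```

A budget expressed this way gives SREs and product teams one shared number to plan around: while budget remains there is room to ship changes, and once it is spent the priority shifts to reliability work.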
With both executive support and organizational change management in place, we set our sights on our own internal hygiene. However, we knew not everyone would or could make the jump from traditional Ops to SRE. Let’s review how we evolved our practice over the years.
Hiring for Potential
Hiring practices are an essential ingredient in building strong teams. Who we hire, how we hire them, and where we hire them influence the muscles teams have to build to achieve their goals.
Finding good SREs is akin to finding a needle in a haystack: it takes time and a lot of patience. Over the years, we have evolved our approach to look for software engineers with a systems engineering mindset or systems engineers with a development background to fill our roles. Timing does not always align with our hiring plans, so when we find a strong candidate we are ready to move quickly. To be clear, we are not looking for the perfect candidate. We are after potential over perfection, and will dig into this topic in more detail another time. In other words, we are willing to invest in the right candidates and help them realize their full potential as an SRE.
Building a Distributed Team
I spoke about what it took to bring OpenShift Dedicated to life at Red Hat in my previous blog. Realizing our ambitions meant we had to hire engineers with a high degree of cognitive flexibility, or simply put, a learning mindset. Not surprisingly, hiring a distributed team did not just make business sense but, in many ways, forced us to evolve and build a learning mindset because the team had to learn and adapt to different cultures, time zones, and norms. For us to win, it was critical that everyone on the team was happy doing what they love.
Let's take a look at a few drivers behind our geographic spread and how it helped elevate the team:
- We believe great talent can come from anywhere, so we focus on hiring individuals with a similar ethos, but we value diversity, be it different experiences, cultures, race, sex, or backgrounds. What better way to do this than to cast a wide net to attract the best and most qualified individuals?
- We are big believers in maintaining a good work-life balance and that means on-calls are off limits at night, unless of course there are extenuating circumstances. With a distributed team we are able to meet our business demands, spend quality time with our loved ones, and have a stronger localized relationship with our customers.
- We understand that individuals do their best work when they are “in the zone.” Getting in the zone is different for different individuals, as some like to work with music on, some are night owls, and some even like to have their TV running in the background. What ultimately matters is not how someone works, but the outcomes they help deliver. For example, we have individuals today who are based in the Eastern Time Zone but prefer very early morning hours, working as though they are based in Europe. A distributed team allows individuals to stay connected to their teams and be the best they can be.
Culture is and has been a key ingredient in being a vibrant SRE organization. The tone we set as leaders makes or breaks trust, cohesion, and cross-team collaboration. As SREs, we subscribe wholeheartedly to the blameless culture mentality. We also realize that culture is a living, breathing organism that evolves with the individuals we hire. To that end, we subscribe to a few simple norms that help us nurture our SRE culture:
- It's OK to fail. Let's first acknowledge that we all make mistakes; we are human. However, it is never about the size of the mistake or the number of mistakes we make, but rather how and what we learn from them so we never repeat the same mistake twice.
- Assume positive intent. Out of sight is not out of mind in a distributed world. Building empathy towards people we never see face to face but whose work affects our day-to-day is an important but hard skill to build. As SREs, we commit to habits like making sure meeting participants have their cameras on, and recording and sharing all our meetings. Being a global team, our normal routine includes video shift handoffs, meeting virtually on Fridays over food and a drink to “hang out,” or rotating our team meetings between APAC and EMEA time zones.
- Start with trust, and extend trust. Change management without trust, in my opinion, is impossible, and as SREs, change is our livelihood. Juggling multiple balls (monitoring, incident response, toil, capacity planning, feature development, customer engagements, and so on) is an occupational hazard for SREs. No single person is that good at multitasking, but as a team we can take on multiple things if and only if there is trust amongst team members. Failing within the safety net of your own team, being able to accept mistakes, and learning from them all require trust. Further, you need to trust the team to create and adapt the processes they work with every day. Let them be the experts.
- Disagree and commit. In my opinion, the open source principle “release early and often” is a corollary to “disagree and commit.” In other words, we cannot succeed or fail if we don’t experiment, and we can’t experiment if we don’t commit to a path even if there is some disagreement. The intent here is to learn in real time from our decisions and course correct. Put differently, analysis paralysis is death by a thousand paper cuts, and something every team should avoid.
- Communicate consistently and often. As customer zero, SREs run point on incidents and often run into critical bugs or performance issues. We then have the unenviable task of delivering clear, concise, data-driven updates with empathy because we are not always the bearers of good news. Soft skills are therefore vital to being a successful SRE, not just for internal team communication but for external engagements as well.
Where Are We Now?
A highly available service like OpenShift Dedicated also requires a highly available team. We are now in 14 countries across 4 continents, with 80% of the team working remotely. Being part of a distributed team has made us more human and improved our resilience to change, be it organizational, regional, or otherwise. It has made us more nimble and, more importantly, helped us connect with and help our customers when they need us the most. To be clear, we are still on the journey to being better SREs. We have had our fair share of skeptics and naysayers, but we let our work do the talking and are humble enough to acknowledge mistakes and learn from them to continue our march onward.