Red Hat is the only company to offer self-managed and fully-managed Kubernetes on all major public clouds, which enables our customers to bring their applications to a secure, open hybrid cloud regardless of geographic or staffing restrictions.

Red Hat OpenShift Dedicated (OSD) is a fully-managed OpenShift service on AWS and Google Cloud, operated and supported by Red Hat, backed by a 99.9% SLA and 24x7 Red Hat Premium Support. OpenShift Dedicated is managed by Red Hat Site Reliability Engineers (SRE), who have years of security and operational experience working with OpenShift in development and production.

As described in a previous blog post, the OSD operations team has been transformed from a traditional Ops team to an SRE team. As part of this journey, a number of Kubernetes operators have been implemented to help automate recurring tasks on the growing number of clusters we maintain. One of the newer additions to our automation is the GCP Project Operator, which is used to automate the creation of Google Cloud Platform (GCP) projects whenever a user provisions an OSD cluster on GCP. This Operator was necessary to enable OSD running on GCP, which became generally available in April 2020.

Developing an operator has become the default choice when it comes to automating tasks in a Kubernetes environment. It allows you to borrow many things from Kubernetes that you don't have to re-create on your own, like its API management, which can be accessed easily from the CLI and especially from inside Go code. As a result, you can give your code an easy-to-use API, in the form of a custom resource, which users already know how to interact with from using Kubernetes itself. Is an operator the right thing to implement for every application? Definitely not. The Kubernetes documentation also gives some pointers on whether creating a new custom resource and implementing an operator is a good practice.

In this post we describe some of the things we learned from the journey of creating and maintaining operators. Operators can be implemented in different languages, from Ansible to Go. Some of the points discussed in this post can be applied independently of the implementation details. However, as most of our SRE operators are implemented in Go, that is the language most of the practices presented here are based on.

1: Use the Operator SDK

When maintaining the code of a number of operators, as we do for Red Hat OpenShift Dedicated, we regularly have to contribute code to different operators, each implementing unique business logic. While a single developer may not be familiar with each and every operator we maintain, it's good to keep the surroundings familiar so developers know where to look for certain tasks. For example, every operator has a reconcile loop that is executed for every custom resource. There are many frameworks out there to abstract away this fact from developers, so they can focus on implementing the business logic.

We use the Operator SDK in all operators maintained by us so we don't have to learn a new framework every time we work on another operator. The Operator SDK can be used to generate all the boilerplate code and YAML manifests necessary to quickly kickstart a new operator along with its custom resources. The Operator SDK provides you with a Reconcile function for every custom resource you want to work on. This function is best compared to the main function of an ordinary program: it's the entry point to your code.
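As a sketch of that entry point: the Request and Result types below are simplified stand-ins for the reconcile.Request and reconcile.Result types that controller-runtime provides to the generated scaffolding, so this example is self-contained rather than a copy of real SDK output.

```go
package main

// Request and Result are simplified stand-ins for reconcile.Request and
// reconcile.Result from sigs.k8s.io/controller-runtime.
type Request struct{ Namespace, Name string }

type Result struct{ Requeue bool }

// Reconcile is the entry point the framework calls for every change to a
// watched custom resource -- comparable to main() in an ordinary program.
func Reconcile(req Request) (Result, error) {
	// Fetch the custom resource named by req and act on it here.
	return Result{}, nil // finished successfully, do not requeue
}
```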

2: Avoid Overstuffed Functions

The Reconcile function has multiple return values, which cause implicit behavior you need to understand when working with the Operator SDK. The behavior of the operator is controlled by these return values: depending on them, the request may be added to the reconcile loop again, delayed by a specified amount of time, or considered done. This easily results in the Reconcile function soaking up all the business logic of the operator.

Avoid the temptation to add additional logic to it just because that makes it easier to determine the output of the function. At minimum, you can create subroutines returning the same return values as the Reconcile function itself. When implementing a program from scratch, you would avoid putting all the business logic inside the main function; treat the Reconcile function the same way. Take care that you produce the correct return values, and work hard to split up the responsibilities.
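Under these assumptions (a stand-in Result type and hypothetical steps, not code from a real operator), Reconcile can stay a thin dispatcher over subroutines that return the same pair of values:

```go
package main

// Result is a stand-in for controller-runtime's reconcile.Result.
type Result struct{ Requeue bool }

// projectCR mimics the custom resource the operator acts on; the field
// names are illustrative.
type projectCR struct {
	ID         string
	Configured bool
}

// Each step returns the same (Result, error) pair as Reconcile, so the
// caller can hand the values straight back to the framework.
func ensureID(cr *projectCR) (Result, error) {
	if cr.ID == "" {
		cr.ID = "generated-id" // hypothetical ID generation
		return Result{Requeue: true}, nil
	}
	return Result{}, nil
}

func ensureConfigured(cr *projectCR) (Result, error) {
	cr.Configured = true
	return Result{}, nil
}

// Reconcile only dispatches; the business logic lives in the steps.
func Reconcile(cr *projectCR) (Result, error) {
	steps := []func(*projectCR) (Result, error){ensureID, ensureConfigured}
	for _, step := range steps {
		if res, err := step(cr); res.Requeue || err != nil {
			return res, err
		}
	}
	return Result{}, nil
}
```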

This also makes it easier to have testable code (see section 6, Test Your Code, below). An overstuffed function is inherently hard to test and results in large, unmaintainable tests. Finally, tests that are hard to write tend not to get written at all, so do your best to avoid that situation.

3: Idempotent Subroutines

So, what's a good way to split up the Reconcile function into subroutines? A good pattern is to have idempotent subroutines. Each of the subroutines performs one of the actions your program needs to take when a new custom resource appears or an existing custom resource changes.

In the example of the GCP Project Operator, we created subroutines to perform tasks like generating a new ID for the GCP project, configuring the project, and maintaining the status of the custom resource. Each of those subroutines is idempotent, checking if an action needs to be taken, and performing it when required. If that action has already been performed (for example, the project ID is already set on the custom resource), the corresponding subroutine just does nothing.
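A simplified sketch of one such subroutine (the field and function names are illustrative, not taken from the actual GCP Project Operator):

```go
package main

// status mirrors the part of the CR status where the project ID is
// stored; the field name is illustrative.
type status struct{ ProjectID string }

// ensureProjectID is idempotent: it only acts when the ID is missing,
// and reports whether it changed anything. Calling it again is a no-op.
func ensureProjectID(s *status, generate func() string) bool {
	if s.ProjectID != "" {
		return false // already set: nothing to do
	}
	s.ProjectID = generate()
	return true
}
```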

This allows us to create tests for every single action that will be performed on a custom resource, and keeps the Reconcile function clean and readable. 

4: One Custom Resource Modification at a Time

Each time the custom resource your controller is watching changes, the reconcile loop will run again. That includes changes done by a user but also changes you do in the Reconcile function or its subroutines. Often you need to update the custom resource you’re operating on to add information. An example from the GCP Project Operator is the ID of the project it created in GCP. This update will cause the reconcile loop to pick up the updated version of the custom resource and start another run of Reconcile.

You need to be aware of this, as changing the custom resource and continuing to process it can result in race conditions with the newly created request. If parallel processing is enabled, the new request immediately triggers another run of the Reconcile function. In this case, you must consider in every line of your code that a second request could be working on the same resource at the same time. Even if requests are not processed in parallel, reconcile requests will pile up if you update a custom resource over and over again, keeping the operator unnecessarily busy.

To lower the risk of race conditions and to avoid piling up requests, make sure you don't perform multiple changes to your custom resource or dependent actions in a single run of Reconcile. Whenever you update the custom resource you're watching, just exit the reconcile loop, and let the next run continue. All your idempotent functions executed before will do nothing, and you can continue where you left off.
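A sketch of this pattern, with a hypothetical custom resource and a stand-in for the API-server update:

```go
package main

// Result is a stand-in for controller-runtime's reconcile.Result.
type Result struct{}

type projectCR struct {
	ID         string
	Configured bool
}

// updateCR is a stand-in for writing the CR back to the API server;
// in a real operator this write triggers the next reconcile run.
func updateCR(cr *projectCR) error { return nil }

// Reconcile performs at most one modification of the CR per run and
// exits right after it. The idempotent checks of the next run skip the
// finished work and continue where we left off.
func Reconcile(cr *projectCR) (Result, error) {
	if cr.ID == "" {
		cr.ID = "proj-123" // hypothetical generated ID
		return Result{}, updateCR(cr)
	}
	if !cr.Configured {
		cr.Configured = true
		return Result{}, updateCR(cr)
	}
	return Result{}, nil // nothing left to do
}
```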

5: Wrap External Dependencies

External dependencies should be wrapped in every program you write. That doesn't exclude your operator code, even if you think it just automates some simple, trivial task. Believe me when I say it will evolve into something more complex, and at that point you will wish it had been handled as a piece of software from the beginning. Decoupling a grown codebase from a library that is used in many different packages is much harder than wrapping it right from the start, when the codebase is clean.

External dependencies may include external APIs you need to call. In the example of the GCP Project Operator, we use the GCP client to create projects and service accounts or to enable APIs. Wrapping this external dependency allows us to make it fit the design of the operator. It also improves the testability of the code, as we can now easily create mocks for the external dependency and avoid invoking the actual GCP client from inside the tests.
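A minimal sketch of such a wrapper, with illustrative method names that do not match the real Google Cloud Go API, plus a hand-rolled test double satisfying the same interface:

```go
package main

import "fmt"

// gcpClient captures just the operations the operator needs; the method
// names here are illustrative, not the real Google Cloud Go API.
type gcpClient interface {
	CreateProject(id string) error
	EnableAPI(project, api string) error
}

// fakeGCP is a hand-rolled test double that records calls instead of
// talking to Google Cloud.
type fakeGCP struct{ created []string }

func (f *fakeGCP) CreateProject(id string) error {
	f.created = append(f.created, id)
	return nil
}

func (f *fakeGCP) EnableAPI(project, api string) error { return nil }

// setupProject holds the business logic and depends only on the
// interface, so tests never reach the network.
func setupProject(c gcpClient, id string) error {
	if err := c.CreateProject(id); err != nil {
		return fmt.Errorf("creating project %s: %w", id, err)
	}
	return c.EnableAPI(id, "compute.googleapis.com")
}
```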

To some extent, you should also treat the code generated by the Operator SDK as an external dependency, even if it is under your control once generated. As mentioned above, the Reconcile function the Operator SDK provides is hard to test, but you can easily create a high-level component test if the Reconcile function interacts with your production code via an interface. All unit tests can then exercise your real production code. Ideally, the production code even lives in a different package, making your design independent of what the Operator SDK provides.

6: Test Your Code

You may have realized I mentioned testing many times in the previous sections. I won't explain why tests are important for your code, as there is good literature already out there. I just want to make sure you are aware that this important discipline also applies to your operator. Every Kubernetes Operator that is considered to be production ready deserves the same amount of care as any other piece of production software.

All of the aforementioned points make for good, testable code. But this does not just work in one direction: by writing tests early in development instead of postponing them to a future iteration, you will naturally arrive at most of those points. When you are the first user of your code, you automatically create usable code, because:

  • Mocking external APIs is easy if they are wrapped.
  • Idempotent subroutines can be tested independently.
  • Single changes to CRs can be tested easily.
  • Overstuffed functions are hard to test.
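These properties feed into each other. As a toy illustration (not a helper from any of our operators), the idempotency property itself can be verified with a generic check of a few lines:

```go
package main

import "fmt"

// checkIdempotent runs an action twice and verifies the second run
// reports no change -- a simple harness for the subroutine pattern
// described above (illustrative helper, not from the real operator).
func checkIdempotent(name string, action func() (changed bool)) error {
	action() // the first run may change state
	if action() {
		return fmt.Errorf("%s is not idempotent: second run changed state", name)
	}
	return nil
}
```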

7: Reconciling Return Values

Go allows functions to have multiple return values. As mentioned, the return values of our main function, Reconcile, influence the behavior of the operator, which is not obvious from reading the code. You can make this behavior visible by creating descriptive return values for the Reconcile function.

When using the raw return values of Reconcile, you will often write the following statement to tell the Operator SDK that the request was reconciled successfully:

return reconcile.Result{}, nil

When reading code that contains many such statements, combined with others that re-queue the request because an error has occurred, it’s hard to keep track of the operator’s business logic:

return reconcile.Result{}, err

Using helper functions can improve the readability of your code drastically. For example, consider the following two statements to wrap the reconciliation results mentioned above:

return DoNotRequeue() 
return RequeueWithError(err)

Even without knowing much about the Operator SDK's return values, you can understand what these statements will do.

Descriptive return values improve the readability of any function, but the Reconcile function benefits in particular, as its combination of return values carries so much implicit behavior.

In general, whenever you can avoid multiple return values, do it. An exception is the idiomatic second return value for errors that you find in many Go functions.

Summary

Many of the points we mentioned have been about readability as well as testability. To ensure maintainability across a number of operators, it's important to make it easy for members of the team to work on each and every operator. This can be achieved by taking care of the following two things in every operator.

1: Readable Code

In programming, most of the time is spent reading code: the code you wrote yesterday, the code someone else on the team wrote, or the code of an external dependency you are using. Help yourself and your contributors with this task and make your code easy to understand.

2: Confidence in the Test Suite

Changing the behavior of an operator can be dangerous, as operators often perform critical tasks. It's extremely important to have confidence in the test suite of your operator, so you can enhance and refactor it without fear of breaking anything in production.

These two qualities go hand in hand and apply to many different software projects, but they apply especially to Kubernetes operators handling critical parts of your infrastructure.