Artificial Intelligence and Machine Learning (AI/ML) have existed in various forms for over six decades. Yet AI/ML have historically failed to live up to the hype surrounding them, particularly when measured in business and monetary returns. While that corner has not yet been turned in a widespread sense, the past decade has seen an explosion in AI/ML adoption, and certain principles and patterns have emerged that determine successful business outcomes.
So what has led to the explosion of AI/ML over the past decade? There are myriad reasons, but McKinsey cites the convergence of three factors as contributing to recent progress:
- Big Data
The emergence of Big Data has resulted in massive amounts of information being generated and available to train AI algorithms. As I mentioned in a previous blog, Unlocking AI/ML Business Value with Kubernetes - Part 2: Data, significantly more data are being produced nowadays than ever before, with IDC reporting that 90% of the data in the world has been generated in the past two years.
- Algorithm Advancement
Machine-learning algorithms have advanced, especially in supervised learning and in deep learning based on neural networks. Their implementations have progressed as well: techniques such as parallelism and matrix and broadcast operations allow the same algorithms to be optimized for speed. These have paved the way for explosive improvements in speed, given the right hardware.
- Hardware (GPU) Acceleration
Over the last decade, exponentially more computing power has become available for training much more complex AI/ML models. At the forefront have been developments in graphics processing units (GPUs) from innovative vendors like Nvidia. These hardware innovations have provided outlets for capitalizing on the algorithm advancements just mentioned.
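To make the matrix and broadcast operations mentioned under Algorithm Advancement more concrete, here is a minimal, dependency-free sketch in Python. The function names and the tiny example values are illustrative assumptions, not taken from any particular library. Array libraries such as NumPy express the same idea as a single operation over whole arrays, and it is this kind of bulk operation that parallel hardware can accelerate.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows.

    Every output cell is an independent dot product, which is why this
    operation parallelizes so well across many cores.
    """
    cols_b = list(zip(*b))  # columns of b
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def broadcast_add(matrix, row):
    """Add the same row vector to every row of a matrix ("broadcasting")."""
    if any(len(r) != len(row) for r in matrix):
        raise ValueError("row length must match the matrix width")
    return [[a + b for a, b in zip(r, row)] for r in matrix]


# Illustrative values only: two samples, two features, one linear layer.
features = [[1.0, 2.0], [3.0, 4.0]]
weights = [[0.5, 0.0], [0.0, 2.0]]
bias = [10.0, 20.0]

layer_output = broadcast_add(matmul(features, weights), bias)
print(layer_output)  # [[10.5, 24.0], [11.5, 28.0]]
```

The loops here run one element at a time. The point of the sketch is only that none of the per-cell computations depend on one another, so hardware with thousands of cores can perform them simultaneously.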
The convergence of these and other factors over the past decade has led to a Big Bang in the potential realizable value from AI/ML. The focus of this blog, part of a series on how Kubernetes can address AI/ML business challenges, is the value that GPUs combined with Kubernetes can bring.
Why Use GPUs?
Now that we have all of this valuable data and improved algorithms, let’s dig deeper into the case for using expensive GPUs, rather than traditional and less costly CPUs, to process that data.
Moore’s Law, and its gradual breakdown, is part of the answer. In 1965, Intel co-founder Gordon Moore predicted that the number of components on a chip would keep doubling at a steady cadence, a rate later pegged at roughly every two years and popularly equated with a doubling of processor performance. Moore’s Law held true for many decades. But as chip components approach the size of individual atoms, it is becoming more and more difficult to keep up that pace. Nowadays the rate of improvement has decreased to approximately 10% per year. Some, including Nvidia CEO Jensen Huang, go so far as to proclaim that Moore’s Law is dead. While there are those who strongly contest its demise, its inarguable decline does point to the demand for other solutions.
As CPU advancement has slowed down, there has been an explosion in GPU power, particularly in parallel processing, which suits modern Deep Learning techniques. Today, GPUs routinely possess thousands of cores and can process highly parallel workloads orders of magnitude faster than CPUs. This makes training models on huge datasets an economically feasible proposition. For example, one deep learning model for cybersecurity malware detection that used to take about 100 days to train now takes less than a day.
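The gap between these growth rates is easier to appreciate with some quick arithmetic. The sketch below uses only the figures quoted above (a doubling every two years, roughly 10% per year today, and training time falling from about 100 days to under a day). The ten-year horizon and the variable names are assumptions chosen for illustration.

```python
def growth_over(years, annual_rate):
    """Cumulative improvement factor after compounding an annual rate."""
    return (1 + annual_rate) ** years


# Doubling every two years is equivalent to about 41% improvement per year.
doubling_annual_rate = 2 ** (1 / 2) - 1

# Improvement over an assumed ten-year horizon under each regime.
decade_at_moores_pace = 2 ** (10 / 2)           # five doublings = 32x
decade_at_ten_percent = growth_over(10, 0.10)   # about 2.6x

# "About 100 days" down to "less than a day" implies at least a 100x speedup.
training_speedup_floor = 100 / 1

print(f"Equivalent annual rate for Moore's Law: {doubling_annual_rate:.1%}")
print(f"Ten years at Moore's pace: {decade_at_moores_pace:.0f}x")
print(f"Ten years at 10% per year: {decade_at_ten_percent:.1f}x")
print(f"Training speedup in the malware example: at least {training_speedup_floor:.0f}x")
```

Under these assumptions, a decade at today's rate yields less than a tenth of the improvement that Moore's Law once delivered, which is the gap that GPU acceleration helps to fill.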
GPU computing is set to continue the momentum in hardware and processor acceleration as Moore’s Law is increasingly challenged.
Image courtesy of Nvidia
Potential Challenges in Adopting GPU-Accelerated Hardware
With such a compelling case for GPUs, why not just add GPU-enabled nodes to your Kubernetes cluster? There are two potentially prohibitive challenges to the widespread adoption of GPUs:
- Configuration Complexity
Kubernetes allows specialized hardware resources, such as NVIDIA GPUs and other devices, to be added to a cluster. However, configuring and managing nodes with these resources requires setting up multiple software components, such as drivers, container runtimes, and other libraries, a process that is difficult and error-prone.
- Availability and Cost
Demand for GPU-accelerated chips is far outweighing supply. This has led to availability shortages and escalating costs, particularly on public clouds.
How Can Kubernetes and OpenShift Address Challenges to GPU Adoption?
Enterprise Kubernetes platforms such as Red Hat OpenShift leverage a capability known as Operators and the Operator Framework.
Operators provide a method of packaging, deploying, and managing Kubernetes applications, simplifying the installation and ongoing maintenance of applications they manage. The NVIDIA GPU Operator uses the operator framework within Kubernetes to automate the management of all NVIDIA software components needed to provision GPUs on Kubernetes clusters. The NVIDIA GPU Operator makes adding GPU resources to OpenShift a straightforward proposition for cluster administrators.
The NVIDIA GPU Operator runs on the Kubernetes cluster and greatly simplifies the provisioning, maintenance, and consumption of GPU-enabling software and hardware.
The second GPU-related challenge we raised is the dual issue of cost and availability. High demand and limited supply place a premium price on GPUs, especially on public clouds.
OpenShift enables enterprises to optimize GPU costs in a number of ways:
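Once the GPU software stack is in place on a node, a workload asks for a GPU the same way it asks for any other resource. The sketch below builds a minimal Pod manifest in Python and prints it as JSON. It assumes the cluster advertises GPUs under the `nvidia.com/gpu` resource name, which is what NVIDIA's Kubernetes device plugin registers. The pod name, image reference, and helper function are hypothetical placeholders.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Build a minimal Pod manifest that requests whole GPUs.

    Assumes GPUs are exposed as the extended resource "nvidia.com/gpu".
    """
    if not isinstance(gpus, int) or gpus < 1:
        raise ValueError("gpus must be a positive integer")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources such as GPUs are requested via limits.
                    "resources": {"limits": {"nvidia.com/gpu": gpus}},
                }
            ],
        },
    }


# Hypothetical pod name and image, for illustration only.
manifest = gpu_pod_manifest("train-model", "example.com/team/trainer:latest", gpus=1)
print(json.dumps(manifest, indent=2))
```

Saved to a file, the printed JSON could be submitted with `kubectl apply -f` or `oc apply -f`, both of which accept JSON as well as YAML. The scheduler then places the pod only on a node with a free GPU.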
- Utilization efficiency
Kubernetes clusters tend to be consumed by many teams, which facilitates sharing of cluster-based GPUs across teams. These shared resources reduce the likelihood that valuable GPU-accelerated hardware will sit idle, since teams can coordinate their shared model training and GPU usage schedules.
Furthermore, when a particular data scientist or team has completed GPU-consuming model training, those GPUs are returned to the cluster’s availability pool and are immediately consumable by other data scientists and teams.
- Infrastructure flexibility
This diagram depicts optimal GPU consumption on public cloud versus an organization’s own GPUs in the data center, according to volume of usage:
Occasional usage clearly favors the public cloud, because at low levels of usage paying by the hour is more cost-effective.
However, as GPU utilization increases, it becomes more viable for organizations to purchase their own GPU hardware. As you pass the intersection in the graph, cumulative hourly costs on public clouds start to exceed the cost of owning and operating one's own hardware.
How does this impact consumption patterns? Platforms such as OpenShift provide an identical user experience across the data center, the edge, and the public cloud. This allows a seamless transition from one infrastructure to another, for example from the public cloud to the data center, whenever that move makes economic sense.
In summary, there are many ways organizations can expect to realize business value by using GPU acceleration on OpenShift, including:
- The NVIDIA GPU Operator lowers the technical barriers to GPU use and therefore speeds up time-to-value with GPUs for training and inference.
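The intersection between renting and owning described above can be estimated with a simple break-even calculation. The following is a rough sketch under stated assumptions: every price in it is a hypothetical placeholder rather than a quoted figure, and the model ignores factors such as hardware depreciation, discounts, and staffing.

```python
import math


def break_even_hours(purchase_cost, owned_hourly_cost, cloud_hourly_rate):
    """GPU-hours after which owning becomes cheaper than renting by the hour."""
    if cloud_hourly_rate <= owned_hourly_cost:
        return math.inf  # renting never becomes the more expensive option
    return purchase_cost / (cloud_hourly_rate - owned_hourly_cost)


def cheaper_option(hours, purchase_cost, owned_hourly_cost, cloud_hourly_rate):
    """Return "cloud" or "owned" for a given number of GPU-hours."""
    cloud_total = hours * cloud_hourly_rate
    owned_total = purchase_cost + hours * owned_hourly_cost
    return "cloud" if cloud_total <= owned_total else "owned"


# Hypothetical prices, for illustration only.
PURCHASE_COST = 30_000.0   # up-front cost of one GPU server
OWNED_HOURLY = 0.50        # power, cooling, and operations per GPU-hour
CLOUD_HOURLY = 3.00        # on-demand price per GPU-hour

hours = break_even_hours(PURCHASE_COST, OWNED_HOURLY, CLOUD_HOURLY)
print(f"Break-even at {hours:,.0f} GPU-hours")  # Break-even at 12,000 GPU-hours
print(cheaper_option(1_000, PURCHASE_COST, OWNED_HOURLY, CLOUD_HOURLY))   # cloud
print(cheaper_option(20_000, PURCHASE_COST, OWNED_HOURLY, CLOUD_HOURLY))  # owned
```

With these placeholder numbers, a team using fewer than 12,000 GPU-hours is better off renting, while heavier usage favors owned hardware. Substituting real quotes moves the intersection but not the shape of the decision.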
- OpenShift provides a highly efficient architecture that enables scarce GPU resources to be shared across teams and achieve a high level of utilization.
- The common experience across infrastructures allows a seamless usage transition from the public cloud to one’s own infrastructure when that becomes economically advantageous.
Follow the reference link below for how Volkswagen used OpenShift and GPUs to speed up autonomous vehicle testing.
The ever-increasing need to capitalize on Big Data and the simultaneous advancements in AI/ML, including Deep Learning, have coincided with huge advancements in GPU-accelerated hardware. With the relative decline in the capability of CPUs to train data-hungry models, the case for GPUs for model training is becoming increasingly compelling. With recent developments such as the NVIDIA GPU Operator, the ease of GPU consumption is yet another reason for utilizing a Kubernetes container platform such as Red Hat OpenShift for AI/ML workloads and workflows.