In recent years, Artificial Intelligence (AI) and Machine Learning (ML) have surged in importance. This rise can be attributed to a massive influx of data, enhanced computational capabilities, and significant advancements in algorithms. These changes have empowered various industries and society at large, resulting in cost-effective production and services, heightened availability, and increased adaptability across diverse scenarios and environments.
Yet, AI and ML represent just the tip of the technological iceberg. For such complex processes to function optimally, the underlying infrastructure must be robust enough to meet immense capacity demands and stringent requirements. Canonical is proud to announce integration and support for the NVIDIA GPU Operator and Network Operator in Canonical Kubernetes. With these features, organisations can implement AI/ML at scale more effectively. This article delves into how Canonical and NVIDIA have collaborated to craft an infrastructure tailored for AI/ML operations.
Kubernetes by Canonical delivered on NVIDIA DGX systems
Kubernetes plays a pivotal role in orchestrating AI/ML applications. Canonical offers a Kubernetes distribution tailored for distinct use cases in data centres and at the edge. It is optimised to enhance AI/ML performance, incorporating features that amplify processing power and reduce latency, developed and integrated in close collaboration with NVIDIA.
NVIDIA DGX systems are purpose-built to meet the demands of enterprise AI and data science, delivering the fastest start in AI development, effortless productivity, and revolutionary performance: insights in hours instead of months. The system is designed to maximise AI throughput, providing enterprises with a highly refined, systematised, and scalable platform. Certified as part of the NVIDIA DGX-Ready Software program, Canonical’s Kubernetes solutions improve AI workflows and the utilisation of AI infrastructure on DGX systems.
Canonical Kubernetes supports GPU management and optimisation
GPU acceleration enhances processing power by offloading tasks from the CPU to the GPU. Canonical Kubernetes uses the NVIDIA GPU Operator to enable workload data processing on NVIDIA Tensor Core GPUs. The GPU Operator provides a fast and efficient way to use NVIDIA host drivers, can load kernel drivers dynamically at runtime, and automatically configures other important components such as the container toolkit and the device plug-in.
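Once the GPU Operator has installed the driver, container toolkit, and device plug-in, workloads request GPUs through the standard `nvidia.com/gpu` extended resource. As a minimal sketch (the pod name and container image are illustrative, not from the original article):

```yaml
# Example pod requesting one GPU via the device plug-in
# that the NVIDIA GPU Operator deploys on each GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-example          # illustrative name
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-workload
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # example image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # schedule onto a node with a free GPU
```

The scheduler places this pod only on a node advertising an unallocated `nvidia.com/gpu` resource, so no node selectors are needed for the common case.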
Canonical Kubernetes integrates seamlessly with GPU-enabled instances through ‘GPU workers’ tailored for applications that leverage them. The deployment automatically recognises NVIDIA hardware and activates the necessary support.
Canonical Kubernetes support for NVIDIA Multi-Instance GPU
To ensure a guaranteed Quality of Service (QoS) for a workload through resource allocation, NVIDIA Multi-Instance GPU (MIG) technology is indispensable. MIG enhances the efficiency and value of NVIDIA H100, A100, and A30 Tensor Core GPUs. With the capability to divide a GPU into up to seven distinct instances, each is entirely isolated, boasting its own high-bandwidth memory, cache, and compute cores. This not only ensures a consistent QoS for served workloads but also democratises access to accelerated computing resources for all users. There are many different ways to optimise the utilisation of NVIDIA GPUs, and more information can be found in the NVIDIA Multi-Instance GPU User Guide. Multi-Instance GPU is available in Canonical Kubernetes via the GPU add-on.
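Under MIG, each GPU slice is exposed to Kubernetes as its own schedulable resource. As a sketch, assuming an A100 partitioned with the GPU Operator's "mixed" MIG strategy (the pod name, image tag, and chosen `1g.5gb` profile are illustrative assumptions):

```yaml
# Example pod requesting a single 1g.5gb MIG slice instead of a full GPU.
# Resource names of the form nvidia.com/mig-<profile> are advertised when
# the GPU Operator runs with the mixed MIG strategy.
apiVersion: v1
kind: Pod
metadata:
  name: mig-example           # illustrative name
spec:
  restartPolicy: Never
  containers:
    - name: workload
      image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04  # example image tag
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one isolated GPU slice with its own memory and compute
```

Because each slice has dedicated memory, cache, and compute cores, several such pods can share one physical GPU without contending for resources, which is what delivers the guaranteed QoS described above.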
Canonical Kubernetes integration with NVIDIA Network Operator
Modern AI workloads, operating at data centre scale and exhibiting unique characteristics, require high-speed, low-latency network connectivity between GPU servers to run efficiently. Analogous to the GPU Operator, the NVIDIA Network Operator brings a notable edge to AI/ML network operations by streamlining the deployment, management, and scaling of GPU-accelerated network functions within Kubernetes. It automates complex network deployment and configuration tasks that would otherwise require manual work, ensuring consistent and error-free rollouts, and loads the required drivers, libraries, device plugins, and CNIs on any cluster node with an NVIDIA network interface.
Paired with the GPU Operator, the Network Operator enables GPUDirect RDMA (GDR), a key technology that accelerates cloud-native AI workloads by orders of magnitude. GDR optimises network performance by enhancing data throughput and reducing latency. Another distinctive advantage is its seamless compatibility with NVIDIA’s ecosystem, ensuring a cohesive experience for users. Furthermore, its design, tailored for Kubernetes, ensures scalability and adaptability in various deployment scenarios. This all leads to more efficient networking operations, making it an invaluable tool for businesses aiming to harness the power of GPU-accelerated networking in their Kubernetes environments.
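A workload that uses GPUDirect RDMA requests both a GPU and an RDMA device in its pod spec. The sketch below assumes the Network Operator's RDMA shared device plug-in is deployed; the resource name `rdma/rdma_shared_device_a`, the pod name, and the image tag are illustrative, since the actual RDMA resource name is whatever the cluster's NicClusterPolicy configures:

```yaml
# Sketch: pod requesting a GPU plus an RDMA device so the two operators
# together can provide a GPUDirect RDMA data path. Names are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: gdr-example            # illustrative name
spec:
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:23.10-py3  # example image tag
      securityContext:
        capabilities:
          add: ["IPC_LOCK"]    # commonly required for RDMA memory registration
      resources:
        limits:
          nvidia.com/gpu: 1
          rdma/rdma_shared_device_a: 1  # name set by the NicClusterPolicy
```

With both resources granted on the same node, the NIC can read and write GPU memory directly, bypassing the host CPU and system memory on the data path.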
AI/ML developers can take full advantage of the Canonical and NVIDIA offerings on NVIDIA DGX systems, knowing that the software is fully validated as part of the NVIDIA DGX-Ready Software program. Accelerated computing offloads tasks from the CPU to the GPU, creating GPU-optimised instances for AI/ML applications that reduce latency and process more data, freeing capacity for larger-scale applications and more complex model deployments.
Get started with Canonical Kubernetes and NVIDIA technology
- GPU acceleration, using GPU workers
- NVIDIA GPU Operator and MIG integration
- Solution brief: Kubernetes by Canonical delivered on NVIDIA DGX systems