
robgibbon
on 15 October 2024


Apache Spark is a popular framework for developing distributed, parallel data processing applications. Our solution for Apache Spark on Kubernetes has made significant progress in the year since we launched it, adding support for Apache Iceberg, a new GPU-accelerated image using the NVIDIA Spark-RAPIDS plugin, and support for the Volcano Kubernetes workload scheduler.

A data warehouse on a cloud-native data lake with Apache Kyuubi

We’ve also been busy adding initial support for Apache Kyuubi to our Charmed Spark solution, so that you can deploy an enterprise-grade, fault-tolerant, ANSI-SQL-compliant data warehouse on your Kubernetes data lake infrastructure, building a so-called ‘lakehouse’. You can deploy a comprehensive, hyper-automated data lake infrastructure using our all-open-source control plane, software-defined storage and cloud-native compute infrastructure solutions. We’ve even built a couple of runbooks that should get you started in both cloud and on-premises contexts.
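Once a Kyuubi endpoint is up, any HiveServer2-compatible client can talk to it over JDBC. As a sketch with beeline (the host name, user and table below are placeholders; 10009 is Kyuubi’s default frontend port):

```shell
beeline -u 'jdbc:hive2://kyuubi.example.com:10009/' -n spark-user \
  -e 'SELECT count(*) FROM my_lakehouse_table;'
```

Behind that single JDBC connection, Kyuubi takes care of launching and multiplexing the Spark engines that actually run the query.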

There are many benefits to adopting the cloud-native approach to building a data lake:

  • Disaggregated storage means you can scale and manage your storage tier independently of your compute tier, shut down or scale down your compute tier when not being used, and take advantage of cost-optimized object storage systems for hosting your big data.
  • Using cloud-native technologies for the compute tier (i.e. Kubernetes) ensures a high level of portability between infrastructure providers, so you need not be locked into any one cloud service provider or data center systems vendor.
  • Using a cloud-native approach means you can build clusters both on the cloud and in your on-premises facility, while using the same operational management approach consistently.
  • You can potentially “bin pack” other applications onto the same cloud-native platform as your data lake infrastructure, for more efficient resource utilization, if you so wish.
  • Do you want to use GPUs to accelerate your Spark applications? Kubernetes has excellent support for exposing GPUs to Spark and can greatly simplify setup of this useful acceleration feature.
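On that last point, a sketch of what the setup can look like: with the Spark-RAPIDS plugin on Kubernetes, GPU acceleration is largely a matter of a few spark-defaults.conf entries (the values are illustrative, and the discovery script path depends on your container image):

```
# Load the NVIDIA Spark-RAPIDS SQL plugin
spark.plugins=com.nvidia.spark.SQLPlugin
# Request one NVIDIA GPU per executor from Kubernetes
spark.executor.resource.gpu.amount=1
spark.executor.resource.gpu.vendor=nvidia.com
# Script that reports which GPU addresses an executor may use
spark.executor.resource.gpu.discoveryScript=/opt/getGpusResources.sh
```

Kubernetes schedules the executor pods onto GPU-equipped nodes and exposes the devices to them, so no host-level driver wrangling is needed in the application itself.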

While we do have some work to do until our Kyuubi integration is fully ready for business, you can already try it out – see our docs for the lowdown.

Spark 4.0 beta – the new features of tomorrow’s Spark, today

Another thing I’ve been itching to announce is our new Spark 4 beta image. This new beta image joins our collection of Spark 3 images – and whilst the beta image isn’t eligible for official support from Canonical, it gives you an easy way to try out the latest upstream Apache Spark 4 beta features today!

Some of the new features of Spark 4 include:

  • The new Spark Connect API simplifies writing applications that connect to a remote Spark cluster, and includes support for the Python, Java, Scala, Golang and Rust languages.
  • ANSI SQL is enabled by default.
  • A new Python-based data source API simplifies creating data connectors for Spark using Python. This opens up connector development to software engineers who don’t wish to learn or use the Scala language.
  • Python based UDTFs (User-defined Table Functions) enable users to create custom functions that they can use in queries, similar to a UDF in a more traditional database management system.
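To give a feel for that last item: in PySpark 4, a Python UDTF is a class whose eval() method yields one tuple per output row, so a single input row can fan out into many. The sketch below shows just that shape in plain Python, with no Spark cluster involved; on a real cluster you would decorate the class with pyspark’s @udtf decorator (giving it a return type such as "word: string, length: int") and register it for use in SQL.

```python
# Plain-Python sketch of the PySpark 4 UDTF shape: a class whose eval()
# method yields one tuple per output row. No Spark involved here; with
# pyspark installed you would apply the @udtf decorator and register the
# function so it can be called in SQL queries.
class SplitWords:
    def eval(self, text: str):
        # One input row fans out into one output row per word.
        for word in text.split():
            yield (word, len(word))

# Simulate the engine invoking the UDTF for a single input row:
rows = list(SplitWords().eval("hello spark world"))
# rows is [("hello", 5), ("spark", 5), ("world", 5)]
```

In a query, the registered function would then appear in the FROM clause, producing a small table from each input row.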

There are some cool new things there, well suited to advanced data management at serious scale. If you’d like to take them for a spin, head over to our user docs to learn how to quickly set up our Charmed Spark solution for Apache Spark on Kubernetes.

Preview today using Charmed Spark and our Spark container image

You can freely access our Apache Spark 4 beta container image in the GitHub Container Registry right here:

https://github.com/canonical/charmed-spark-rock/pkgs/container/charmed-spark/280005099?tag=4.0-22.04_edge
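Assuming the tag shown in that URL, pulling the image with Docker should look like this (the registry path is inferred from the package page and may change as the beta evolves):

```shell
docker pull ghcr.io/canonical/charmed-spark:4.0-22.04_edge
```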

If you’d like to learn more about getting enterprise-grade support for Apache Spark from Canonical, contact us and we’ll be happy to jump on a call with you to discuss further, or you can browse our Charmed Spark product page if you prefer.
