

robgibbon
on 17 October 2023

Why we built a Spark solution for Kubernetes


We’re super excited to announce that we have shipped the first release of our solution for big data – Charmed Spark. Charmed Spark packages a supported distribution of Apache Spark and optimises it for deployment to Kubernetes, which is where most of the industry is moving these days.

Reimagining how to work with big data

Having the opportunity to rethink how big data is processed meant that we could challenge the status quo based on the more traditional Hadoop YARN stack. And with our Charmed Kubernetes and MicroK8s systems, we have a greatly simplified yet powerful family of cluster managers that enable full-stack deployment of big data clusters and a consistent user experience across local, on-premises and cloud environments. Of course, you’re not limited to our Kubernetes distributions – you can use Charmed Spark on any other conformant Kubernetes distribution, for example AWS EKS.

And reimagining big data storage too

For storage, we chose S3 API-compliant Ceph instead of the Hadoop HDFS storage system, although the solution is designed to work with most S3-compatible scale-out storage solutions. HDFS has well-known limitations, such as the NameNode holding the entire inode map of the big data filesystem in its Java heap, and the NameNode’s active/passive failover architecture. We opted to sidestep these and adopt more contemporary object storage solutions as the preferred backing tier for our solution. With modern, high-capacity networking (for example, 100GbE and above), bits can typically be shifted to and from the Spark cluster faster than the cluster can process them, so the HDFS design paradigm of bringing the compute to the data makes less sense nowadays. Of course, users can still connect Charmed Spark to HDFS if they so wish.
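As a sketch of what this looks like in practice, Spark reaches S3 API-compatible stores such as Ceph RADOS Gateway through the Hadoop S3A connector and `s3a://` paths. The endpoint, bucket and credential variables below are placeholders for illustration, not values from the Charmed Spark docs:

```shell
# Point Spark at an S3 API-compatible object store (e.g. Ceph RGW).
# Endpoint, bucket name and credential variables are illustrative.
spark-submit \
  --conf spark.hadoop.fs.s3a.endpoint=https://ceph-rgw.example.com \
  --conf spark.hadoop.fs.s3a.access.key="$S3_ACCESS_KEY" \
  --conf spark.hadoop.fs.s3a.secret.key="$S3_SECRET_KEY" \
  --conf spark.hadoop.fs.s3a.path.style.access=true \
  local:///opt/spark/examples/src/main/python/wordcount.py \
  s3a://my-bucket/input.txt
```

The `fs.s3a.*` properties belong to the Hadoop S3A filesystem, which is the usual way Spark talks to S3-compatible storage; path-style access is typically needed for on-premises object stores that don’t use virtual-hosted bucket DNS.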

Simplifying operations

In terms of operations, we wanted to keep the user experience as true to upstream Apache Spark as possible, so that users can drop in our runtime as a replacement for the upstream Spark Kubernetes container image with minimal fuss. CLI commands like `spark-submit`, `pyspark` and `spark-shell` work exactly as you would expect. We provide an Ubuntu snap package with client tools to help you get started quickly and easily, and it can be installed on the edge nodes of your big data cluster.
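For example, a standard Spark-on-Kubernetes submission looks like the following; the master URL, namespace, service account and container image reference are illustrative and will differ in your environment:

```shell
# Submit the bundled SparkPi example in cluster mode on Kubernetes.
# Master URL, namespace, service account and image are placeholders.
spark-submit \
  --master k8s://https://kubernetes.example.com:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.namespace=spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<your-charmed-spark-image> \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.4.1.jar 1000
```

The `local://` scheme tells Spark the application jar is already inside the container image rather than on the submitting machine.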

The snap package also includes our spark8t Python library and CLI for managing service accounts and profiles for jobs on your Kubernetes cluster. Our aim with this tool is to make the lives of cluster admins, data engineers and data scientists a little bit easier by allowing them to preconfigure Spark job settings for different types of workloads and for the different Kubernetes service accounts that the jobs will run under.
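As a rough sketch of that workflow (the exact subcommand names should be checked against the spark8t and snap documentation), creating a service account and attaching default job settings to it might look like:

```shell
# Register a Kubernetes service account for Spark jobs and attach
# default job settings to it. Subcommand names follow the snap's
# spark-client tooling and are an assumption – verify against the docs.
spark-client.service-account-registry create \
  --username spark --namespace spark
spark-client.service-account-registry add-config \
  --username spark --namespace spark \
  --conf spark.executor.instances=3
```

Jobs submitted under that service account would then pick up the preconfigured settings without each user having to repeat them on the command line.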

We also offer a Juju Charm for Spark History Server. A Juju Charm is like a copilot for an application: it contains codified knowledge about how to operate it. This one helps you to deploy and operate the Spark History Server on Kubernetes in a straightforward way. Read the docs to get started. Juju is a powerful system for day-2 operational management of complex distributed systems on clouds and on Kubernetes. We’ll be adding more Juju Charms to our Spark solution, covering more functionality over time.
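Once you have a Kubernetes-backed Juju controller, deploying the charm is a short sequence; the model name and channel shown here are illustrative:

```shell
# Deploy the Spark History Server charm into a Kubernetes Juju model.
# Model name and channel are illustrative – check Charmhub for the
# recommended channel for your Spark version.
juju add-model history-server
juju deploy spark-history-server-k8s --channel 3.4/stable
juju status   # wait for the unit to report active/idle
```

From there, Juju handles the day-2 concerns – configuration, upgrades and integration with other charms – that you would otherwise script by hand.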

Get working fast

We’ve integrated JupyterLab into the Charmed Spark solution to make it even easier to work with Spark on K8s: you can conveniently spin up a Jupyter environment on an edge node using Docker and have it start a Spark session on your MicroK8s cluster. Learn how to use JupyterLab with Charmed Spark.
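A minimal sketch of spinning up such an environment – the image reference and container paths are hypothetical, so substitute the image documented for your Charmed Spark release:

```shell
# Run JupyterLab in Docker on an edge node. The image reference and
# kubeconfig mount path are assumptions for illustration only.
docker run -d --name charmed-spark-jupyter \
  -v "$HOME/.kube/config":/home/jovyan/.kube/config \
  -p 8888:8888 \
  ghcr.io/canonical/charmed-spark-jupyterlab:edge
```

Mounting your MicroK8s kubeconfig into the container is what lets notebooks started in JupyterLab create Spark sessions that schedule executors on the cluster.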

The full documentation suite for Charmed Spark is available at canonical.com/data/docs/spark/k8s and we also have a reference architecture guide that you can download.

Good to know – we offer enterprise-grade paid support for the entire solution through our Ubuntu Pro + Support subscription, which covers up to 10 years of break/fix support and security maintenance per major release, in line with our wider commitment to long-term support. If you’re interested in learning more, contact our sales team via the form or call us. We can also help with solution deployment through our fixed-fee deployment services. Community support is available via our chat server and our community forum.

More Data Fabric solutions to come

Charmed Spark is actually the first in a series of system solutions for data management that we’ll be releasing over the coming months. You can sign up for the beta program and try out tomorrow’s awesome tech today.

If you’ll be at Gitex Dubai or KubeCon North America this Autumn, you’re welcome to come to the Ubuntu booth (that’s Booth B31 DevSlam for Gitex and Booth A2 for KubeCon) and meet me in person; I’ll be pleased to discuss how we can help you to accelerate innovation.
