on 3 May 2023

We’ve all read the headlines about spectacular data breaches and other security incidents, and the impact that they have had on the victim organisations. From LastPass to SolarWinds, “data security” seems to be the phrase on the lips of every CTO these days. And in some ways there’s no place more vulnerable to attack than a big data environment like a data lake.

From the vault

Data intensive systems have been the target of countless attacks. Some of the most memorable technical exploits include Log4Shell, Heartbleed and ShellShock.

In the Log4Shell incident, certain versions of the popular and widely used Log4j logging library were found to allow remote code execution. By getting a specially crafted string written to the log, an attacker could trick the system into calling back to a destination of the attacker’s choice and running code supplied from there, which in effect opened a remote backdoor with command line access on the target system. The vulnerability was assigned a CVSS score of 10 out of 10, the highest possible rating, by the Apache Software Foundation, which is the upstream maintainer of Log4j. This caused widespread havoc, with active exploitation of the vulnerability taking place worldwide for an extended period.

Heartbleed was a vulnerability in older versions of the widely used OpenSSL library that allowed attackers to read chunks of a remote server’s memory by sending malformed TLS heartbeat requests. The leaked memory could contain private keys, session data and passwords, so the exploit could ultimately lead to the attacker being able to decrypt or intercept communications. The vulnerability was widely exploited and was a key enabler in the compromise of numerous critical data environments, including the patient records of 4.5m US citizens.

With ShellShock, attackers were trivially able to exploit a vulnerability in the Bash shell, which ships with most Linux distributions, to “shock” their way in, run arbitrary commands and potentially gain total control over the target system. In certain scenarios this could be achieved remotely – for example if the system had a web server with CGI script support enabled. Again, the vulnerability was widely exploited by attackers worldwide.

However attackers can have many intentions and their attacks can come in many forms. Let’s explore some of those.

Cryptojacking worms

Cryptojacking is when an attacker exploits the target system to run unauthorised cryptocurrency mining tasks for currencies like Monero. Big data systems like Hadoop, Kubernetes, MongoDB and Elasticsearch have all been the targets of cryptojacking worms like PsMiner. A worm is viral malware that exploits system vulnerabilities to spread without user interaction; here it is used to perform cryptocurrency mining at scale using the parallel, distributed resources of the cluster. Whilst cryptojacking malware may not immediately expose your data to the risk of a breach (although the software is often modular, and other modules might), it will be abusing your cluster’s resources at scale in order to benefit the attacker.


Crypto-ransomware

Crypto-ransomware is another next-level fear for CTOs everywhere. In this kind of attack, the victim’s file systems and databases are encrypted by malware. The attacker then demands a ransom in exchange for decrypting the victim’s data. Past examples include CryptoLocker and WannaCry.

APTs and [corporate] espionage

Aside from data theft, user extortion and resource abuse, Advanced Persistent Threats (APTs for short) are attackers who infiltrate an organisation’s networks on a long term basis. There may be many reasons to do this, but a common reason is espionage – whether corporate espionage in a commercial context; or real spying in the public, NGO, non-profit and legal sectors.

When data from across the organisation is aggregated in a data lake or similar big data system, it can make a highly appealing target for APT intelligence gathering activities and an interesting source of data for exfiltration.

Data breach

A data breach is when data is disclosed in an unauthorised manner. Breaches range in scale, from mistakenly publishing personally identifiable information on a company intranet to maliciously exposing the financial records of tens of millions of banking customers.

The aggregate nature of data held in big data environments like data lakes makes it an ideal target for a large-scale data breach, with some kinds of dataset like customer credit card data or identity data being of particular value to criminals.

Introducing STRIDE

There’s a lot at stake, and even if your data lake or other big data solution isn’t handling sensitive data, there is still a risk that you’ll fall victim to one or other of the attacks noted above if you don’t take sufficient measures to adequately secure the data processing environment.

One thing that you can do to prepare your business is to identify security requirements surrounding the services that you operate or consume, like your data lake. A good place to start is by threat modelling, and one of the most popular frameworks for that is STRIDE. 

STRIDE is a threat assessment framework originally conceived in the late 1990s that you can use to help you make informed and prioritised decisions about the security controls you need to implement to protect your data. STRIDE is an acronym. The letters of the word stand for:

  • S = Spoofing, which means tricking the target system by pretending to be a different user.
  • T = Tampering, which means altering the target system or things like system logs, or the data the system manages.
  • R = Repudiation, which means being able to deny having carried out an action, for example by removing any traces or evidence that the target system has been illegitimately accessed or tampered with.
  • I = Information disclosure, which means getting the target system to disclose information, such as the data that it manages, in an unauthorised manner.
  • D = Denial of service, which means making the target system unavailable – for example by crashing the target system or by overloading the system with spurious requests.
  • E = Elevation of privilege, which means gaining additional unauthorised access privileges on the target system, for example “getting to root”.

The process that teams need to follow for a STRIDE assessment typically starts by building a set of system diagrams that explain how the system works and how it integrates with other systems in its operating context. The team responsible for the system then meets to discuss and analyse the system diagrams. They work together to identify and write down any vulnerabilities that they can think of in the system’s design that could be exploited by an attacker, and categorise them according to the classification described by the acronym STRIDE.

Any issues found are then prioritised, for example using the MoSCoW notation for requirement prioritisation. MoSCoW is an acronym that stands for:

  • M = Must have
  • S = Should have
  • C = Could have
  • W = Won’t have (this time)

Once triaged, the team works on devising fixes for the prioritised issues. The entire STRIDE assessment process is repeated on a regular basis, for example prior to each release.
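
To make the triage step a little more concrete, here is a minimal sketch of a threat register in Python. It is an illustration under stated assumptions rather than a prescribed tool: the `Threat` record, the `triage` helper and the three example findings are all hypothetical, and many teams will prefer to keep the same information in an issue tracker.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class Stride(Enum):
    SPOOFING = "S"
    TAMPERING = "T"
    REPUDIATION = "R"
    INFORMATION_DISCLOSURE = "I"
    DENIAL_OF_SERVICE = "D"
    ELEVATION_OF_PRIVILEGE = "E"


class Priority(IntEnum):
    # Lower values sort first, so MUST items end up at the top of the list.
    MUST = 1
    SHOULD = 2
    COULD = 3
    WONT = 4  # lowest priority, deferred for now


@dataclass(frozen=True)
class Threat:
    component: str
    description: str
    category: Stride
    priority: Priority


def triage(threats):
    """Return the threats ordered by priority, most urgent first."""
    return sorted(threats, key=lambda threat: threat.priority)


# Hypothetical findings from a review session (not taken from a real system).
register = [
    Threat("query gateway", "No rate limiting on the public endpoint",
           Stride.DENIAL_OF_SERVICE, Priority.SHOULD),
    Threat("object store", "Bucket readable without credentials",
           Stride.INFORMATION_DISCLOSURE, Priority.MUST),
    Threat("audit log", "Operators can edit log files in place",
           Stride.REPUDIATION, Priority.COULD),
]

for threat in triage(register):
    print(f"[{threat.priority.name}] {threat.category.name}: "
          f"{threat.component} - {threat.description}")
```

Keeping the STRIDE category next to the priority makes it easy to see, release after release, which kinds of threat keep reappearing.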

There are a few other threat assessment frameworks similar to STRIDE, so if your organisation uses another one, that’s fine too.

Aside from following STRIDE or a similar threat assessment process for your environment, we list out five fundamental best-practice security controls below that can help to secure any large scale data processing environment.

Step 1: Secure the foundations of the platform

It is essential to ensure that the source of the software you use is trustworthy and well maintained. Running with poorly maintained software packages, old container images and unpatched operating system bits, like kernels and libraries, is an open invitation to attackers. As a matter of good practice, you should only be downloading your software from selected, trustworthy sources and mirrors.
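
Package managers such as APT already verify signed repository metadata for you. For artefacts you download by hand, a simple extra check is to compare the file against the checksum published by the source. The sketch below assumes the publisher provides a SHA-256 value; the function names and the sample data are hypothetical.

```python
import hashlib
import hmac
import io


def sha256_of(stream, chunk_size=65536):
    """Hash a binary stream in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(stream, expected_hex):
    """Compare the computed digest with the value published by the source."""
    return hmac.compare_digest(sha256_of(stream), expected_hex.strip().lower())


# Stand-in for a downloaded artefact; in practice, open the file in "rb" mode.
artefact = io.BytesIO(b"example package contents")
published = hashlib.sha256(b"example package contents").hexdigest()

print(matches_published_checksum(artefact, published))
```

A checksum only proves that the file matches what the publisher listed, so it is worth fetching the checksum itself over a trusted channel.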

If your data lake or other data management facility is founded on Ubuntu Server LTS, you can benefit from better fundamental security by ensuring your servers are receiving the latest security patches and updates for all the software you use from both the Ubuntu Main and the Ubuntu Universe repositories.

Ubuntu has always delivered up to five years of security fixes for packages in the Ubuntu Main repository at no cost. With Ubuntu Pro, that commitment is extended to up to 10 years, and also covers packages in the Universe repository.  

Ubuntu Pro allows you to register up to five systems for free. For large scale compute clusters, you can readily buy Ubuntu Pro activation tokens from the online store. Or, if you’re running on Amazon Web Services (AWS), Microsoft Azure or Google Cloud, you can get Ubuntu Pro from the launch console, with hourly billing direct to your cloud usage bill. Best of all, you can activate Ubuntu Pro on your existing Ubuntu deployments without migrating your servers.

Learn more about patching strategies for Ubuntu.

Step 2: Secure ingress and egress points with zones, firewalls and proxies

It might seem obvious, but a surprising number of production database servers and data lake environments are exposed and accessible directly from the internet. If you leave private databases exposed and publicly accessible, it is likely just a matter of time before your environment becomes the victim of a data breach.

You can help to mitigate this risk by implementing zoning using firewalls and network segmentation within the network architecture that underpins your data lake. Firewalls are devices, appliances or applications that regulate network traffic flows to allow or disallow traffic according to rules you define.

Network segmentation is a way of splitting up your network into zones using technologies like VLAN (Virtual Local Area Network). Network architects will typically use firewalls to regulate traffic between the zones that have been defined. Each zone is typically associated with a level of risk. A common setup would be to define three zones – Untrusted, DMZ (Demilitarized Zone), and Private. For data lake environments, there is often a fourth zone defined – Backend.

  • The Untrusted Zone typically includes networks and network elements on public networks like the internet.
  • Endpoints and network elements in the DMZ are usually able to accept connections from the Untrusted Zone, and initiate connections to the Private Zone.
  • The Private Zone usually hosts endpoints and network elements that cannot initiate connections outside the zone. However, if a Backend Zone is implemented, then network architects may allow endpoints in the Private Zone to initiate limited connections to services in the Backend Zone.
  • Backend Zones are typically implemented to isolate highly sensitive services like a data lake or other data management environment from other network flows. Typically, connections may not be initiated from the backend zone to any of the other zones.
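
The zone rules above can be sketched as a small default-deny policy table. This is only a model for reasoning about the design, assuming the four-zone layout described in this section; real enforcement happens in your firewalls, and the names `Zone`, `ALLOWED_FLOWS` and `may_initiate` are hypothetical.

```python
from enum import Enum


class Zone(Enum):
    UNTRUSTED = "untrusted"
    DMZ = "dmz"
    PRIVATE = "private"
    BACKEND = "backend"


# (source, destination) pairs that are allowed to INITIATE a connection.
# Anything not listed here is denied.
ALLOWED_FLOWS = {
    (Zone.UNTRUSTED, Zone.DMZ),
    (Zone.DMZ, Zone.PRIVATE),
    (Zone.PRIVATE, Zone.BACKEND),
}


def may_initiate(source, destination):
    """Allow traffic inside a zone; allow cross-zone traffic only if listed."""
    if source is destination:
        return True
    return (source, destination) in ALLOWED_FLOWS


print(may_initiate(Zone.UNTRUSTED, Zone.DMZ))
print(may_initiate(Zone.UNTRUSTED, Zone.BACKEND))
print(may_initiate(Zone.BACKEND, Zone.PRIVATE))
```

Writing the policy down as data like this also gives you something to review against the actual firewall rules.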

Where a service or endpoint needs to be able to connect to the internet or other resources in the Untrusted Zone, network architects will sometimes deploy a proxy server to mediate and regulate access. Proxy servers can be configured with allow and deny lists in order to restrict access to resources in other zones. A well configured proxy server can help to slow or contain the spread of malware, for example by alerting operators when unusual URI requests are being made, by blocking URIs known to be used as C2 (Command and Control) servers for remotely delivering instructions to the malware, and by refusing requests to any destination that is not on the allow list.
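
As a rough illustration of how allow and deny lists combine, the following sketch decides whether an outbound request should pass. It is not the configuration syntax of any particular proxy product, and the host names in both lists are placeholder assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical policy: hosts the cluster is expected to reach, and known-bad hosts.
ALLOWED_HOSTS = ("archive.ubuntu.com", "security.ubuntu.com")
DENIED_HOSTS = ("c2.example.net",)


def _matches(host, entry):
    """True for the entry itself or any of its subdomains."""
    return host == entry or host.endswith("." + entry)


def egress_decision(url):
    """Return 'allow', 'deny-listed' or 'deny-unknown' for an outbound URL."""
    host = urlsplit(url).hostname or ""
    if any(_matches(host, entry) for entry in DENIED_HOSTS):
        return "deny-listed"
    if host and any(_matches(host, entry) for entry in ALLOWED_HOSTS):
        return "allow"
    # Default deny: anything not explicitly approved is refused.
    return "deny-unknown"


for url in ("http://archive.ubuntu.com/ubuntu/",
            "https://c2.example.net/beacon",
            "https://unknown.example.org/"):
    print(url, "->", egress_decision(url))
```

Returning a reason rather than a plain yes or no makes it straightforward to raise an alert when a request is refused.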

Reverse proxy servers are also frequently deployed in the DMZ, in order to mediate and regulate requests initiated from the Untrusted Zone. Reverse proxy servers can also help to manage load on backend services, and can act as a circuit breaker to help defend against intentional Denial of Service attacks or other surges in network traffic.
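
The circuit breaker idea can be sketched in a few lines. This is a simplified model under stated assumptions, not how any specific reverse proxy implements it: the class name, the thresholds and the injected `clock` parameter are hypothetical, and production implementations usually add a half-open state that lets a trickle of requests through before closing fully.

```python
import time


class CircuitBreaker:
    """Reject requests for a cool-down period after repeated backend failures."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so the logic is testable
        self.failures = 0
        self.opened_at = None           # None means the breaker is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_after:
            # Cool-down has elapsed: close the breaker and accept traffic again.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()

    def record_success(self):
        self.failures = 0


breaker = CircuitBreaker(max_failures=2, reset_after=10.0)
breaker.record_failure()
breaker.record_failure()
print(breaker.allow_request())
```

While the breaker is open, the backend is shielded from further load and gets a chance to recover.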

Step 3: Use encryption at rest and in transit everywhere

End-to-end encryption in transit means that all network communications take place over an encrypted transport such as TLS, which can significantly help to protect the confidentiality of your data, even when all services are running within your own network. Implementing end-to-end encryption makes it harder for unauthorised users (which can include staff; not only APTs and other outside attackers) to compromise your data. Data can include credentials and user data – not only the data in your data lake – which can then be used for further attacks against your environment, such as elevation of privilege.
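
For services written in Python, a client-side TLS configuration along these lines is one way to enforce encryption in transit. It is a minimal sketch using the standard library `ssl` module; the helper name `strict_client_context` and the choice of TLS 1.2 as the minimum version are assumptions you should adapt to your own policy.

```python
import ssl


def strict_client_context(ca_file=None):
    """Build a client context that verifies certificates and host names."""
    # create_default_context enables certificate and host name verification.
    # Pass ca_file to trust a private CA instead of the system trust store.
    context = ssl.create_default_context(cafile=ca_file)
    # Refuse protocol versions older than TLS 1.2 (assumed policy).
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


context = strict_client_context()
print(context.verify_mode, context.check_hostname, context.minimum_version)
```

The resulting context can be passed to standard library clients, for example via the `context` argument of `urllib.request.urlopen`.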

Your DNS lookups can also be secured using DNS over TLS (DoT). See the Ubuntu manpages for guidance on how to enable it.

Encryption at rest can mean securing persistent storage media like nearline SAS drives and NVMe/SSD block devices through the use of encrypted filesystems, or it can mean using application-level symmetric encryption to encrypt the data files in your storage system, for example a Ceph object store, Amazon Simple Storage Service (S3), or Hadoop HDFS. Encryption at rest can help to protect your data from being read by an attacker now, or in the future – so long as your chosen algorithm uses a sufficiently strong key length. The Center for Internet Security (CIS) is a good place to start for guidance on implementing encryption at rest for your chosen technologies and environment.

Step 4: Implement authentication and authorisation

Authentication means the process of verifying that a user is who they claim to be. If a user claims to be the system administrator, and the system has no controls, or weak controls to verify their assertion, obviously it means that the user will likely be able to perform the tasks of a system administrator, whether they are entitled to do so or not. Without authentication, that user could access sensitive data managed by the system; and create, delete or modify system files (including software binaries). They could also modify system audit logs to remove traces of what they’ve done.
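
To illustrate what verifying a claim of identity involves at the lowest level, here is a minimal sketch of salted password verification using only the Python standard library. In practice you would normally delegate this to a central identity service rather than write it yourself; the function names and the iteration count are assumptions for illustration.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # assumed work factor; tune to your own hardware and policy


def hash_password(password, salt=None):
    """Derive a digest from the password with a per-user random salt."""
    salt = salt or secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest and compare it in constant time."""
    _, candidate = hash_password(password, salt)
    return hmac.compare_digest(candidate, expected_digest)


salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("guess", salt, stored))
```

Only the salt and the digest are stored, so a leaked credential table does not directly reveal the passwords.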

Authorisation means the process of verifying that an (authenticated) user is entitled to perform the action that they are trying to perform. Typical authorisation frameworks include filesystem permissions; MAC (Mandatory Access Control) – for example to restrict access to sensitive resources and kernel calls; and RBAC (Role-Based Access Control) – which typically defines the roles that associated users perform on the system. Depending on the role, users will be entitled to access differing functions within the system.
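
A role-based check can be sketched as two lookups: from user to roles, and from role to permissions. The roles, users and permission strings below are hypothetical and only show the shape of the idea; real deployments would load them from a central identity system.

```python
# Hypothetical role definitions: each role grants a set of permissions.
ROLE_PERMISSIONS = {
    "analyst": {"dataset:read"},
    "engineer": {"dataset:read", "dataset:write"},
    "admin": {"dataset:read", "dataset:write", "user:manage"},
}

# Hypothetical role assignments for already-authenticated users.
USER_ROLES = {
    "alice": {"analyst"},
    "bob": {"engineer", "admin"},
}


def is_authorised(user, permission):
    """Default deny: unknown users and unknown roles grant nothing."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_authorised("alice", "dataset:read"))
print(is_authorised("alice", "dataset:write"))
print(is_authorised("mallory", "dataset:read"))
```

Because permissions hang off roles rather than individuals, revoking access is a matter of removing a role assignment.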

According to the Verizon Data Breach Investigations Report, 60% of attacks involve the use of stolen credentials, with 93% of data breaches successfully compromising datasets held in online systems like data lakes and databases. It is therefore important to reduce attack surface areas and centralise identity controls. Organisations can then federate authentication and authorisation across their systems using protocols like Kerberos, OAuth and OpenID Connect (OIDC).

Step 5: Build a business continuity plan

Whilst there are many more advanced techniques and controls that you can implement to build Defense in Depth for your big data environment (check out Security Orchestration, Automation and Response – SOAR, for example), one final foundational step that will help in pretty much every case is to assemble a business continuity plan, or BCP.

A BCP is a plan of action that should cover all aspects of your business – not only your data lake. It should cover things like how you will ensure staff can keep working if a disaster strikes (for example, if your place of business burns down). But it also needs to cover events like data breaches or crypto-ransom. With a well-defined and well-rehearsed plan of action in place, your teams will know exactly what to do in the heat of the crisis, when it inevitably happens. When everyone knows what has to be done and by whom, damage can be contained, limited and managed efficiently.

Sum up

We’ve taken quite a tour! We learned about different types of attack on big data environments – things like cryptojacking, data breach, APT espionage, and crypto-ransom – as well as some classic vulnerabilities that attackers have exploited in the past to cause widespread havoc.

We learned about the STRIDE threat assessment framework and how you can use it to assess your data lake environment for vulnerabilities and issues, and use the findings to prioritise fixes.

And lastly, we explored five different controls you can implement to help limit, contain and prevent the bad guys from compromising your service:

  • Secure the source of your software, for example by configuring your servers to benefit from Ubuntu Pro expanded coverage and extended security maintenance
  • Regulate the network traffic with zoning, firewalls and proxies
  • Use encryption at rest and in transit, everywhere
  • Implement authentication and authorisation
  • Build a business continuity plan

Closing thoughts

It’s a big wide world out there, and bad actors will take advantage of a weak security posture if they can. Implementing good foundational security measures can help to prevent that, or limit the damage. However, it’s important to remember that technical security controls alone are likely insufficient – good security posture is a combination of well trained, security aware people and solid operational procedures, as well as effective technical measures and safeguards.

The human factor – whether through misconfiguration or operator error, or through espionage and insider jobs – is the leading cause of security incidents. Adopting software operators that automate the management of your services on cloud infrastructure and on Kubernetes can significantly reduce both the attack surface and the likelihood of human actions – whether intentional or by mistake – leading to system compromise.
