
Canonical
on 26 March 2021

Livepatch 2021-03-24 incident investigation report


Description

A defective livepatch for kernel 4.4 in Ubuntu 16.04 LTS (Xenial) was not caught by our internal testing because the defect was a race condition that is triggered only by workload-specific behaviour under load.

The livepatch could cause the madvise system call to block indefinitely, locking up any process that used the call. The conditions that trigger the defect were not reproduced in our test environment.

After passing internal testing, the livepatch was published to our free tier users (typically personal systems). Canonical services also run in this tier as an early warning system, and the defect was noticed at that stage. Customers with systems configured for this tier were also impacted. The defective livepatch was retracted one and a half hours after publication. The standard update process, however, is designed to patch all online systems within one hour, so online systems in this tier had most likely already received it.

The faulty livepatch addressed a medium-severity CVE (CVE-2020-29372), whose fix came in through our normal SRU process. The livepatch was tested in combination with a fix for an embargoed high-severity CVE, and at no point during testing of the combined livepatch did we see any issues. Our analysis of the lockup found that a system running under load may already hold a lock when the livepatch is applied, and that such a lock is not handled correctly afterwards. If the lock is obtained after the livepatch is applied, no issue occurs. Linux kernel livepatching is a complex process, which we stabilize through our testing and our tiered deployment process.

In addition to our internal testing, we follow a tiered deployment policy for releasing livepatches, which reduces the risks associated with kernel livepatching. Customers should receive a livepatch only after all the internal embargoed test systems and the free tier have applied it successfully.
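
The tiered policy can be read as a promotion gate. The following is a minimal sketch in Python, not Canonical's implementation: the tier names come from this report, while `TierResult`, `next_tier`, and the rule that a tier qualifies only when at least one system applied the patch and none failed are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Tier order as named in this report: internal, free subscription, customers.
TIERS = ["proposed", "updates", "stable"]


@dataclass
class TierResult:
    """Hypothetical summary of how a livepatch fared in one tier."""
    tier: str
    applied_ok: int   # systems that applied the patch successfully
    failed: int       # systems that reported a failure


def next_tier(results: List[TierResult]) -> Optional[str]:
    """Return the next tier the patch may be released to, or None if blocked.

    Assumption: a tier qualifies only if at least one system applied the
    patch and no system reported a failure.
    """
    by_tier = {r.tier: r for r in results}
    for tier in TIERS:
        result = by_tier.get(tier)
        if result is None:
            return tier        # first tier without results: release here next
        if result.failed > 0 or result.applied_ok == 0:
            return None        # blocked: do not promote any further
    return None                # already released to every tier


print(next_tier([TierResult("proposed", applied_ok=12, failed=0)]))  # updates
```

A gate of this kind only sees what the systems report. A patch that applies cleanly and fails later under load would still pass it, so the health of systems after application matters as much as the apply result itself.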

Root cause analysis

The faulty livepatch addressed a medium-severity CVE (CVE-2020-29372) and progressed through the following tiers of our process: (a) testing, (b) internal deployment [tier name: proposed], and (c) free subscription deployment [tier name: updates], where the problem with the patch was identified. The livepatch was then removed and never reached the customers’ tier [tier name: stable].

The livepatch failed under certain race conditions that appear under load, causing the patched madvise system call to block indefinitely. That in turn locked up the processes using the call. We found that a system running under load may hold a lock, obtained before the livepatch was applied, that is not handled correctly once the livepatch is in place. If the lock was obtained after the livepatch was applied, no issue was observed. The systems on which the livepatch misbehaved were not in a quiescent state when it was applied.

The offending livepatch passed our testing for the following reason:

  • Our testing is done on newly provisioned systems, whereas reaching the lockup required a running system whose processes held the offending lock in a different state.

The internal deployment testing of the livepatch at the ‘proposed’ tier did not catch the defect because no workload in that tier could simulate the failure conditions.

The impact on the free subscription tier was high because:

  • There is no mechanism in place to deploy gradually within a tier and so reduce the impact of a faulty livepatch. Every system in a tier receives a new patch within an hour; since the livepatch in question was removed only after one and a half hours, we estimate that all active free tier systems received it.
  • There is no automatic mechanism in place to recover from a livepatch that applies successfully but fails later.
  • The manual mechanisms (see the section “Addressing affected systems”) were tedious and required disabling the livepatch service; otherwise the offending livepatch would be reapplied after reboot.
  • The token used by the affected customer was assigned to the free subscription tier.
  • Neither the customer, via the livepatch interface, nor our operations team could identify that the customer's systems were on the free subscription tier. Our operations team is prevented from mapping livepatch tokens to identities as part of the protection of personally identifiable information.
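
The estimate that every active free tier system received the patch follows from simple arithmetic. As a hedged sketch, assume each system checks for updates once per interval at a uniformly random offset; this is an assumption of the model, since the report states only that systems in a tier get a patch within an hour. The function name and parameters are illustrative.

```python
def exposed_fraction(minutes_available: float,
                     check_interval_minutes: float = 60.0) -> float:
    """Fraction of online systems that fetch a patch before it is retracted.

    Model assumption: each system checks once per interval, at a uniformly
    random offset within that interval.
    """
    if minutes_available <= 0:
        return 0.0
    return min(1.0, minutes_available / check_interval_minutes)


# The patch was available for 90 minutes and systems update hourly.
print(exposed_fraction(90))   # 1.0
# Under the same model, a retraction after 15 minutes reaches a quarter.
print(exposed_fraction(15))   # 0.25
```

Under this model, a retraction helps only if it happens well within one update interval, which is why a slower, staggered rollout limits the impact of a faulty patch.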

Addressing affected systems

  • Via the grub2 boot menu:
    • Select and boot your backup kernel.
    • The livepatch client will remove the failed patch, and pick up the most recent livepatch for your backup kernel.
  • If the kernel command line can be accessed:
    • Turn off the livepatch service from the kernel command line by adding:
      systemd.mask=snap.canonical-livepatch.canonical-livepatchd.service
    • Remove the offending livepatch by issuing:
      sudo rm -rf /var/snap/canonical-livepatch/common/payload
    • Reboot
  • If the system is running:
    • Disable livepatch by issuing:
      sudo snap stop --disable canonical-livepatch
    • Remove the offending livepatch by issuing:
      sudo rm -rf /var/snap/canonical-livepatch/common/payload
    • Reboot

Improvements on the Canonical livepatch system

The recommendations are listed below and classified by urgency. We intend to implement them to prevent similar issues from occurring.

  • Immediate
    • Restrict the CVEs addressed by livepatch to critical and high severity; do not ship fixes for medium-severity CVEs through this system, to balance the risk of a faulty patch against the impact on operations.
    • Enhance the internal testing tier [tier: proposed] with production or production-replicating systems within Canonical, or introduce a new tier in between. Include the status of these systems after patch application in the tier qualification process that allows a patch to proceed to the next tier.
  • Short term
    • Enhance our livepatch testing phase to include long-running systems, rather than relying only on newly provisioned ones.
    • Make the removal of potentially faulty patches a process that is easy to remember and to perform across multiple systems. Document it publicly. This process is performed during a period of high stress and must be simple enough for our customers to succeed.
    • Improve the single-tier deployment strategy for the free subscription and customer tiers, to include:
      • Not providing patches to all systems at the same time, so that the whole tier cannot apply a potentially faulty patch within an hour. Roll patches out to different systems with enough delay to protect very large groups of users from getting a potentially faulty patch at the same time.
      • In addition to the above, enhancing the livepatch client so that it not only prevents the application of known “faulty livepatches” but also allows deployment factors such as working hours to be configured, to prevent late-night calls to IT teams.
  • Long term
    • Automate faulty patch detection, alerting, and blacklisting. As the detection will be heuristic, report and use data from the central service to make it reliable.
    • Automatically detect and notify customers whose subscriptions are incorrectly enrolled in the free subscription tier.
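
The gradual-deployment and deployment-window recommendations could look roughly like the sketch below. It is an illustration only: the cohort count, the hashing of a machine identifier, the delay between waves, the allowed-hours setting, and the `blocked` flag are all hypothetical and do not describe the livepatch client or service.

```python
import hashlib
from datetime import datetime, timedelta

COHORTS = 10                        # hypothetical number of rollout waves
COHORT_DELAY = timedelta(hours=6)   # hypothetical delay between waves


def cohort_for(machine_id: str) -> int:
    """Map a machine identifier to a stable cohort in [0, COHORTS)."""
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % COHORTS


def may_apply(machine_id: str, published: datetime, now: datetime,
              allowed_hours: range = range(9, 17),
              blocked: bool = False) -> bool:
    """Decide whether this machine should apply the patch at time `now`."""
    if blocked:
        return False    # patch has been marked as known faulty
    if now < published + cohort_for(machine_id) * COHORT_DELAY:
        return False    # this machine's wave has not started yet
    return now.hour in allowed_hours    # apply only during working hours
```

Hashing the identifier keeps each machine in the same wave on every check without any server-side assignment; that is a design choice of this sketch. With these numbers, a faulty patch retracted within the first six hours would have reached only the first wave.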
