Re: [RFC PATCH 0/2] livepatch: Add support for hybrid mode
From: Yafang Shao
Date: Mon Feb 03 2025 - 04:46:13 EST
On Fri, Jan 31, 2025 at 9:18 PM Miroslav Benes <mbenes@xxxxxxx> wrote:
>
> > >
> > > + What exactly is meant by frequent replacements (busy loop?, once a minute?)
> >
> > The script:
> >
> > #!/bin/bash
> > while true; do
> >     yum install -y ./kernel-livepatch-6.1.12-0.x86_64.rpm
> >     ./apply_livepatch_61.sh    # it will sleep 5s
> >     yum erase -y kernel-livepatch-6.1.12-0.x86_64
> >     yum install -y ./kernel-livepatch-6.1.6-0.x86_64.rpm
> >     ./apply_livepatch_61.sh    # it will sleep 5s
> > done
>
> A live patch application is a slowpath. It is expected not to run
> frequently (in a relative sense).
The frequency isn’t the main concern here; _scalability_ is the key issue.
Rolling out livepatches even once per day (a relatively low frequency) across
all of our production servers (hundreds of thousands of machines) isn’t
feasible. Instead, we need to periodically run tests on a subset of test
servers.
> If you stress it like this, it is quite
> expected that it will have an impact. Especially on a large busy system.
It seems you agree that the current atomic-replace process lacks scalability.
When deploying a livepatch across a large fleet of servers, it’s impossible to
ensure that the servers are idle, as their workloads are constantly varying and
are not deterministic.
The challenges are very different when managing 1K servers versus 1M servers.
Similarly, the issues differ significantly between patching a single function
and patching 100 functions, especially when some of those functions are
critical. That’s what scalability is all about.
Since we transitioned from the old livepatch mode to the new atomic-replace
mode, our SREs have consistently reported that one or more servers become
stalled during the upgrade (replacement).
>
> > >
> > > > Other potential risks may also arise
> > > > due to inconsistencies or race conditions during transitions.
> > >
> > > What inconsistencies and race conditions you have in mind, please?
> >
> > I have explained it at
> > https://lore.kernel.org/live-patching/Z5DHQG4geRsuIflc@xxxxxxxxxxxxxxx/T/#m5058583fa64d95ef7ac9525a6a8af8ca865bf354
> >
> > klp_ftrace_handler
> >     if (unlikely(func->transition)) {
> >         WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
> >     }
> >
> > Why is WARN_ON_ONCE() placed here? What issues have we encountered in the past
> > that led to the decision to add this warning?
>
> A safety measure for something which really should not happen.
Unfortunately, this issue occurs during my stress tests.
>
> > > The main advantage of the atomic replace is simplify the maintenance
> > > and debugging.
> >
> > Is it worth the high overhead on production servers?
>
> Yes, because the overhead once a live patch is applied is negligible.
If you’re managing a large fleet of servers, this issue is far from negligible.
>
> > Can you provide examples of companies that use atomic replacement at
> > scale in their production environments?
>
> At least SUSE uses it as a solution for its customers. Not many problems
> have been reported since we started ~10 years ago.
Perhaps we’re running different workloads.
Going back to the original purpose of livepatching: is it designed to address
security vulnerabilities, or to deploy new features?
If it’s the latter, then there’s definitely a lot of room for improvement.
--
Regards
Yafang