Re: [RFC PATCH 0/2] livepatch: Add support for hybrid mode

From: Petr Mladek
Date: Tue Feb 04 2025 - 08:28:48 EST


On Mon 2025-02-03 17:44:52, Yafang Shao wrote:
> On Fri, Jan 31, 2025 at 9:18 PM Miroslav Benes <mbenes@xxxxxxx> wrote:
> >
> > > >
> > > > + What exactly is meant by frequent replacements (a busy loop? once a minute?)
> > >
> > > The script:
> > >
> > > #!/bin/bash
> > > # Endlessly alternate between two livepatch builds.
> > > while true; do
> > >     yum install -y ./kernel-livepatch-6.1.12-0.x86_64.rpm
> > >     ./apply_livepatch_61.sh  # it will sleep 5s
> > >     yum erase -y kernel-livepatch-6.1.12-0.x86_64
> > >     yum install -y ./kernel-livepatch-6.1.6-0.x86_64.rpm
> > >     ./apply_livepatch_61.sh  # it will sleep 5s
> > > done
> >
> > A live patch application is a slowpath. It is expected not to run
> > frequently (in a relative sense).
>
> The frequency isn’t the main concern here; _scalability_ is the key issue.
> Running livepatches once per day (a relatively low frequency) across all of our
> production servers (hundreds of thousands) isn’t feasible. Instead, we need to
> periodically run tests on a subset of test servers.

I am confused. The original problem was a system crash when
livepatching the do_exit() function, see
https://lore.kernel.org/r/CALOAHbA9WHPjeZKUcUkwULagQjTMfqAdAg+akqPzbZ7Byc=qrw@xxxxxxxxxxxxxx

The RCU watchdog warning was first mentioned in this patchset.
Do you see the RCU watchdog warning in production or just
with this artificial test, please?


> > If you stress it like this, it is quite
> > expected that it will have an impact. Especially on a large busy system.
>
> It seems you agree that the current atomic-replace process lacks scalability.
> When deploying a livepatch across a large fleet of servers, it’s impossible to
> ensure that the servers are idle, as their workloads are constantly varying and
> are not deterministic.

Do you see the scalability problem in production, please?
And could you prove that it was caused by livepatching, please?


> The challenges are very different when managing 1K servers versus 1M
> servers. Similarly, the issues differ significantly between patching a
> single function and patching 100 functions, especially when some of
> those functions are critical. That’s what scalability is all about.
>
> Since we transitioned from the old livepatch mode to the new
> atomic-replace mode,

What do you mean by the old livepatch mode, please?

Did you allow installing more livepatches in parallel?
What was the motivation for switching to the atomic replace, please?

> our SREs have consistently reported that one or more servers become
> stalled during the upgrade (replacement).

What is SRE, please?
Could you please share a log from a production system?


> > > > > Other potential risks may also arise
> > > > > due to inconsistencies or race conditions during transitions.
> > > >
> > > > What inconsistencies and race conditions you have in mind, please?
> > >
> > > I have explained it at
> > > https://lore.kernel.org/live-patching/Z5DHQG4geRsuIflc@xxxxxxxxxxxxxxx/T/#m5058583fa64d95ef7ac9525a6a8af8ca865bf354
> > >
> > > klp_ftrace_handler()
> > >     if (unlikely(func->transition)) {
> > >         WARN_ON_ONCE(patch_state == KLP_UNDEFINED);
> > >     }
> > >
> > > Why is WARN_ON_ONCE() placed here? What issues have we encountered in the past
> > > that led to the decision to add this warning?
> >
> > A safety measure for something which really should not happen.
>
> Unfortunately, this issue occurs during my stress tests.

I am confused. Do you see the above WARN_ON_ONCE() trigger during
your stress test? Could you please provide a log?
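
For reference, the quoted check sits in klp_ftrace_handler() in
kernel/livepatch/patch.c. Roughly paraphrased from mainline (the
exact code and comments vary between kernel versions):

    if (unlikely(func->transition)) {
        /*
         * Make sure we read the task's patch_state only after
         * seeing func->transition set; pairs with the write
         * barrier in klp_init_transition().
         */
        smp_rmb();

        patch_state = current->patch_state;

        /*
         * Every task must have a defined patch_state while any
         * function is in transition. KLP_UNDEFINED here would
         * mean the transition setup is broken.
         */
        WARN_ON_ONCE(patch_state == KLP_UNDEFINED);

        if (patch_state == KLP_UNPATCHED) {
            /* Use the previous version on the func_stack. */
            func = list_entry_rcu(func->stack_node.next,
                                  struct klp_func, stack_node);

            if (&func->stack_node == &ops->func_stack)
                goto unlock;
        }
    }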

> > > > The main advantage of the atomic replace is that it simplifies
> > > > the maintenance and debugging.
> > >
> > > Is it worth the high overhead on production servers?
> >
> > Yes, because the overhead once a live patch is applied is negligible.
>
> If you’re managing a large fleet of servers, this issue is far from negligible.
>
> >
> > > Can you provide examples of companies that use atomic replacement at
> > > scale in their production environments?
> >
> > At least SUSE uses it as a solution for its customers. Not many problems
> > have been reported since we started ~10 years ago.
>
> Perhaps we’re running different workloads.
> Going back to the original purpose of livepatching: is it designed to address
> security vulnerabilities, or to deploy new features?

We (SUSE) use livepatches only for fixing CVEs and serious bugs.


> If it’s the latter, then there’s definitely a lot of room for improvement.

You might be right. I am just not sure whether the hybrid mode would
be the right solution.

If you have problems with the atomic replace, then you might stop using
it completely and just install more livepatches in parallel.
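
For illustration, whether a livepatch stacks in parallel or atomically
replaces everything is controlled by the .replace flag in struct
klp_patch. Below is a minimal sketch modeled on
samples/livepatch/livepatch-sample.c (the patched function and its
name are made up):

    #include <linux/module.h>
    #include <linux/livepatch.h>

    /* New implementation of a (hypothetical) buggy function. */
    static int livepatch_fix_foo(void)
    {
        return 0;
    }

    static struct klp_func funcs[] = {
        {
            .old_name = "foo",
            .new_func = livepatch_fix_foo,
        }, { }
    };

    static struct klp_object objs[] = {
        {
            /* name == NULL means the functions live in vmlinux */
            .funcs = funcs,
        }, { }
    };

    static struct klp_patch patch = {
        .mod = THIS_MODULE,
        .objs = objs,
        /*
         * false (default): stack on top of existing livepatches.
         * true: atomically replace all existing livepatches.
         */
        .replace = false,
    };

    static int livepatch_init(void)
    {
        return klp_enable_patch(&patch);
    }

    static void livepatch_exit(void)
    {
    }

    module_init(livepatch_init);
    module_exit(livepatch_exit);
    MODULE_LICENSE("GPL");
    MODULE_INFO(livepatch, "Y");

With .replace unset, every loaded livepatch module stays active and the
ftrace handler uses the newest version of each function from its
func_stack. With .replace set, enabling the new patch disables all
older livepatches in one transition.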


My view:

More livepatches installed in parallel are more prone to
inconsistencies. A good example is the thread about introducing
the stack order sysfs interface, see
https://lore.kernel.org/all/AAD198C9-210E-4E31-8FD7-270C39A974A8@xxxxxxxxx/

The atomic replace helps to keep the livepatched functions consistent.

The hybrid model would allow installing more livepatches in parallel,
except that one livepatch could be replaced atomically. It would create
even more scenarios than allowing all livepatches in parallel.

What would be the rules, please?

Which functionality will be livepatched by the atomic replace, please?

Which functionality will be handled by the extra non-replaceable
livepatches, please?

How would you keep the livepatches consistent, please?

How would you manage dependencies between livepatches, please?

What is the advantage of the hybrid model over allowing
all livepatches in parallel, please?

Best Regards,
Petr