Re: [RFC PATCH 2/2] livepatch: Implement livepatch hybrid mode

From: Yafang Shao
Date: Fri Feb 07 2025 - 22:39:32 EST


On Sat, Feb 8, 2025 at 12:59 AM Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
>
> On Fri, Feb 07, 2025 at 11:16:45AM +0800, Yafang Shao wrote:
> > On Fri, Feb 7, 2025 at 10:31 AM Josh Poimboeuf <jpoimboe@xxxxxxxxxx> wrote:
> > > Why does this happen?
> >
> > It occurs during the KLP transition. It seems like the KLP transition
> > is taking too long.
> >
> > [20329703.332453] livepatch: enabling patch 'livepatch_61_release6'
> > [20329703.340417] livepatch: 'livepatch_61_release6': starting
> > patching transition
> > [20329715.314215] rcu_tasks_wait_gp: rcu_tasks grace period 1109765 is
> > 10166 jiffies old.
> > [20329737.126207] rcu_tasks_wait_gp: rcu_tasks grace period 1109769 is
> > 10219 jiffies old.
> > [20329752.018236] rcu_tasks_wait_gp: rcu_tasks grace period 1109773 is
> > 10199 jiffies old.
> > [20329754.848036] livepatch: 'livepatch_61_release6': patching complete
>
> How specifically does the KLP transition trigger rcu_tasks workings?

I believe the reason is the livepatch transition holds the spinlock
tasklist_lock too long. Though I haven't tried to prove it yet.

>
> > Before the new atomic replace patch is added to the func_stack list,
> > the old patch is already set to nop. If klp_ftrace_handler() is
> > triggered at this point, it will effectively do nothing—in other
> > words, it will execute the original function.
> > I might be wrong.
>
> That's not actually how it works. klp_add_nops() probably needs some
> better comments.

With Petr's help, I now understand how it works.

What do you think about adding the following comments? These comments
are copied from Petr's reply [0].

diff --git a/kernel/livepatch/core.c b/kernel/livepatch/core.c
index f2071a20e5f0..64a026af53e1 100644
--- a/kernel/livepatch/core.c
+++ b/kernel/livepatch/core.c
@@ -559,6 +559,9 @@ static int klp_add_object_nops(struct klp_patch *patch,
* Add 'nop' functions which simply return to the caller to run
* the original function. The 'nop' functions are added to a
* patch to facilitate a 'replace' mode.
+ *
+ * The nop entries are added only for functions which are currently
+ * livepatched but they are no longer livepatched in the new livepatch.
*/
static int klp_add_nops(struct klp_patch *patch)
{


[0]. https://lore.kernel.org/live-patching/CALOAHbBs5sfAxSw4HA6KwjWbH3GmhkHJqcni0d4iB7oVZ_3vjw@xxxxxxxxxxxxxx/T/#m96263bd4e0b2a781e5847aee4fe74f7a17ed186c

>
> It adds nops to the *new* patch so that all the functions in the old
> patch(es) get replaced, even those which don't have a corresponding
> function in the new patch.
>
> The justification for your patch seems to be "here are some bugs, this
> patch helps work around them", which isn't very convincing.

The statement "here are some bugs, the hybrid model can workaround
them" is correct. However, the most important part is "This selective
approach would reduce unnecessary transitions," which will be a
valuable improvement to the livepatch system.

In the old 4.19 kernel, we faced an issue where we had to unload an
already-loaded livepatch and replace it with a new one. Without atomic
replace, we had to first unload the old livepatch (along with others)
and then load the new one, which was less than ideal. In the 6.1
kernel, atomic replace is supported, and it works perfectly for that
situation. This is why we prefer using the atomic replace model over
the old one.

However, atomic replace is not entirely ideal, as it replaces all old
livepatches, even if they are not relevant to the new changes. In
reality, we should not replace livepatches unless they are relevant.
We only need to replace the ones that conflict with the new livepatch.
This is where the hybrid model excels, allowing us to achieve this
selective replacement.

> Instead we
> need to understand the original bugs and fix them.

Yes, we should continue working on fixing them.

--
Regards

Yafang