Re: [PATCH v3] rethook: Remove the running task check in rethook_find_ret_addr()

From: Tengda Wu

Date: Tue Jun 09 2026 - 07:13:01 EST




On 2026/6/9 17:43, Petr Mladek wrote:
> Added live-patching mailing list.
>
> On Tue 2026-06-09 16:49:53, Tengda Wu wrote:
>> The current check in rethook_find_ret_addr() prevents obtaining a return
>> address when the target task is marked as running. However, this condition
>> is both insufficient for correctness and unnecessary for its intended
>> purpose.
>>
>> The check is inherently racy: a task can begin running on another CPU
>> immediately after task_is_running() returns false, potentially leading to
>> concurrent modification of rethook data structures while the iteration is
>> in progress.
>>
>> Rather than trying to fix this unreliable check deep in the unwinding
>> path, simply remove it. The iteration is already safe from crashes because
>> unwind_next_frame() holds RCU and rethook_node structures are RCU-freed;
>> even if the iteration goes off the rails and returns invalid information,
>> it will not crash. Callers that require consistency must provide a safe
>> context themselves.
>>
>> Fixes: 54ecbe6f1ed5 ("rethook: Add a generic return hook")
>> Acked-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>> Signed-off-by: Tengda Wu <wutengda@xxxxxxxxxxxxxxx>
>> ---
>> v3: Improve commit message: clarify safety semantics and document that RCU guarantees no crash.
>> v2: https://lore.kernel.org/all/20260609005728.458962-1-wutengda@xxxxxxxxxxxxxxx/
>> v1: https://lore.kernel.org/all/20260525132253.1889726-1-wutengda@xxxxxxxxxxxxxxx/
>>
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,9 +250,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>> if (WARN_ON_ONCE(!cur))
>> return 0;
>>
>> - if (tsk != current && task_is_running(tsk))
>> - return 0;
>> -
>
> The description of the function should be updated as well. It still
> mentions:
>
> * The @tsk must be 'current' or a task which is not running.
>
> Instead it should explain that it safe to call the function even
> on another running tasks but the returned address is not reliable
> then.
>

Oh, I forgot that. Thanks for pointing it out.

>> do {
>> ret = __rethook_find_ret_addr(tsk, cur);
>> if (!ret)
>
> I am still a bit concerned about the motivation.
>
> Tengda mentioned at
> https://lore.kernel.org/all/679a1c8f-1e4d-4ae5-83e1-d0068e6de1a6@xxxxxxxxxxxxxxx/
> that they tried to verify livepatching:
>
> <paste>
> Background: We are verifying the support of live patches for functions that
> have a kretprobe. The specific verification method is as follows:
>
> We construct a function foo() that calls bar():
>
> void bar(void)
> {
> for (;;) {
> schedule();
> }
> }
>
> void foo(void)
> {
> bar();
> }
>
> A kretprobe is attached to bar():
>
> echo 'r:rp1 bar' > /sys/kernel/tracing/kprobe_events
> echo 1 > /sys/kernel/tracing/events/kprobes/rp1/enable
>
> Then foo() is triggered. The expected behavior is that bar() will call
> schedule() and yield the CPU.
>
> After that, the live patch is activated to attempt replacing the implementation
> of foo(). The expectation is that this should succeed.
>
> However, in reality, because the task that called schedule() is still in the
> RUNNING state, the condition task_is_running(tsk) inside rethook_find_ret_addr()
> is not satisfied, causing the function to return early. This, in turn,
> prevents stack_trace_save_tsk_reliable() from determining the stack as
> reliable, leading to a failure in activating the live patch.
>
> **Not sure if this is correct:**
>
> We believe that after a task voluntarily calls schedule(), when the stack
> is expected to be reliable, it is a safe time to activate a live patch.
> Additionally, a similar tsk->on_cpu check can be found elsewhere in the
> kernel (See task_on_another_cpu() in arch/x86/include/asm/unwind.h).
> Therefore, we propose changing the task_is_running(tsk) condition to
> tsk->on_cpu.
> </paste>
>
> More background:
> ----------------
>
> The test is artificial because it keeps the RUNNING state before
> calling schedule, see
> https://lore.kernel.org/all/20260608093449.GH4149641@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
>
>
>
> My questions:
>
> Does this patch allows to livepatch the above mentioned test code?

At least it will no longer be blocked by the rethook check.

> Is the livepatching safe?
> Does it help in another scenarios?
>
>
> My opinion:
>
> The livepatching might be safe only when the process is migrating
> itself. I mean that it might be safe even when it is RUNNING as long
> at it is _current_.
>
> I agree that we do not need to enforce this in rethook_find_ret_addr()
> if the function is used also in other scenarios, for example, by
> ftrace/BTF for taking snapshots of other processes.
>
> But we need to make sure that the backtrace is reliable when
> livepatching (migrating) the task.
>
> Best Regards,
> Petr

Peter actually touched on this point earlier:

<paste>
Things like live-patch use task_call_func(), which ensures the callback
function is done while holding sufficient locks for the task to not
change state.
</paste>

>From my understanding, removing this check should not introduce
stack reliability issues for livepatch.

Thanks,
Tengda