Re: [PATCH v4] rethook: Remove the running task check in rethook_find_ret_addr()

From: Tengda Wu

Date: Thu Jun 11 2026 - 21:55:26 EST

Hi Xiao,

Thank you very much for your detailed analysis and verification.

On 2026/6/11 23:53, XIAO WU wrote:
> Hi Tengda,
>
> Sashiko [1] reviewed this patch and found that removing the
> task_is_running() check exposes stack unwinders to real crashes — not
> just "invalid information." A PoC confirms this: a KASAN panic triggers
> within seconds when /proc/<pid>/stack reads the stack of a task that is
> concurrently running a kretprobe.
>
> [1] https://sashiko.dev/#/patchset/20260610013658.1837963-1-wutengda%40huaweicloud.com
>
>> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
>> index 5a8bdf88999a..1e7fdebe3cd5 100644
>> --- a/kernel/trace/rethook.c
>> +++ b/kernel/trace/rethook.c
>> @@ -250,9 +251,6 @@ unsigned long rethook_find_ret_addr(struct task_struct *tsk, unsigned long frame
>> if (WARN_ON_ONCE(!cur))
>> return 0;
>>
>> - if (tsk != current && task_is_running(tsk))
>> - return 0;
>> -
>> do {
>> ret = __rethook_find_ret_addr(tsk, cur);
>> if (!ret)
>
> The commit message states:
>
>> The iteration is already safe from crashes because
>> unwind_next_frame() holds RCU and rethook_node structures are
>> RCU-freed; even if the iteration goes off the rails and returns
>> invalid information, it will not crash.
>
> There are two problems with this claim, both reproducible.
>
> **Problem 1: stack-out-of-bounds in unwind_next_frame itself**
>
> The PoC below reliably triggers the following KASAN panic — not in the
> rethook list traversal, but inside unwind_next_frame():
>
> [ 1833.494623] BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x861/0x2080
> [ 1833.494651] Read of size 2 at addr ffffc90003e6f5f0 by task poc/9854
> [ 1833.494707] Call Trace:
> [ 1833.494719] dump_stack_lvl+0x116/0x1f0
> [ 1833.494743] print_report+0xf4/0x600
> [ 1833.494788] kasan_report+0xe0/0x110
> [ 1833.494836] unwind_next_frame+0x861/0x2080
> [ 1833.494948] arch_stack_walk+0x99/0x100
> [ 1833.495000] stack_trace_save_tsk+0x16a/0x200
> [ 1833.495054] proc_pid_stack+0x173/0x2b0
> [ 1833.495103] seq_read_iter+0x519/0x12d0
> [ 1833.495166] seq_read+0x3b7/0x590
> [ 1833.495297] vfs_read+0x1f5/0xd20
> [ 1833.495497] ksys_read+0x135/0x250
> [ 1833.495549] do_syscall_64+0x129/0x850
> [ 1833.495566] entry_SYSCALL_64_after_hwframe+0x77/0x7f
> [ 1833.498894] Kernel panic - not syncing: KASAN: panic_on_warn set ...
>
> page last free pid 9737 tgid 9737 stack trace:
> do_sys_openat2+0xbf/0x260 <-- target task inside kretprobe
> __x64_sys_openat+0x179/0x210
>
> This crash has nothing to do with rethook_node lifetimes or RCU. It
> happens because the ORC unwinder reads stack memory while the target
> task concurrently executes a kretprobe trampoline that modifies return
> addresses. The unwinder follows corrupted frame data past valid stack
> boundaries. RCU protection of rethook_node structures is irrelevant —
> this crash occurs at the stack frame interpretation level, before any
> rethook list traversal.
>
> The old task_is_running() check prevented the unwinder from attempting
> to unwind a running task's stack in the first place.
>

Yes, I was able to reproduce the issue locally using your PoC. The problem
does exist as you described. I need to take a deeper look and figure out how
to properly fix it.

> **Problem 2: use-after-free via rethook_node recycling**
>
> Even if the stack-out-of-bounds above were addressed, a second crash
> path exists in the rethook list traversal itself.
>
> rethook_recycle() immediately pushes nodes back to the objpool without
> an RCU grace period:
>
> kernel/trace/rethook.c:
> void rethook_recycle(struct rethook_node *node)
> {
> ...
> objpool_push(node, &node->rethook->pool);
> }
>
> Meanwhile, unwind_next_frame() in arch/x86/kernel/unwind_orc.c drops
> RCU between frames while the cursor (*cur) persists across iterations:
>
> arch/x86/kernel/unwind_orc.c:
> bool unwind_next_frame(...)
> {
> ...
> guard(rcu)(); // RCU held for one frame
> ...
> } // RCU dropped here
>
> When the unwinder calls __rethook_find_ret_addr() in the next frame
> iteration, it does:
>
> struct llist_node *first = tsk->rethooks.first;
> ...
> *cur = first;
> ...
> node = node->next; // node may have been recycled
>
> If the target task returns from a probed function between frames, its
> rethook_node is recycled and can be instantly reallocated to another
> task. The unwinder's stale cursor then dereferences a freed pointer,
> leading to use-after-free.
>

Yes, Sashiko also pointed this out. You have opened this issue for further
analysis to clarify its fault model. It appears to be a pre-existing issue
that may require a separate patch to resolve.

> ## Reproducer
>
> The PoC sets up a kretprobe on do_sys_openat2, creates hot-loop threads
> calling open(), and concurrently reads /proc/<tid>/stack. The race
> triggers within seconds (Problem 1 above; Problem 2 may reproduce on
> kernels without KASAN or with different timing).
>
> Build: gcc -static -pthread -o poc poc.c
> Run: ./poc [runtime_seconds]
> Needs: root, CONFIG_KASAN=y
>
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/wait.h>
> #include <sys/syscall.h>
> #include <sched.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <signal.h>
> #include <pthread.h>
> #include <dirent.h>
>
> #define TRACE "/sys/kernel/tracing"
>
> volatile int stop = 0;
>
> static int tfs(const char *f, const char *b)
> {
> char p[256]; int fd, r;
> snprintf(p, 256, "%s/%s", TRACE, f);
> fd = open(p, O_WRONLY | O_TRUNC);
> if (fd < 0) {
> system("mount -t tracefs tracefs /sys/kernel/tracing 2>/dev/null");
> usleep(50000);
> fd = open(p, O_WRONLY | O_TRUNC);
> }
> if (fd < 0) return -1;
> r = write(fd, b, strlen(b));
> close(fd);
> return r < 0 ? -1 : 0;
> }
>
> void *hot_thread(void *arg)
> {
> while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
> int fd = open("/dev/null", O_RDONLY);
> if (fd >= 0) close(fd);
> }
> return NULL;
> }
>
> void *reader_thread(void *arg)
> {
> pid_t target = *(pid_t *)arg;
> char path[64], buf[8192];
> snprintf(path, 64, "/proc/%d/stack", target);
> while (!__atomic_load_n(&stop, __ATOMIC_RELAXED)) {
> int fd = open(path, O_RDONLY);
> if (fd >= 0) { read(fd, buf, 8191); close(fd); }
> }
> return NULL;
> }
>
> void sigh(int s) { stop = 1; }
>
> int main(int argc, char *argv[])
> {
> int runtime = 120;
> if (argc > 1) runtime = atoi(argv[1]);
>
> printf("rethook race PoC\n");
> if (geteuid()) { printf("root needed\n"); return 1; }
> signal(SIGINT, sigh);
>
> pthread_t hot[4], rdr[4];
> pid_t hot_tids[4];
> int pairs = 4;
>
> for (int c = 0; c < runtime / 5 && !stop; c++) {
> tfs("events/kprobes/myretprobe/enable", "0");
> tfs("kprobe_events", "-:myretprobe");
> usleep(100);
> tfs("kprobe_events", "r:myretprobe do_sys_openat2 $retval");
> tfs("events/kprobes/myretprobe/enable", "1");
>
> pid_t main_tid = syscall(SYS_gettid);
>
> for (int i = 0; i < pairs; i++)
> pthread_create(&hot[i], NULL, hot_thread, NULL);
>
> usleep(300000);
>
> {
> DIR *d = opendir("/proc/self/task");
> int cnt = 0;
> if (d) {
> struct dirent *de;
> while ((de = readdir(d)) != NULL && cnt < pairs) {
> pid_t t = atoi(de->d_name);
> if (t > 0 && t != main_tid)
> hot_tids[cnt++] = t;
> }
> closedir(d);
> }
> for (int i = 0; i < cnt; i++)
> pthread_create(&rdr[i], NULL, reader_thread, &hot_tids[i]);
> }
>
> printf("round %d\n", c);
> sleep(5);
>
> stop = 1;
> usleep(100000);
>
> for (int i = 0; i < pairs; i++) pthread_join(hot[i], NULL);
> for (int i = 0; i < pairs; i++) pthread_join(rdr[i], NULL);
>
> stop = 0;
> usleep(1000);
> }
>
> tfs("events/kprobes/myretprobe/enable", "0");
> tfs("kprobe_events", "-:myretprobe");
> printf("Done\n");
> return 0;
> }
>
> ## Summary
>
> The v4 commit message claims the iteration "will not crash," but the PoC
> demonstrates a reproducible KASAN panic:
>
> 1. stack-out-of-bounds in unwind_next_frame (ORC unwinder reads
> concurrently-modified stack frames of a running task)
>
> 2. Potential use-after-free in __rethook_find_ret_addr (rethook nodes
> recycled without RCU grace period, cursor persists across RCU drops)
>
> The old task_is_running() check was racy but served as a practical
> safety net. Removing it without adding equivalent protection in the
> callers (proc_pid_stack, BPF stack walkers) exposes users to kernel
> panics via /proc/<pid>/stack on any task running a kretprobe.

Once again, I truly appreciate your thorough review and testing.
I'm not sure if I can fully resolve these issues, but if I succeed, I will
send out a v5.

Best regards,
Tengda