Re: [PATCH] net: skmsg: pin the delayed-work psock in sk_psock_backlog

From: Cen Zhang

Date: Fri May 15 2026 - 04:18:56 EST


Dear Jiayuan Chen

Jiayuan Chen <jiayuan.chen@xxxxxxxxx> wrote on Fri, May 15, 2026 at 14:10:
> Where is the 'last_old_ref_before_put' symbol from? I can't find it
> anywhere in the tree.
>
> If you are using LLMs to dig into races like this, please also have them
> produce a reproducer, e.g. patch mdelay() into the relevant windows to
> widen them, then trigger it from userspace.

Hi Jiayuan,

Thanks for checking this. You are right: last_old_ref_before_put is
not an in-tree kernel symbol. It was a temporary validation probe
label which recorded the old psock refcount immediately before the
backlog worker's final put, and it should not have appeared in the
commit message as if it were kernel output.

The in-tree path I was trying to describe is:

sk_psock_backlog() starts at net/core/skmsg.c:670.
get path: sk_psock_get(psock->sk), net/core/skmsg.c:692.
put path: sk_psock_put(psock->sk, psock), net/core/skmsg.c:746.
detach clears sk_user_data at net/core/skmsg.c:892.
reattach publishes a replacement psock at net/core/skmsg.c:793.
warning path: REFCOUNT_SUB_UAF at lib/refcount.c:28.

The trigger was based on the in-tree sockmap_redir BPF selftest
under tools/testing/selftests/bpf/prog_tests/.

The one-shot test used AF_UNIX SOCK_STREAM socket pairs, attached
the sk_skb verdict program to the input map, inserted one socket
into the input map and one destination socket into the sockmap at
key 0, then sent one byte through the input peer so the destination
psock backlog worker was queued.

For validation I used a temporary local instrumentation patch in
net/core/skmsg.c. It added a debugfs-controlled gate in
sk_psock_backlog() after the TX-enabled check and before the
existing sk_psock_get(psock->sk) call, plus counters and pr_info()
snapshots in sk_psock_backlog(), sk_psock_init() and
sk_psock_drop(). It also stored the pointer returned by
sk_psock_get(psock->sk) for logging. The worker still used the
existing get path and the existing sk_psock_put(psock->sk, psock)
exit path.

With the worker parked before sk_psock_get(psock->sk), the test
forked: the child deleted the destination sockmap entry, and the
parent retried a BPF_NOEXIST update of the same key with the same
destination socket fd until reattach succeeded.

After both the delete and the reattach had completed, the test
released the old worker. At
that point sk->sk_user_data referred to the replacement psock, while
the delayed work still belonged to the old psock. The recorded state
before the warning had the sk_user_data psock and the psock returned
by sk_psock_get(psock->sk) equal to each other, but different from
the delayed-work container. The instrumentation was only used to make
that interleaving deterministic and observable. The warning below is
the kernel's normal refcount warning path.

The native kernel report from that run was:

refcount_t: underflow; use-after-free.
WARNING: lib/refcount.c:28 at refcount_warn_saturate+0xbf/0xf0
Workqueue: events sk_psock_backlog
RIP: 0010:refcount_warn_saturate+0xbf/0xf0
Call trace:
sk_psock_backlog() (net/core/skmsg.c:670)
process_one_work() (kernel/workqueue.c:3200)

So the reproducer is instrumentation-assisted, not an
unmodified upstream selftest. The instrumentation can widen the
race window and record the participating psock pointers, but it
does not publish a replacement psock, clear sk->sk_user_data, or
add an extra put on the old psock. The final warning is reached
through the existing sk_psock_put(psock->sk, psock) path after
the test has forced delete-plus-reattach to happen before the
parked worker resumes.
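
As a rough illustration of the interleaving above, a minimal
userspace model might look like the sketch below. It is not kernel
code: every model_* name is hypothetical, the refcount is a plain int
rather than refcount_t, and the delete/reattach steps are collapsed
into single assignments. It only shows why a get that goes through
the socket and a put that names the work's own psock can end up
touching two different objects.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only. These types and helpers are hypothetical stand-ins,
 * not the kernel's struct sk_psock, sk_psock_get() or sk_psock_put(). */
struct model_sock;

struct model_psock {
	int refcnt;               /* stands in for the psock refcount */
	struct model_sock *sk;    /* back-pointer, like psock->sk */
};

struct model_sock {
	struct model_psock *user_data;  /* stands in for sk->sk_user_data */
};

struct model_result {
	int got_replacement;   /* 1 if the get returned the replacement psock */
	int put_underflowed;   /* 1 if the put hit an already-zero count */
	int old_refcnt;        /* final count on the old psock */
	int new_refcnt;        /* final count on the replacement psock */
};

/* Take a reference on whatever psock the socket publishes right now. */
static struct model_psock *model_get(struct model_sock *sk)
{
	struct model_psock *p = sk->user_data;

	if (!p || p->refcnt == 0)
		return NULL;
	p->refcnt++;
	return p;
}

/* Drop one reference; return 1 where the real code would report underflow. */
static int model_put(struct model_psock *p)
{
	if (p->refcnt == 0)
		return 1;
	p->refcnt--;
	return 0;
}

struct model_result model_race(void)
{
	struct model_sock sk = { NULL };
	struct model_psock old_psock = { 1, &sk };
	struct model_psock new_psock = { 0, &sk };
	struct model_result res = { 0, 0, 0, 0 };
	struct model_psock *got;

	sk.user_data = &old_psock;  /* initial attach; work is queued on old_psock */

	/* The worker is parked here, before its get. */

	model_put(&old_psock);      /* delete: the last reference is dropped */
	sk.user_data = NULL;        /* ...and detach clears the published pointer */

	new_psock.refcnt = 1;       /* reattach publishes a replacement psock */
	sk.user_data = &new_psock;
	assert(sk.user_data == &new_psock);

	/* The worker resumes: the get goes through the socket and finds the
	 * replacement, while the put still names the work's own (old) psock. */
	got = model_get(old_psock.sk);
	res.got_replacement = (got == &new_psock);
	res.put_underflowed = model_put(&old_psock);

	res.old_refcnt = old_psock.refcnt;
	res.new_refcnt = new_psock.refcnt;
	return res;
}
```

Calling model_race() from a small main() and printing the fields
shows, within this model, the get landing on the replacement psock
and the put hitting the old psock's already-zero count. The model
also leaves the replacement holding the extra reference taken by the
worker's get.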

I will send v2 as a new thread after the netdev 24-hour
interval, with the lab probe label removed from the commit text.
If useful, I can also share the small instrumentation/selftest
diff separately to show the exact widened window.

Thanks,
Zhang Cen