Re: [PATCH bpf] bpf, sockmap: Fix af_unix null-ptr-deref in proto update

From: Martin KaFai Lau

Date: Mon Feb 02 2026 - 14:15:50 EST


On 1/31/26 2:06 AM, Kuniyuki Iwashima wrote:
On Fri, Jan 30, 2026 at 1:30 PM Martin KaFai Lau <martin.lau@xxxxxxxxx> wrote:

On 1/30/26 3:00 AM, Michal Luczaj wrote:
Follow-up to discussion at
https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@xxxxxxxxxx/.

It is a long thread to dig. Please summarize the discussion in the
commit message.

OK, there we go:

The root cause of the null-ptr-deref is that unix_stream_connect() sets
sk_state (`WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)`) _before_ it assigns
a peer (`unix_peer(sk) = newsk`). sk_state == TCP_ESTABLISHED makes
sock_map_sk_state_allowed() believe that socket is properly set up, which
would include having a defined peer.

In other words, there's a window when you can call
unix_stream_bpf_update_proto() on socket which still has unix_peer(sk) == NULL.

My initial idea was to simply move peer assignment _before_ the sk_state
update, but the maintainer wasn't interested in changing the
unix_stream_connect() hot path. He suggested taking care of it in the
sockmap code.

Yes, we already have a memory barrier for unix_peer(sk) there
(to save sock_hold()/sock_put() in sendmsg(), see 830a1e5c212fb)
and another one just for sk->sk_state is not worth the unlikely
case in sockmap by a buggy user.



My understanding is that users are not supposed to put sockets in a sockmap
when said socket is only half-way through connect() call. Hence `return
-EINVAL` on a missing peer. Now, if users should be allowed to legally race
connect() vs. sockmap update, then I guess we can wait for connect() to
"finalize" e.g. by taking the unix_state_lock(), as discussed below.

If a user hit the issue, the user must have updated sockmap while the
user knows connect() had not returned. Such a user must prepare
for failures since it could occur before sock_map_sk_state_allowed() too.

This is a subtle timing issue and I don't think the kernel should be
friendly to such buggy users by waiting for connect() etc.

I don't have a use case for parallel connect and map update either. Also, I have no concern about returning -EINVAL in map_update for the not-yet-completed unix_stream_connect().

However, TCP/UDP (and probably vsock also?) do not have this racing issue because sock_map follows the lock usage in TCP/UDP as in other parts of the kernel. Why AF_UNIX is an exception and unix_state_lock() is not used in sock_map.

John and Jakub Stinicki, could this be an oversight?


From looking at this commit message, if the existing lock_sock held by
update_elem is not useful for af_unix,

Right, the existing lock_sock is not useful. update's lock_sock holds
sock::sk_lock, while unix_state_lock() holds unix_sock::lock.

It sounds like lock_sock is the incorrect lock to hold for af_unix. Is
taking lock_sock in sock_map doing anything useful for af_unix? Should
sock_map hold the unix_state_lock instead of lock_sock?

If sockmap code does not sleep, unix_state_lock can be used there.

afaik, bpf prog using the sockmap cannot sleep. Search for "if (prog->sleepable)" in "check_map_prog_compatibility()". The bpf prog attached to sock_map cannot sleep either, there is "can_be_sleepable()" check in the verifier.


if (!psock->sk_pair) {
+ unix_state_lock(sk);
+ unix_state_unlock(sk);

I don't like this... we had a similar one in recvmsg(MSG_PEEK) path
for GC with a biiiiiig comment, which I removed in 118f457da9ed .

Me neither, for both empty critical section and big fat comment. This usually suggests something else is incorrect to begin with. I believe the wrong lock is taken in this case. Unless there is something prohibiting taking unix_state_lock in sock_map, fix it properly instead.



sk_pair = unix_peer(sk);
sock_hold(sk_pair);

I don't have a strong opinion on waiting or checking NULL. imo, both are
not easy to understand. One is sk_state had already been checked earlier
under a lock_sock but still needs to check NULL on unix_peer(). Another
one is an empty unix_state_[un]lock(). If taking unix_state_lock, may as
well just use the existing unix_peer_get(sk).

Yes, unix_peer_get() can be safely used there (with extra sock_put()).

The sock_map needs to sock_hold(sk_pair) anyway and stores it in psock->sk_pair, so I don't think it needs an extra sock_put().



If its return value cannot
(?) be NULL, WARN_ON_ONCE() instead of having a special empty

I suggested WARN_ON_ONCE() because Michal reproduced it with
mdelay() and I did not think it could occur in real life, but considering
PREEMPT_RT, it could be real. So, the current form in this patch looks
good to me.

hmm... If unix_peer_get(sk) is used and it takes (and waits for) the unix_state_lock, it shouldn't be NULL? The above empty [un]lock idea does not check for NULL on unix_peer() either. Or am I missing something?

Regardless, if the proper lock is held, all this complication and reasoning will go away.