Re: [PATCH bpf-next v2 00/26] Resilient Queued Spin Lock
From: Alexei Starovoitov
Date: Tue Mar 04 2025 - 22:26:37 EST
On Tue, Mar 4, 2025 at 2:46 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
>
> Anyway; the situation I was thinking of was something along the lines
> of: you need data from 2 buckets, so you need to lock 2 buckets, but
> since hash-table, there is no sane order, so you need a 3rd lock to
> impose order.
Not quite. This is a typical request: allow locking two buckets,
and the solution to that is a run-time check on the lock addresses.
> > q1 = bpf_map_lookup_elem(&vqueue, &key1);
> > q2 = bpf_map_lookup_elem(&vqueue, &key2);
> >
> > both will point to two different locks,
> > and since the key is dynamic there is no way to know
> > the order of q1->lock vs q2->lock.
>
> I still feel like I'm missing things, but while they are two dynamic
> locks, they are both locks of vqueue object. What lockdep does is
> classify locks by initialization site (by default). Same can be done
> here, classify per dynamic object.
>
> So verifier can know the above is invalid. Both locks are same class, so
> treat as A-A order (trivial case is where q1 and q2 are in fact the same
> object since the keys hash the same).
Sounds like you're saying that the verifier should reject the case
where two locks of the same class, like q1->lock and q2->lock,
need to be taken?
But that is one of the use cases where people asked us to
allow multiple locks.
The typical solution is to order the locks by address at run-time,
and nf_conntrack_double_lock() in net/netfilter/nf_conntrack_core.c
does exactly that:
if (lock1 < lock2) {
        spin_lock(lock1);
        spin_lock(lock2);
} else {
        spin_lock(lock2);
        spin_lock(lock1);
}
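In a bpf program the same pattern would look roughly like below.
Just a sketch, reusing the q1/q2 vqueue example from above with a
made-up value struct; it assumes bpf_res_spin_lock() returns non-zero
when AA/ABBA or a timeout is detected, and glosses over the exact
pointer rules the verifier enforces:

struct vqueue_val *q1, *q2, *tmp;

q1 = bpf_map_lookup_elem(&vqueue, &key1);
q2 = bpf_map_lookup_elem(&vqueue, &key2);
if (!q1 || !q2)
        return 0;
/* impose a stable order on the two dynamic locks by comparing addresses */
if (q2 < q1) {
        tmp = q1;
        q1 = q2;
        q2 = tmp;
}
if (bpf_res_spin_lock(&q1->lock))
        return 0;                       /* AA/ABBA/timeout reported, back off */
if (q1 != q2 && bpf_res_spin_lock(&q2->lock)) {
        bpf_res_spin_unlock(&q1->lock);
        return 0;
}
/* ... work on both buckets ... */
if (q1 != q2)
        bpf_res_spin_unlock(&q2->lock);
bpf_res_spin_unlock(&q1->lock);

Whether q1 and q2 come from the same map or from two different maps
doesn't matter; the order is established at run-time.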
> Now, going back to 3rd lock, if instead you write it like:
>
> bpf_spin_lock(&glock);
> q1 = bpf_map_lookup_elem(&vqueue, &key1);
> q2 = bpf_map_lookup_elem(&vqueue, &key2);
> ...
> bpf_spin_unlock(&glock);
>
> then (assuming q1 != q2) things are fine, since glock will serialize
> everybody taking two vqueue locks.
>
> And the above program snippet seems to imply maps are global state, so
Not quite. Some maps are global, but there are dynamic maps too.
That's what map-in-map is for.
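For completeness, the map-in-map shape on the program side is roughly
the following (a generic libbpf-style sketch, not code from this
series; map names and sizes are made up):

struct inner_hash {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u32);
        __type(value, __u64);
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
        __uint(max_entries, 64);
        __type(key, __u32);
        __array(values, struct inner_hash);
} outer SEC(".maps");

/* which inner map (and therefore which locks) is used is only known at run-time */
void *inner = bpf_map_lookup_elem(&outer, &outer_key);
if (inner)
        val = bpf_map_lookup_elem(inner, &inner_key);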
> you can keep lock graph of maps, such that:
>
> bpf_map_lookup_elem(&map-A, &key-A);
> bpf_map_lookup_elem(&map-B, &key-B);
>
> vs
>
> bpf_map_lookup_elem(&map-B, &key-B);
> bpf_map_lookup_elem(&map-A, &key-A);
>
> trips AB-BA
If everything were static and the _keys_ were known statically too,
then yes, such analysis by the verifier would be possible.
But both maps and keys are dynamic.
Note, so that the above example doesn't confuse people:
the bpf_map_lookup_elem() lookup itself is completely lockless,
so there is nothing wrong with the above sequence as written.
Only when the program then takes both locks:
q1 = bpf_map_lookup_elem(&map-A, &key-A);
q2 = bpf_map_lookup_elem(&map-B, &key-B);
if (!bpf_res_spin_lock(&q1->lock))
        if (!bpf_res_spin_lock(&q2->lock))
does a deadlock become a possibility.
Both maps and keys are only known at run-time,
so the locking logic has to do run-time checks too.
> I am not at all sure how res_spin_lock is helping with the q1,q2 thing.
> That will trivially result in lock cycles.
Right, and AA or ABBA will be instantly detected at run-time.
> And you said any program that would trigger deadlock is invalid.
> Therefore the q1,q2 example from above is still invalid and
> res_spin_lock has not helped.
res_spin_lock will do its job and prevent the deadlock.
As we explained earlier, such a program will be marked as broken
and will be detached/stopped by the bpf infra.
Also, we're talking about root privileges here;
none of this is allowed in unpriv.
> > Just to make it clear... there is a patch 18:
> >
> > F: kernel/bpf/
> > F: kernel/trace/bpf_trace.c
> > F: lib/buildid.c
> > +F: arch/*/include/asm/rqspinlock.h
> > +F: include/asm-generic/rqspinlock.h
> > +F: kernel/locking/rqspinlock.c
> > F: lib/test_bpf.c
> > F: net/bpf/
> >
> > that adds maintainer entries to BPF scope.
> >
> > We're not asking locking experts to maintain this new res_spin_lock.
> > It's not a generic kernel infra.
> > It will only be used by bpf infra and by bpf progs.
> > We will maintain it and we will fix whatever bugs
> > we introduce.
>
> While that is appreciated, the whole kernel is subject to the worst case
> behaviour of this thing. As such, I feel I need to care.
Not sure why you're trying to relitigate years' worth of
discussions around locks in the bpf community.
Static analysis of 2+ locks by the verifier is impossible.
Full lockdep-style lock-graph cycle detection is too slow at run-time.
Hence res_spin_lock, with AA and ABBA detection plus a timeout as a
last resort, is our solution to real reported bugs.
This res_spin_lock patchset fixes the following syzbot reports:
https://lore.kernel.org/bpf/675302fd.050a0220.2477f.0004.GAE@xxxxxxxxxx
https://lore.kernel.org/bpf/000000000000b3e63e061eed3f6b@xxxxxxxxxx
https://lore.kernel.org/bpf/CAPPBnEa1_pZ6W24+WwtcNFvTUHTHO7KUmzEbOcMqxp+m2o15qQ@xxxxxxxxxxxxxx
https://lore.kernel.org/bpf/CAPPBnEYm+9zduStsZaDnq93q1jPLqO-PiKX9jy0MuL8LCXmCrQ@xxxxxxxxxxxxxx
https://lore.kernel.org/lkml/000000000000adb08b061413919e@xxxxxxxxxx
It fixes real issues: some of them have hacky workarounds today,
some are not fixed yet.
More syzbot reports will be fixed in follow-ups when we
adopt res_spin_lock in other parts of the bpf infra.
Note, none of the above syzbot reports use explicit locks
inside the bpf programs. All of them hit proper kernel
spin_locks inside the bpf infra (inside map implementations and such),
so the verifier cannot do anything about them. The syzbot-generated
programs are trivial: they do one bpf_map_update_elem() call or similar,
but they attach to tricky tracepoints like trace_contention_begin
or to points deep inside the bpf infra.
We already have these workarounds:
CFLAGS_REMOVE_percpu_freelist.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_bpf_lru_list.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_queue_stack_maps.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_lpm_trie.o = $(CC_FLAGS_FTRACE)
CFLAGS_REMOVE_ringbuf.o = $(CC_FLAGS_FTRACE)
to prevent recursion anywhere in these files,
but it only helps so much.
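With res_spin_lock those internal locks can fail gracefully instead
of spinning forever. Roughly, the intended in-kernel usage looks like
this (a sketch; it assumes the raw_res_spin_lock_irqsave() /
raw_res_spin_unlock_irqrestore() interface from patches 1-18 and a
hypothetical bucket structure):

struct bucket {
        rqspinlock_t raw_lock;          /* hypothetical map-bucket lock */
        /* ... */
};

static int update_bucket(struct bucket *b)
{
        unsigned long flags;
        int ret;

        /* returns an error on detected AA/ABBA or timeout instead of hanging */
        ret = raw_res_spin_lock_irqsave(&b->raw_lock, flags);
        if (ret)
                return ret;

        /* ... modify the bucket ... */

        raw_res_spin_unlock_irqrestore(&b->raw_lock, flags);
        return 0;
}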
So please take a look at patches 1-18 and help us make sure
we implemented the AA, ABBA, and timeout logic without obvious bugs.
I think we did, but extra review would be great.
Thanks!