Re: [PATCH bpf-next v2 17/26] rqspinlock: Hardcode cond_acquire loops to asm-generic implementation
From: Kumar Kartikeya Dwivedi
Date: Fri Feb 07 2025 - 22:05:27 EST
On Sat, 8 Feb 2025 at 02:58, Alexei Starovoitov
<alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Thu, Feb 6, 2025 at 2:55 AM Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx> wrote:
> >
> > Currently, for rqspinlock usage, the implementations of
> > smp_cond_load_acquire (and thus, atomic_cond_read_acquire) are
> > susceptible to stalls on arm64, because they do not guarantee that the
> > conditional expression will be repeatedly evaluated if the address being
> > loaded from is not written to by other CPUs. When event-stream support
> > (which unblocks stuck WFE-based loops every ~100us) is absent, we may
> > end up stuck forever.
> >
> > This causes a problem for us, as we need to repeatedly invoke
> > RES_CHECK_TIMEOUT() in the spin loop to break out when the timeout
> > expires.
> >
> > Hardcode the implementation to the asm-generic version in rqspinlock.c
> > until support for smp_cond_load_acquire_timewait [0] lands upstream.
> >
> > [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@xxxxxxxxxx
> >
> > Cc: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
> > Signed-off-by: Kumar Kartikeya Dwivedi <memxor@xxxxxxxxx>
> > ---
> > kernel/locking/rqspinlock.c | 41 ++++++++++++++++++++++++++++++++++---
> > 1 file changed, 38 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/locking/rqspinlock.c b/kernel/locking/rqspinlock.c
> > index 49b4f3c75a3e..b4cceeecf29c 100644
> > --- a/kernel/locking/rqspinlock.c
> > +++ b/kernel/locking/rqspinlock.c
> > @@ -325,6 +325,41 @@ int __lockfunc resilient_tas_spin_lock(rqspinlock_t *lock, u64 timeout)
> > */
> > static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[_Q_MAX_NODES]);
> >
> > +/*
> > + * Hardcode smp_cond_load_acquire and atomic_cond_read_acquire to the
> > + * asm-generic implementation. In rqspinlock code, our conditional
> > + * expression involves checking the value _and_ a timeout. However, on
> > + * arm64, the WFE-based implementation may never re-evaluate the condition
> > + * if no stores occur to the locked byte in the lock word. As such, we may
> > + * be stuck forever if event-stream based unblocking of WFE spin loops is
> > + * not available on the platform (arch_timer_evtstrm_available).
> > + *
> > + * Once support for smp_cond_load_acquire_timewait [0] lands, we can drop this
> > + * workaround.
> > + *
> > + * [0]: https://lore.kernel.org/lkml/20250203214911.898276-1-ankur.a.arora@xxxxxxxxxx
> > + */
>
> It's fine as a workaround for now to avoid being blocked
> on Ankur's set (which will go via a different tree too),
> but in v3 pls add an extra patch that demonstrates the final result
> with WFE stuff working as designed, without amortizing
> in the RES_CHECK_TIMEOUT() macro.
> Guessing RES_CHECK_TIMEOUT will have some ifdef to handle that case?
Yes, or we can pass in the check_timeout expression directly. I'll
make the change in v3.
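
For context, the v2 workaround essentially open-codes the asm-generic
loop so that the conditional expression (which embeds the timeout
check) is re-evaluated on every iteration instead of sleeping in WFE.
Roughly along these lines (macro names here are only illustrative, not
necessarily what the patch ends up using):

/*
 * Illustrative sketch: mirrors the asm-generic smp_cond_load_acquire()
 * fallback, spinning with cpu_relax() so that cond_expr runs on every
 * iteration regardless of whether stores hit the watched address.
 */
#define res_smp_cond_load_acquire(ptr, cond_expr) ({			\
	typeof(ptr) __PTR = (ptr);					\
	__unqual_scalar_typeof(*ptr) VAL;				\
	for (;;) {							\
		VAL = READ_ONCE(*__PTR);				\
		if (cond_expr)						\
			break;						\
		cpu_relax();						\
	}								\
	/* ACQUIRE ordering for the load above, as in asm-generic */	\
	smp_acquire__after_ctrl_dep();					\
	(typeof(*ptr))VAL;						\
})

#define res_atomic_cond_read_acquire(v, c)				\
	res_smp_cond_load_acquire(&(v)->counter, (c))

Since the acquire barrier is applied after the control dependency, the
semantics should match the asm-generic smp_cond_load_acquire(), while
never entering WFE, so the timeout check cannot be starved.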