Re: Unfair qspinlocks on ARM64 without LSE atomics => 3ms delay in interrupt handling

From: Will Deacon
Date: Fri Mar 24 2023 - 14:09:20 EST


On Fri, Mar 24, 2023 at 05:01:28PM +0000, Catalin Marinas wrote:
> On Fri, Mar 24, 2023 at 08:43:38AM +0000, Bouska, Zdenek wrote:
> > I have seen ~3 ms delay in interrupt handling on ARM64.
> >
> > I have traced it down to raw_spin_lock() call in handle_irq_event() in
> > kernel/irq/handle.c:
> >
> > irqreturn_t handle_irq_event(struct irq_desc *desc)
> > {
> > irqreturn_t ret;
> >
> > desc->istate &= ~IRQS_PENDING;
> > irqd_set(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> > raw_spin_unlock(&desc->lock);
> >
> > ret = handle_irq_event_percpu(desc);
> >
> > --> raw_spin_lock(&desc->lock);
> > irqd_clear(&desc->irq_data, IRQD_IRQ_INPROGRESS);
> > return ret;
> > }
> >
> > It took ~3 ms for this raw_spin_lock() to lock.
>
> That's quite a large indeed.
>
> > During this time irq_finalize_oneshot() from kernel/irq/manage.c locks and
> > unlocks the same raw spin lock more than 1000 times:
> >
> > static void irq_finalize_oneshot(struct irq_desc *desc,
> > struct irqaction *action)
> > {
> > if (!(desc->istate & IRQS_ONESHOT) ||
> > action->handler == irq_forced_secondary_handler)
> > return;
> > again:
> > chip_bus_lock(desc);
> > --> raw_spin_lock_irq(&desc->lock);
> >
> > /*
> > * Implausible though it may be we need to protect us against
> > * the following scenario:
> > *
> > * The thread is faster done than the hard interrupt handler
> > * on the other CPU. If we unmask the irq line then the
> > * interrupt can come in again and masks the line, leaves due
> > * to IRQS_INPROGRESS and the irq line is masked forever.
> > *
> > * This also serializes the state of shared oneshot handlers
> > * versus "desc->threads_oneshot |= action->thread_mask;" in
> > * irq_wake_thread(). See the comment there which explains the
> > * serialization.
> > */
> > if (unlikely(irqd_irq_inprogress(&desc->irq_data))) {
> > --> raw_spin_unlock_irq(&desc->lock);
> > chip_bus_sync_unlock(desc);
> > cpu_relax();
> > goto again;
> > }
>
> So this path is hammering the desc->lock location and another CPU cannot
> change it. As you found, the problem is not the spinlock algorithm but
> the atomic primitives. The LDXR/STXR constructs on arm64 are known to
> have this issue with STXR failing indefinitely. raw_spin_unlock() simply
> does an STLR and this clears the exclusive monitor that the other CPU
> may have set with LDXR but before the STXR. The queued spinlock only
> provides fairness if the CPU manages to get in the queue.
>
> > So I confirmed that atomic operations from
> > arch/arm64/include/asm/atomic_ll_sc.h can be quite slow when they are
> > contested from second CPU.
> >
> > Do you think that it is possible to create fair qspinlock implementation
> > on top of atomic instructions supported by ARM64 version 8 (no LSE atomic
> > instructions) without compromising performance in the uncontested case?
> > For example ARM64 could have custom queued_fetch_set_pending_acquire
> > implementation same as x86 has in arch/x86/include/asm/qspinlock.h. Is the
> > retry loop in irq_finalize_oneshot() ok together with the current ARM64
> > cpu_relax() implementation for processor with no LSE atomic instructions?
>
> So is the queued_fetch_set_pending_acquire() where it gets stuck or the
> earlier atomic_try_cmpxchg_acquire() before entering on the slow path? I
> guess both can fail in a similar way.
>
> A longer cpu_relax() here would improve things (on arm64 this function
> is a no-op) but maybe Thomas or Will have a better idea.

I had a pretty gross cpu_relax() implementation using wfe somewhere on
LKML, so you could try that if you can dig it up.

Generally though, LDXR/STXR and realtime don't mix super well.

Will