Re: [PATCH v2 0/9] Remove spin_unlock_wait()

From: Paul E. McKenney
Date: Sat Jul 08 2017 - 10:46:27 EST


On Sat, Jul 08, 2017 at 02:30:19PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>
> > On Sat, Jul 08, 2017 at 10:35:43AM +0200, Ingo Molnar wrote:
> > >
> > > * Manfred Spraul <manfred@xxxxxxxxxxxxxxxx> wrote:
> > >
> > > > Hi Ingo,
> > > >
> > > > On 07/07/2017 10:31 AM, Ingo Molnar wrote:
> > > > >
> > > > > There's another, probably just as significant advantage: queued_spin_unlock_wait()
> > > > > is 'read-only', while spin_lock()+spin_unlock() dirties the lock cache line. On
> > > > > any bigger system this should make a very measurable difference - if
> > > > > spin_unlock_wait() is ever used in a performance critical code path.
> > > > At least for ipc/sem:
> > > > Dirtying the cacheline (in the slow path) allows removing an smp_mb() in the
> > > > hot path.
> > > > So for sem_lock(), I either need a primitive that dirties the cacheline or
> > > > sem_lock() must continue to use spin_lock()/spin_unlock().
> > >
> > > Technically you could use spin_trylock()+spin_unlock() and avoid the lock acquire
> > > spinning on spin_unlock() and get very close to the slow path performance of a
> > > pure cacheline-dirtying behavior.
> > >
> > > But adding something like spin_barrier(), which purely dirties the lock cacheline,
> > > would be even faster, right?
> >
> > Interestingly enough, the arm64 and powerpc implementations of
> > spin_unlock_wait() were very close to what it sounds like you are
> > describing.
>
> So could we perhaps solve all our problems by defining the generic version thusly:
>
> void spin_unlock_wait(spinlock_t *lock)
> {
> 	if (spin_trylock(lock))
> 		spin_unlock(lock);
> }
>
> ... and perhaps rename it to spin_barrier() [or whatever proper name there would
> be]?

As lockdep, 0day Test Robot, Linus Torvalds, and several others let me
know in response to my original (thankfully RFC!) patch series, this needs
to disable irqs to work in the general case. For example, if the lock
in question is an irq-disabling lock, you take an interrupt just after
a successful spin_trylock(), and that interrupt acquires the same lock,
the actuarial statistics of your kernel degrade sharply and suddenly.

What I get for sending out untested patches! :-/
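
For illustration only, here is a rough userspace sketch of the control flow
with the trylock/unlock pair wrapped in an irq-disabled region. Every toy_*
name is a hypothetical stand-in invented for this sketch, not a kernel
definition: the "irq flag" is just a plain variable, and the lock is a C11
atomic_flag. It models the shape of the fix only, and says nothing about
the memory-ordering guarantees a real implementation would need to provide
when the trylock fails.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy userspace stand-ins; none of these are the kernel's definitions. */
typedef struct { atomic_flag locked; } toy_spinlock_t;

/* Models the CPU's interrupt-enable state (assumption: single "CPU"). */
static bool toy_irqs_enabled = true;

/* Save the current "irq" state and disable, like local_irq_save(). */
static unsigned long toy_local_irq_save(void)
{
	unsigned long flags = toy_irqs_enabled;

	toy_irqs_enabled = false;
	return flags;
}

/* Restore the saved "irq" state, like local_irq_restore(). */
static void toy_local_irq_restore(unsigned long flags)
{
	toy_irqs_enabled = (flags != 0);
}

/* One acquisition attempt, no spinning; true means we now hold the lock. */
static bool toy_spin_trylock(toy_spinlock_t *lock)
{
	/* test_and_set returns the previous value: false means it was free. */
	return !atomic_flag_test_and_set_explicit(&lock->locked,
						  memory_order_acquire);
}

static void toy_spin_unlock(toy_spinlock_t *lock)
{
	atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}

/*
 * The trylock+unlock idea from the quoted proposal, with "irqs" disabled
 * around the window in which the lock is held locally, so that a handler
 * taking the same lock cannot run inside that window.
 */
static void toy_spin_barrier(toy_spinlock_t *lock)
{
	unsigned long flags = toy_local_irq_save();

	if (toy_spin_trylock(lock))
		toy_spin_unlock(lock);
	toy_local_irq_restore(flags);
}
```

A lock would be set up as "toy_spinlock_t lock = { ATOMIC_FLAG_INIT };"
and then passed to toy_spin_barrier(&lock). If the lock is already held,
the call returns without spinning and leaves the lock held.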

> Architectures can still optimize it, to remove the small window where the lock is
> held locally - as long as the ordering is at least as strong as the generic
> version.
>
> This would have various advantages:
>
> - semantics are well-defined
>
> - the generic implementation is already pretty well optimized (no spinning)
>
> - it would make it usable for the IPC performance optimization
>
> - architectures could still optimize it to eliminate the window where the lock is
> held locally - if there are such instructions available.
>
> Was this proposed before, or am I missing something?

It was sort of proposed...

https://marc.info/?l=linux-arch&m=149912878628355&w=2

But do we have a situation where normal usage of spin_lock() and
spin_unlock() is causing performance or scalability trouble?

(We do have at least one situation in fnic that appears to be a buggy use of
spin_is_locked(), and proposing a patch for that case is on my todo list.)

Thanx, Paul