Re: [PATCH v2 0/9] Remove spin_unlock_wait()

From: Ingo Molnar
Date: Fri Jul 07 2017 - 04:31:41 EST

* Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:

> On Thu, Jul 06, 2017 at 09:20:24AM -0700, Paul E. McKenney wrote:
> > On Thu, Jul 06, 2017 at 06:05:55PM +0200, Peter Zijlstra wrote:
> > > On Thu, Jul 06, 2017 at 02:12:24PM +0000, David Laight wrote:
> > > > From: Paul E. McKenney
> >
> > [ . . . ]
> >
> > > Now on the one hand I feel like Oleg that it would be a shame to lose
> > > the optimization, OTOH this thing is really really tricky to use,
> > > and has led to a number of bugs already.
> >
> > I do agree, it is a bit sad to see these optimizations go. So, should
> > this make mainline, I will be tagging the commits that remove spin_unlock_wait()
> > so that they can be easily reverted should someone come up with good
> > semantics and a compelling use case with compelling performance benefits.
>
> Ha!, but what would constitute 'good semantics' ?
>
> The current thing is something along the lines of:
>
>   "Waits for the currently observed critical section
>    to complete with ACQUIRE ordering such that it will observe
>    whatever state was left by said critical section."
>
> With the 'obvious' benefit of limited interference on those actually
> wanting to acquire the lock, and a shorter wait time on our side too,
> since we only need to wait for completion of the current section, and
> not for however many contenders are before us.

There's another, probably just as significant advantage: queued_spin_unlock_wait()
is 'read-only', while spin_lock()+spin_unlock() dirties the lock cache line. On
any bigger system this should make a very measurable difference - if
spin_unlock_wait() is ever used in a performance critical code path.
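
To make the 'read-only' point concrete, here is a minimal user-space sketch. It assumes a toy test-and-set lock built on C11 atomics; the toy_* names are hypothetical and this is not the kernel's qspinlock code, just an illustration of which variant stores to the lock word:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy test-and-set lock (hypothetical, for illustration only). */
struct toy_lock {
	atomic_bool locked;
};

static inline void toy_lock_acquire(struct toy_lock *l)
{
	/* Atomic read-modify-write: every attempt needs the cache line exclusive. */
	while (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
		;
}

static inline void toy_lock_release(struct toy_lock *l)
{
	/* Plain store with RELEASE ordering: dirties the line once more. */
	atomic_store_explicit(&l->locked, false, memory_order_release);
}

/* Variant A: wait by taking and dropping the lock (writes the lock word). */
static inline void toy_wait_lock_unlock(struct toy_lock *l)
{
	toy_lock_acquire(l);
	toy_lock_release(l);
}

/* Variant B: read-only wait, ACQUIRE loads only, no store to the lock word. */
static inline void toy_unlock_wait(struct toy_lock *l)
{
	while (atomic_load_explicit(&l->locked, memory_order_acquire))
		;
}
```

Variant B only waits for the holder it currently observes and never queues behind other contenders, which is also why its semantics are harder to reason about than variant A.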

> Not sure I have an actual (micro) benchmark that shows a difference
> though.

It should be pretty obvious from pretty much any profile: the actual lock+unlock
sequence that modifies the lock cache line is essentially a global cacheline
bounce.

> Is this all good enough to retain the thing, I dunno. Like I said, I'm
> conflicted on the whole thing. On the one hand it's a nice optimization, on the
> other hand I don't want to have to keep fixing these bugs.

So on one hand it's _obvious_ that spin_unlock_wait() is both faster on the local
_and_ the remote CPUs for any sort of use case where performance matters - I don't
even understand how that can be argued otherwise.

The real question is, does any use-case (we care about) exist?

Here's a quick list of all the use cases:


- This is, I believe, the 'original', historic spin_unlock_wait() usecase that
still exists in the kernel. spin_unlock_wait() is only used in a rare case,
when the netfilter hash is resized via nf_conntrack_hash_resize() - which is
a very heavy operation to begin with. It will no doubt get slower with the
proposed changes, but it probably does not matter. A networking person
Acked-by would be nice though.


- Locking of the ATA port in ata_scsi_cmd_error_handler(); presumably this can
race with IRQs and ioctls() on other CPUs. Very likely not performance
sensitive in any fashion: on IO errors things stop for many seconds anyway.


- A rare race condition branch in the SysV IPC semaphore freeing code in
exit_sem() - where even the main code flow is not performance sensitive,
because typical database workloads get their semaphore arrays during startup
and don't ever do heavy runtime allocation/freeing of them.


- completion_done(). This is actually a (comparatively) rarely used completion
API call - almost all the upstream usecases are in drivers, plus two in
filesystems - and none of them seems to be in a performance critical hot path.
Completions typically involve scheduling and context switching, so in the
worst case the proposed change adds overhead to a scheduling slow path.

So I'd argue that unless there's some surprising performance aspect of a
completion_done() user, the proposed changes should not cause any performance
regressions.
In fact I'd argue that any future high performance spin_unlock_wait() user is
probably better off open coding the unlock-wait poll loop (and possibly thinking
hard about eliminating it altogether). If such patterns pop up in the kernel we
can think about consolidating them into a single read-only primitive again.
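
As a rough illustration of what such an open-coded poll loop could look like, here is a user-space sketch using C11 atomics. The struct and the toy_* helpers are hypothetical; the sketch assumes a single owner section at a time and is not taken from any existing kernel user:

```c
#include <stdatomic.h>

/* Hypothetical object with its own 'owner active' flag and some plain data. */
struct toy_obj {
	atomic_int owner_active;	/* 1 while the owner is in its critical section */
	int payload;			/* written by the owner before it exits */
};

static inline void toy_owner_enter(struct toy_obj *o)
{
	atomic_store_explicit(&o->owner_active, 1, memory_order_relaxed);
}

static inline void toy_owner_exit(struct toy_obj *o, int value)
{
	o->payload = value;
	/* RELEASE: the payload write is visible before the flag clears. */
	atomic_store_explicit(&o->owner_active, 0, memory_order_release);
}

/*
 * Open-coded wait: poll with ACQUIRE loads only, then read the state the
 * owner left behind. Assumes no new owner enters while we read the payload.
 */
static inline int toy_wait_idle(struct toy_obj *o)
{
	while (atomic_load_explicit(&o->owner_active, memory_order_acquire))
		;
	return o->payload;
}
```

The ACQUIRE load pairs with the owner's RELEASE store, so a waiter that sees the flag cleared by the owner also sees the payload. A waiter that never observes an active owner simply proceeds.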

I.e. I think the proposed changes are doing no harm, and the unavailability of a
generic primitive does not hinder future optimizations either in any significant
fashion.