Re: [PATCH] fix a race condition in cancelable mcs spinlocks

From: James Bottomley
Date: Mon Jun 02 2014 - 15:56:48 EST

Next message: Andi Kleen: "Re: [PATCH 4/4] kvm: Implement PEBS virtualization"
Previous message: Arnd Bergmann: "Re: [RFC 00/32] making inode time stamps y2038 ready"
In reply to: Paul E. McKenney: "Re: [PATCH] fix a race condition in cancelable mcs spinlocks"
Next in thread: Peter Zijlstra: "Re: [PATCH] fix a race condition in cancelable mcs spinlocks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun, 2014-06-01 at 23:30 +0200, Peter Zijlstra wrote:
> On Sun, Jun 01, 2014 at 04:46:26PM -0400, John David Anglin wrote:
> > On 1-Jun-14, at 3:20 PM, Peter Zijlstra wrote:
> >
> > >>If you write to some variable with ACCESS_ONCE and use cmpxchg or xchg
> > >>at
> > >>the same time, you break it. ACCESS_ONCE doesn't take the hashed
> > >>spinlock,
> > >>so, in this case, cmpxchg or xchg isn't really atomic at all.
> > >
> > >And this is really the first place in the kernel that breaks like this?
> > >I've been using xchg() and cmpxchg() without such consideration for
> > >quite a while.
> >
> > I believe Mikulas is correct. Even in a controlled situation where a
> > cmpxchg operation
> > is used to implement pthread_spin_lock() in userspace, we found recently
> > that the lock
> > must be released with a cmpxchg operation and not a simple write on SMP
> > systems.
> > There is a race in the cache operations or instruction ordering that's not
> > present with
> > the ldcw instruction.
>
> Oh, I'm not arguing that. He's quite right that its broken, but this
> form of atomic ops is also quite insane and unusual. Most sane machines
> don't have this problem.
>
> My main concern is how are we going to avoid breaking parisc (and I
> think sparc32, which is similarly retarded) in the future; we should
> invest in machinery to find and detect these things.

Architecturally, there is a way we could emulate the atomic exchange
instructions. We could have a special section of memory that always
triggers a page trap. In the Q state dtlb trap handlers we could
recognise the "atomic" section of memory and wrap the attempted
modification in a semaphore. This would add a bit of overhead, but not
a huge amount if we do it in the trap handlers like the TMPALIAS
flushes. This involves a lot of work for us because we have to decode
the instructions in software, recognise the operations and manually
apply the hashed semaphores around them. If we did it like this, all
we'd need by way of mainline support is that variables treated as
atomically exchangeable should be in a separate section (because it's a
page fault handler effectively, we need them all separated from "normal"
code). This would probably require some type of variable marker, and if
we ever saw an xchg or cmpxchg on a variable without the marker, we could
break the build.
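
To make the marker idea concrete, here is a rough userspace sketch of
what it might look like. Everything named here (__xchg_var, xchg_long_t,
DEFINE_XCHG_LONG, xchg_marked, the "xchg_atomic" section name) is
invented for illustration and is not an existing kernel interface. It
assumes GCC or Clang on an ELF target, where the linker provides
__start_/__stop_ symbols for sections whose names are valid C
identifiers, and it uses a compiler builtin as a stand-in for the real
arch xchg.

```c
#include <stdint.h>

/* Hypothetical marker: put the variable in its own ELF section. */
#define __xchg_var __attribute__((section("xchg_atomic")))

/* Hypothetical wrapper type, so marked variables are distinguishable. */
typedef struct { volatile long val; } xchg_long_t;

#define DEFINE_XCHG_LONG(name, init) \
	xchg_long_t name __xchg_var = { (init) }

/* Stand-in for the arch xchg; a compiler builtin is used for the demo. */
static inline long xchg_long_impl(xchg_long_t *p, long newval)
{
	return __atomic_exchange_n(&p->val, newval, __ATOMIC_SEQ_CST);
}

/* Refuse to compile unless the operand has the marker type. */
#define xchg_marked(ptr, newval) ({					\
	_Static_assert(__builtin_types_compatible_p(			\
			__typeof__(*(ptr)), xchg_long_t),		\
		"xchg on a variable without the marker");		\
	xchg_long_impl((ptr), (newval));				\
})

/* GNU ld and lld emit these bounds for identifier-named sections. */
extern xchg_long_t __start_xchg_atomic[];
extern xchg_long_t __stop_xchg_atomic[];

/* What a trap handler would ask: is this address in the region? */
static inline int in_xchg_section(const void *addr)
{
	uintptr_t a = (uintptr_t)addr;

	return a >= (uintptr_t)__start_xchg_atomic &&
	       a <  (uintptr_t)__stop_xchg_atomic;
}

/* Example marked variable (the name is illustrative only). */
DEFINE_XCHG_LONG(demo_owner, 0);
```

With this, xchg_marked(&demo_owner, 1) compiles, while xchg_marked() on
a plain long fails the static assertion. The type check only catches
the wrong type; nothing in this sketch stops someone declaring an
xchg_long_t without the section marker, so a real version would need
something stronger.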

The way we'd implement it is that the memory region would be read and
write protected, so all loads and stores trap to the dtlb absent
handlers. For a ldX instruction, if it were not followed by a stX to the
same location, we'd simply give it the value. For ldX followed by stX
(an xchg), we'd take the lock, do the exchange and drop the lock, and for
stX not preceded by ldX, we'd take the lock, do the store and drop the
lock.
To avoid compromising the protected region, we'd actually back it by a
different area of kernel memory where we make the real modifications,
rather than trying to muck with temporarily inserting a TLB entry. On
return we'd have to nullify the instructions to avoid re-trapping.
Effectively this has us emulating all load and store operations with a
shadow memory region ... if you know the number of possible address
modes for PARISC, you'll realise that's a non-trivial amount of code.
Plus we'd either have to ensure the shadow region had a permanent TLB
entry or do a full fault and exit the TLB handler (we can't take a
nested TLB fault within the TLB fault handler).
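
As a rough model of the three cases above, the following userspace
sketch dispatches an already decoded access against a shadow array
under a hashed lock. It deliberately skips the hard parts (instruction
decoding, the PARISC address modes, nullification, and the TLB
constraints), and every name in it (emulate_access, shadow,
hashed_lock, the table sizes) is made up for illustration. C11 atomics
stand in for whatever locking the real handler would use.

```c
#include <stdatomic.h>
#include <stddef.h>

#define NLOCKS 16	/* size of the hashed lock table (arbitrary) */
#define NWORDS 64	/* size of the shadow region in words (arbitrary) */

enum access_kind { ACC_LOAD, ACC_STORE, ACC_XCHG };

/* Static atomics are zero-initialised: locks start out unlocked. */
static atomic_int hashed_lock[NLOCKS];
static atomic_long shadow[NWORDS];	/* backing store for the region */

static void hash_lock(atomic_int *l)
{
	/* Spin until we are the one who changed 0 to 1. */
	while (atomic_exchange_explicit(l, 1, memory_order_acquire))
		;
}

static void hash_unlock(atomic_int *l)
{
	atomic_store_explicit(l, 0, memory_order_release);
}

/*
 * Emulate one trapped access to word idx of the protected region.
 * The caller must ensure idx < NWORDS. Returns the value the word
 * held before the access.
 */
static long emulate_access(enum access_kind kind, size_t idx, long val)
{
	atomic_int *l = &hashed_lock[idx % NLOCKS];
	long old;

	/* A lone load just gets the value; no lock is taken. */
	if (kind == ACC_LOAD)
		return atomic_load_explicit(&shadow[idx],
					    memory_order_relaxed);

	/* Both a lone store and an xchg go through the hashed lock. */
	hash_lock(l);
	old = atomic_load_explicit(&shadow[idx], memory_order_relaxed);
	atomic_store_explicit(&shadow[idx], val, memory_order_relaxed);
	hash_unlock(l);

	return old;
}
```

The point of routing the plain store through the same lock as the
exchange is that a store can then no longer slip in between the load
and store halves of an emulated xchg, which is the failure being
discussed in this thread.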

Is it worth it ... definitely not if we can just prevent mainline from
using xchg on our architecture.

James

