Re: [RFC] arm64: Enforce observed order for spinlock and data

From: bdegraaf
Date: Wed Oct 05 2016 - 11:30:17 EST

On 2016-10-05 11:10, Peter Zijlstra wrote:
On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf@xxxxxxxxxxxxxx wrote:
On 2016-10-04 15:12, Mark Rutland wrote:
>Hi Brent,
>Could you *please* clarify if you are trying to solve:
>(a) a correctness issue (e.g. data corruption) seen in practice.
>(b) a correctness issue (e.g. data corruption) found by inspection.
>(c) A performance issue, seen in practice.
>(d) A performance issue, found by inspection.
>Any one of these is fine; we just need to know in order to be able to
>help effectively, and so far it hasn't been clear.

Brent, you forgot to state which: 'a-d' is the case here.

I found the problem.

Back in September of 2013, arm64 atomics were broken due to missing barriers
in certain situations, but the problem at that time was undiscovered.

Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in at
time and changed the correct cmpxchg64 in lockref.c to cmpxchg64_relaxed.

d2212b4 appeared to be OK at that time because the additional barrier
requirements of this specific code sequence were not yet discovered, and
this change was consistent with the arm64 atomic code of that time.

Around February of 2014, some discovery led Will to correct the problem with
the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which
has an excellent explanation of potential ordering problems with the same
code sequence used by lockref.c.

With this updated understanding, the earlier commit
(d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.

Because acquire/release semantics are insufficient for the full ordering,
the single barrier after the store exclusive is the best approach, similar
to Will's atomic barrier fix.

This again does not in fact describe the problem.

What is the problem with lockref, and how (refer the earlier a-d
multiple choice answer) was this found.

Now, I have been looking, and we have some idea what you _might_ be
alluding to, but please explain which accesses get reordered how and
cause problems.

Sorry for the confusion, this was a "b" item (correctness fix based on code
inspection. I had sent an answer to this yesterday, but didn't realize that
it was in a separate, private email thread.

I'll work out the before/after problem scenarios and send them along once
I've hashed them out (it may take a while for me to paint a clear picture).
In the meantime, however, consider that even without the spinlock code in
the picture, lockref needs to treat the cmpxchg as a full system-level atomic,
because multiple agents could access the value in a variety of timings. Since
atomics similar to this are barriered on arm64 since 8e86f0b, the access to
lockref should be similar.