Re: [RFC] Disable lockref on arm64

From: Linus Torvalds
Date: Thu May 02 2019 - 12:13:01 EST


On Thu, May 2, 2019 at 1:27 AM Jan Glauber <jglauber@xxxxxxxxxxx> wrote:
>
> I'll see how x86 runs the same testcase, I thought that playing
> cacheline ping-pong is not the optimal use case for any CPU.

Oh, ping-pong is always bad.

But from past experience, x86 tends to be able to always do tight a
cmpxchg loop without failing more than a once or twice, which is all
you need for things like this.

And it's "easy" to do in hardware on a CPU: all you need to do is
guarantee that when you have a cmpxchg loop, the cacheline is sticky
enough that it stays around at the local CPU for the duration of one
loop entry (ie from one cmpxchg to the next).

Obviously you can do that wrong too, and make cachelines *too* sticky,
and then you get fairness issues.

But it really sounds like what happens for your ThunderX2 case, the
different CPU's steal each others cachelines so quickly that even when
you get the cacheline, you don't then get to update it.

Does ThunderX2 do LSE atomics? Are the acquire/release versions really
slow, perhaps, and more or less serializing (maybe it does the
"release" logic even when the store _fails_?), so that doing two
back-to-back cmpxchg ends up taking the core a "long" time, so that
the cache subsystem then steals it easily in between cmpxchg's in a
loop? Does the L1 cache maybe have no way to keep a line around from
one cmpxchg to the next?

This is (one example) where having a CPU and an interconnect that
works together matters. And yes, it probably needs a few generations
of hardware tuning where people see problems and fix them.

Linus