Re: [RFC] Disable lockref on arm64

From: Jayachandran Chandrasekharan Nair
Date: Thu May 02 2019 - 19:19:48 EST


On Thu, May 02, 2019 at 09:12:18AM -0700, Linus Torvalds wrote:
> On Thu, May 2, 2019 at 1:27 AM Jan Glauber <jglauber@xxxxxxxxxxx> wrote:
> >
> > I'll see how x86 runs the same testcase, I thought that playing
> > cacheline ping-pong is not the optimal use case for any CPU.
>
> Oh, ping-pong is always bad.
>
> But from past experience, x86 tends to be able to always do a tight
> cmpxchg loop without failing more than once or twice, which is all
> you need for things like this.

I don't really see the point you are making about hardware. If you
look at the test case, you have about 64 cores doing CAS to the same
location. At any point one of them will succeed and the other 63 will
fail - and in our case, since cpu_relax() is a nop, they sit in a
tight loop, mostly failing.

And further, due to the nature of the test case, the successful thread
comes back almost immediately with another CAS.
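
To make that concrete, the hot path of the test behaves roughly like
the sketch below (a simplified userspace approximation, not the kernel
code; the counter and function name are placeholders):

#include <stdatomic.h>

static _Atomic unsigned long shared_count;	/* one hot cache line */

static void hammer_once(void)
{
	unsigned long old, new;

	for (;;) {
		old = atomic_load_explicit(&shared_count,
					   memory_order_relaxed);
		new = old + 1;
		/* one of the ~64 cores wins; the rest fail and retry
		 * at once - no backoff, just like the kernel loop
		 * where cpu_relax() is a nop */
		if (atomic_compare_exchange_strong_explicit(&shared_count,
				&old, new, memory_order_relaxed,
				memory_order_relaxed))
			break;
	}
}

Run one of these per core and every iteration is a full exclusive
ownership transfer of that line - the ping-pong you mention above.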

> And it's "easy" to do in hardware on a CPU: all you need to do is
> guarantee that when you have a cmpxchg loop, the cacheline is sticky
> enough that it stays around at the local CPU for the duration of one
> loop entry (ie from one cmpxchg to the next).
>
> Obviously you can do that wrong too, and make cachelines *too* sticky,
> and then you get fairness issues.

That is certainly not the case here; we are not bouncing cachelines
around without making progress. We have all 64 cores hitting the same
location in a very tight loop, which slows the whole system down. And
you get fairness issues anyway over which of the failing cores
succeeds next.

The testcase does not hang indefinitely; it eventually completes. The
scaling loss is, in my opinion, due to the naive lockref
implementation rather than due to a hardware limitation.
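
For reference, the lockref_get() fast path is roughly the following
shape (heavily simplified from lib/lockref.c, with the CMPXCHG_LOOP
macro inlined and the other operations omitted):

struct lockref {
	union {
		aligned_u64 lock_count;		/* the cmpxchg target */
		struct {
			spinlock_t lock;
			int count;
		};
	};
};

void lockref_get(struct lockref *lockref)
{
	struct lockref old, new;
	u64 prev;

	old.lock_count = READ_ONCE(lockref->lock_count);
	while (arch_spin_value_unlocked(old.lock.rlock.raw_lock)) {
		new = old;
		new.count++;
		/* every contending CPU does CAS on the same 8-byte word */
		prev = cmpxchg64_relaxed(&lockref->lock_count,
					 old.lock_count, new.lock_count);
		if (prev == old.lock_count)
			return;			/* we won the race */
		old.lock_count = prev;		/* we lost, go around again */
		cpu_relax();
	}

	/* someone holds the lock: fall back to the spinlock path */
	spin_lock(&lockref->lock);
	lockref->count++;
	spin_unlock(&lockref->lock);
}

With 64 cores in that while loop, most of the CAS attempts are wasted
work and wasted coherency traffic.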

Are you expecting the hardware cache-coherency implementation to have
the equivalent of queued locks, and to hold off a CAS that is going to
fail?

From talking to the folks doing performance comparisons here, I
understand that x86 suffers in the same test case as well once there
are enough cores.

Your patch that switches lockref to the spinlock (a queued spinlock in
this case) works nicely under high contention (see the sketch below).
Is this something that will be merged to mainline? We can provide some
test results if that would help.
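
If I read the patch right, with the cmpxchg fast path out of the way
lockref_get() reduces to just the spinlock sequence, and since arm64
spinlocks are qspinlocks, contending CPUs queue up (spinning on their
own MCS node) instead of all retrying CAS on the shared word:

void lockref_get(struct lockref *lockref)
{
	spin_lock(&lockref->lock);	/* queued: waiters line up */
	lockref->count++;
	spin_unlock(&lockref->lock);
}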

> But it really sounds like what happens in your ThunderX2 case is that
> the different CPUs steal each other's cachelines so quickly that even
> when you get the cacheline, you don't then get to update it.
>
> Does ThunderX2 do LSE atomics? Are the acquire/release versions really
> slow, perhaps, and more or less serializing (maybe it does the
> "release" logic even when the store _fails_?), so that doing two
> back-to-back cmpxchg ends up taking the core a "long" time, so that
> the cache subsystem then steals it easily in between cmpxchg's in a
> loop? Does the L1 cache maybe have no way to keep a line around from
> one cmpxchg to the next?

ThunderX2 has LSE atomics. It also has full out-of-order execution
with weak ordering for loads and stores, so a barrier will slow down
execution.
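
To illustrate the ordering part of your question with a userspace
sketch (the function names are mine; a compiler targeting armv8.1-a
will typically emit the LSE CAS variants here):

#include <stdatomic.h>

static _Atomic unsigned long word;

/* typically compiles to LSE "cas" - no ordering attached */
_Bool cas_relaxed(unsigned long *expected, unsigned long desired)
{
	return atomic_compare_exchange_strong_explicit(&word, expected,
			desired, memory_order_relaxed,
			memory_order_relaxed);
}

/* typically compiles to LSE "casal" - acquire and release semantics
 * attached to the same instruction */
_Bool cas_acq_rel(unsigned long *expected, unsigned long desired)
{
	return atomic_compare_exchange_strong_explicit(&word, expected,
			desired, memory_order_acq_rel,
			memory_order_acquire);
}

For what it is worth, the lockref fast path uses the relaxed variant
(cmpxchg64_relaxed), so I would not expect acquire/release cost to be
the dominant factor in this particular test.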

Also, to address some points from the earlier rant: ThunderX2 is a
fairly beefy processor (based on Broadcom Vulcan); it compares well on
per-core performance with x86 (and, from what I hear, with the A76,
even though the A76 came out a few years later). There are more cores
per socket because there is no ISA baggage, not because the core is
weaker.

> This is (one example) where having a CPU and an interconnect that
> works together matters. And yes, it probably needs a few generations
> of hardware tuning where people see problems and fix them.

The next-generation ThunderX3 is in the works, and it will have even
more cores - it is going to be fun :)

JC