Re: [EXT] Re: [RFC] Disable lockref on arm64

From: Jayachandran Chandrasekharan Nair
Date: Mon May 06 2019 - 02:14:17 EST

On Fri, May 03, 2019 at 12:40:34PM -0700, Linus Torvalds wrote:
> On Thu, May 2, 2019 at 4:19 PM Jayachandran Chandrasekharan Nair
> <jnair@xxxxxxxxxxx> wrote:
> >
> > I don't really see the point you are making about hardware. If you
> > look at the test case, you have about 64 cores doing CAS to the same
> > location. At any point one of them will succeed and the other 63 will
> > fail - and in our case since cpu_relax is a nop, they sit in a tight
> > loop mostly failing.
>
> No.
>
> My point is that the others will *not* fail, if your cache coherency acts sane.
>
> Here's the deal: with a cmpxchg loop, no cacheline should *ever* be in
> shared mode as part of the loop. Agreed? Even if the cmpxchg is done
> with ldx/stx, the ldx should do a read-for-write cycle, so at no
> single time will you ever have a shared cacheline.
>
> And once one CPU gets ownership of the line, it doesn't lose it
> immediately, so the next cmpxchg will *succeed*.
>
> So at most, the *first* cmpxchg will fail (because that's the one that
> was fed not by a previous cmpxchg, but by a regular load (which we'd
> *like* to do as a "load-for-ownership" load, but we don't have the
> interfaces to do that)). But the second cmpxchg should basically always
> succeed, unless something exceptional happened (maybe an interrupt,
> maybe something big like that).
>
> Ergo: if you have a case of failing cmpxchg a lot, your cache
> coherency is simply bad. Your hardware people should be ashamed of
> themselves for letting go of the cacheline without just letting the
> next cmpxchg succeed.
>
> Notice how there is *NO* ping-pong. Sure, the cacheline moves around,
> but every time it moves around just once, a thread makes progress.
> None of this "for every progress, there are 63 threads that fail"
> garbage that you're claiming is normal.
>
> It's not normal, and it's not inevitable.

If you look at the code, the CAS failure is followed by a yield
before retrying the CAS. Yield on arm64 is expected to be a hint
to release resources so that other threads/cores can make progress.
Under heavy contention, I expect the current code to behave the way
I noted in my last mail, including the fairness issue.
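
For reference, the CAS-then-yield pattern discussed here can be
sketched in userspace C11 roughly as follows. This is only an
illustration under my own assumptions, not the lib/lockref.c code:
cas_inc() and relax_hint() are made-up names, and relax_hint() merely
stands in for cpu_relax() (YIELD on arm64, PAUSE on x86).

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for cpu_relax(): a spin-wait hint, or nothing. */
static inline void relax_hint(void)
{
#if defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#elif defined(__x86_64__) || defined(__i386__)
	__asm__ volatile("pause" ::: "memory");
#endif
}

/*
 * Increment *count with a CAS loop, in the style described above.
 * Returns the number of failed CAS attempts before success.
 */
static unsigned int cas_inc(_Atomic uint64_t *count)
{
	unsigned int failures = 0;
	/* The first expected value comes from a plain (relaxed) load. */
	uint64_t old = atomic_load_explicit(count, memory_order_relaxed);

	/* On failure, the CAS stores the value it observed into 'old'. */
	while (!atomic_compare_exchange_strong_explicit(count, &old, old + 1,
							memory_order_relaxed,
							memory_order_relaxed)) {
		failures++;
		relax_hint();	/* hint issued between a failed CAS and the retry */
	}
	return failures;
}
```

The returned failure count is the "number of loops" figure that the
two sides of this thread disagree about under contention.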

Perhaps someone from ARM can chime in here how the cas/yield combo
is expected to work when there is contention. ThunderX2 does not
do much with the yield, but I don't expect any ARM implementation
to treat YIELD as a hint not to yield, and instead to get/keep
exclusive access to the last failed CAS location.

> If it really happens, it's a sign of bad hardware. Just own it, and
> talk to the hw people, and make sure it gets fixed in ThunderX3. Ok?

Also, I tested the lockref code on a fairly high core count x86
system with SMT. The worst case number of loops taken is higher than
your guaranteed random number of 15, but the average number of loops
is fairly low (about 3-4, and double that for SMT). On x86,
I suppose there has been some coevolution between the software and
hardware on locking with cmpxchg and pause, so by now both are
optimized for each other.

Your larger point seems to be that the hardware has to be smarter to
scale standard locking implementations when adding cores, and
be graceful even in extremely high contention cases. Yes, this
is something we should be looking at for ThunderX3.

This whole discussion has been difficult since this has nothing to
do with the core capability which you originally talked about. There
are quite a few low-powered ARM64 cores (some of them in server space
too), but ThunderX2 is certainly not one. I say this from first hand
experience from using a ThunderX2 workstation as my primary system
for a while now. Kernel builds, git operations and running multiple
VMs work extremely well and are pretty fast compared to my earlier
x86 based system.

Anyway, I will talk to hardware folks on locking patterns and see
what can be done about cas & yield in ThunderX3. Thanks for your