Re: [RFC] Disable lockref on arm64

From: Linus Torvalds
Date: Wed May 01 2019 - 12:41:48 EST


On Mon, Apr 29, 2019 at 7:52 AM Jan Glauber <jglauber@xxxxxxxxxxx> wrote:
>
> It turned out the issue we have on ThunderX2 is the file open-close sequence
> with small read sizes. If the used files are opened read-only the
> lockref code (enabled by ARCH_USE_CMPXCHG_LOCKREF) is used.
>
> The lockref CMPXCHG_LOOP uses an unbound (as long as the associated
> spinlock isn't taken) while loop to change the lock count. This behaves
> badly under heavy contention

Ok, excuse me when I rant a bit.

Since you're at Marvell, maybe you can forward this rant to the proper
guilty parties?

Who was the absolute *GENIUS* who went

Step 1: "Oh, we have a middling CPU that isn't world-class on its own"

Step 2: "BUT! We can put a lot of them on a die, because that's 'easy'"

Step 3: "But let's make sure the interconnect isn't all that special,
because that would negate the the whole 'easy' part, and really strong
interconnects are even harder than CPU's and use even more power, so
that wouldn't work"

Step 4: "I wonder why this thing scales badly?"

Seriously. Why are you guys doing this? Has nobody ever looked at the
fundamental thought process above and gone "Hmm"?

If you try to compensate for a not-great core by putting twice the
number of them in a system, you need a cache system and interconnect
between them that is more than twice as good as the competition.

And honestly, from everything that I hear, you don't have it. The
whole chip is designed for "throughput when there is no contention".
Is it really a huge surprise that it then falls flat on its face when
there's something fancy going on?

So now you want to penalize everybody else in the ARM community
because you have a badly balanced system?

Ok, rant over.

The good news is that we can easily fix _this_ particular case by just
limiting the CMPXCHG_LOOP to a maximum number of retries, since the
loop is already designed to fail quickly if the spin lock value isn't
unlocked, and all the lockref code is already organized to fall back
to spinlocks.

So the attached three-liner patch may just work for you. Once _one_
thread hits the maximum retry case and goes into the spinlocked case,
everybody else will also fall back to spinlocks because they now see
that the lockref is contended. So the "retry" value probably isn't all
that important, but let's make it big enough that it probably never
happens on a well-balanced system.

But seriously: the whole "let's just do lots of CPU cores because it's
easy" needs to stop. It's fine if you have a network processor and
you're doing independent things, but it's not a GP processor approach.

Your hardware people need to improve on your CPU core (maybe the
server version of Cortex A76 is starting to approach being good
enough?) and your interconnect (seriously!) instead of just slapping
32 cores on a die and calling it a day.

Linus "not a fan of the flock of chickens" Torvalds
lib/lockref.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/lib/lockref.c b/lib/lockref.c
index 3d468b53d4c9..a6762f8f45c9 100644
--- a/lib/lockref.c
+++ b/lib/lockref.c
@@ -9,6 +9,7 @@
* failure case.
*/
#define CMPXCHG_LOOP(CODE, SUCCESS) do { \
+ int retry = 15; /* Guaranteed random number */ \
struct lockref old; \
BUILD_BUG_ON(sizeof(old) != 8); \
old.lock_count = READ_ONCE(lockref->lock_count); \
@@ -21,6 +22,8 @@
if (likely(old.lock_count == prev.lock_count)) { \
SUCCESS; \
} \
+ if (!--retry) \
+ break; \
cpu_relax(); \
} \
} while (0)