Re: [sparc64] locking/atomic, kernel OOPS on running stress-ng

From: Mark Rutland
Date: Mon Jul 05 2021 - 15:57:09 EST


On Mon, Jul 05, 2021 at 06:16:49PM +0300, Anatoly Pugachev wrote:
> Hello!

Hi Anatoly,

> latest sparc64 git kernel produces the following OOPS on running stress-ng as :
>
> $ stress-ng -v --mmap 1 -t 30s
>
> kernel OOPS (console logs):
>
> [ 27.276719] Unable to handle kernel NULL pointer dereference
> [ 27.276782] tsk->{mm,active_mm}->context = 00000000000003cb
> [ 27.276818] tsk->{mm,active_mm}->pgd = fff800003a2a0000
> [ 27.276853] \|/ ____ \|/
> [ 27.276853] "@'/ .. \`@"
> [ 27.276853] /_| \__/ |_\
> [ 27.276853] \__U_/
> [ 27.276927] stress-ng(928): Oops [#1]

I can reproduce this under QEMU; following your bisection (and working
around the missing ifdeffery that breaks bisection), I can confirm that
the first broken commit is:

ff5b4f1ed580 ("locking/atomic: sparc: move to ARCH_ATOMIC")

Sorry about this.

> Can someone please look at this commit ids?

From digging into this, I can't spot an obvious bug in the commit above.

It looks like this happens when some of the xchg/cmpxchg variants are
wrapped by <asm-generic/atomic-instrumented.h>, but I can't immediately
explain why. This might be a latent bug that's being tickled by the
structure of the wrappers, or some subtlety with the typecasting that
happens in the wrappers.
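
For anyone following along, here is a rough userspace sketch of the
wrapper structure in question. It is only an illustration, not the kernel
code: every sketch_* name is hypothetical, the arch-level exchange is
modelled with the GCC/Clang __atomic builtins rather than the sparc64
implementation, and it assumes an LP64 target and GNU C extensions
(statement expressions, __typeof__). The points of interest are the
pointer copy made via __typeof__ and the casts through unsigned long.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a size-dispatching arch-level exchange. */
static inline unsigned long
sketch_arch_xchg_sized(volatile void *ptr, unsigned long val, size_t size)
{
	switch (size) {
	case 4:
		return __atomic_exchange_n((volatile uint32_t *)ptr,
					   (uint32_t)val, __ATOMIC_SEQ_CST);
	case 8:
		return __atomic_exchange_n((volatile uint64_t *)ptr,
					   (uint64_t)val, __ATOMIC_SEQ_CST);
	default:
		assert(!"unsupported operand size");
		return 0;
	}
}

/* The value goes in as unsigned long and is cast back to the pointee type. */
#define sketch_arch_xchg(ptr, x)					\
	((__typeof__(*(ptr)))sketch_arch_xchg_sized((ptr),		\
			(unsigned long)(x), sizeof(*(ptr))))

/* Hypothetical stand-in for the instrumentation hook; it only counts bytes. */
static unsigned long sketch_instrumented_bytes;

static inline void sketch_instrument(const volatile void *p, size_t size)
{
	(void)p;
	sketch_instrumented_bytes += size;
}

/*
 * Instrumented wrapper: evaluate ptr once into a typed local, report the
 * access, then forward the local and the remaining arguments.
 */
#define sketch_xchg(ptr, ...)						\
({									\
	__typeof__(ptr) __ai_ptr = (ptr);				\
	sketch_instrument(__ai_ptr, sizeof(*__ai_ptr));			\
	sketch_arch_xchg(__ai_ptr, __VA_ARGS__);			\
})
```

With this layering, the arch-level macro sees the wrapper's local
__ai_ptr rather than the caller's original expression, so any size or
type dispatch underneath depends on what __typeof__ preserved.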

Starting with:

ff5b4f1ed580 ("locking/atomic: sparc: move to ARCH_ATOMIC")

... and atop that, cherry-picking:

bccf1ec369ac ("locking/atomics: atomic-instrumented: simplify ifdeffery")

... the below hack seems to make the stress-ng run pass without issue,
even after running for multiple minutes (when it would usually fail in a
few seconds).

In case this is a codegen issue, I'm using the kernel.org GCC 10.3.0
cross toolchain.

Thanks,
Mark.

---->8----