Re: [PATCH 4.18 000/123] 4.18.6-stable review

From: Guenter Roeck
Date: Sat Sep 08 2018 - 23:58:31 EST

On 09/05/2018 10:01 AM, Linus Torvalds wrote:
On Wed, Sep 5, 2018 at 8:34 AM Guenter Roeck <linux@xxxxxxxxxxxx> wrote:

On 09/05/2018 02:01 AM, Greg Kroah-Hartman wrote:
[ 9990.754641] watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [kworker/5:1:155]
[ 9990.762601] RIP: 0010:smp_call_function_many+0x208/0x270
[ 9990.762601] Code: e8 0d d1 77 00 3b 05 cb f0 24 01 0f 83 86 fe ff ff 48 63 d0 49 8b 0c 24 48 03 0c d5 00 f7 11 a7 8b 51 18 83 e2 01 74 0a f3 90 <8b> 51 18 83 e2 01 75 f6 eb c7 0f b6 4d d0 4c 89 f2 4c 89 ee 44 89

It's stuck in this loop:

mov 0x18(%rcx),%edx
and $0x1,%edx
jne loop

which is csd_lock_wait().

Judging by the offset in smp_call_function_many(), it's the final one
(there's two: the other one is part of "csd_lock()"). But that's just
a guess.

Anyway, it means that we're waiting for another CPU to finish
processing an IPI - either a previous one we sent asynchronously (if
it's the earlier csd_lock() case) or the TLB IPI we just sent and
we're waiting for completion of.

Not tested, but I see it in v4.17.19 and in v4.18.6-rc2. Turns out it is
related to heavy load, not to suspend/resume. At this point I suspect that
it may be an AMD/Ryzen specific problem - it looks like it disappears if I
add "kernel.randomize_va_space = 0" to /etc/sysctl.conf. No idea if it is a
CPU bug or some AMD specific code problem. I'll try to analyze it further.

Ouch. Some IPI sending/receiving problem would be very very painful to
debug if it's hw related.

Turns out this is a well known problem with Ryzen CPUs: