Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

From: Thiago Jung Bauermann
Date: Thu Dec 06 2018 - 12:28:35 EST



Hello Gautham,

Gautham R. Shenoy <ego@xxxxxxxxxxxxxxxxxx> writes:

> From: "Gautham R. Shenoy" <ego@xxxxxxxxxxxxxxxxxx>
>
> Currently running DLPAR offline/online operations in a loop on a
> POWER9 system with SMT=off results in the following crash:
>
> [ 223.321032] cpu 112 (hwid 112) Ready to die...
> [ 223.355963] Querying DEAD? cpu 113 (113) shows 2
> [ 223.356233] cpu 114 (hwid 114) Ready to die...
> [ 223.356236] cpu 113 (hwid 113) Ready to die...
> [ 223.356241] Bad kernel stack pointer 1faf6ca0 at 1faf6d50
> [ 223.356243] Oops: Bad kernel stack pointer, sig: 6 [#1]
> [ 223.356244] LE SMP NR_CPUS=2048 NUMA pSeries
> [ 223.356247] Modules linked in:
> [ 223.356255] CPU: 114 PID: 0 Comm: swapper/114 Not tainted 4.20.0-rc3 #39
> [ 223.356258] NIP: 000000001faf6d50 LR: 000000001faf6d50 CTR: 000000001ec6d06c
> [ 223.356259] REGS: c00000001e5cbd30 TRAP: 0700 Not tainted (4.20.0-rc3)
> [ 223.356260] MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000020
> [ 223.356263] CFAR: 000000001ec98590 IRQMASK: 8000000000009033
> GPR00: 000000001faf6d50 000000001faf6ca0 000000001ed1c448 00000267e6a273c3
> GPR04: 0000000000000000 00000000000000e0 000000000000dfe8 000000001faf6d30
> GPR08: 000000001faf6d28 00000267e6a273c3 000000001ec1b108 0000000000000000
> GPR12: 0000000001b6d998 c00000001eb55080 c0000003a1b8bf90 000000001eea3f40
> GPR16: 0000000000000000 c0000006fda85100 c00000000004c8b0 c0000000013d5300
> GPR20: c0000006fda85300 0000000000000008 c0000000019d2cf8 c0000000013d6888
> GPR24: 0000000000000072 c0000000013d688c 0000000000000002 c0000000013d688c
> GPR28: c0000000019cecf0 0000000000000390 0000000000000000 000000001faf6ca0
> [ 223.356281] NIP [000000001faf6d50] 0x1faf6d50
> [ 223.356281] LR [000000001faf6d50] 0x1faf6d50
> [ 223.356282] Call Trace:
> [ 223.356283] Instruction dump:
> [ 223.356285] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> [ 223.356286] XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
> [ 223.356290] ---[ end trace f46a4e046b564d1f ]---
>
> This is due to multiple offlined CPUs (CPU 113 and CPU 114 above)
> concurrently (within 3us) trying to make the rtas-call with the
> "stop-self" token, something that is prohibited by the PAPR.
>
> The concurrent calls can happen as follows.
>
> o In dlpar_offline_cpu() we prod an offline CPU X (offline due to
> SMT=off) and loop for 25 tries in pseries_cpu_die() querying if
> the target CPU X has been stopped in RTAS. After 25 tries, we
> prints the message "Querying DEAD? cpu X (X) shows 2" and return
> to dlpar_offline_cpu(). Note that at this point CPU X has not yet
> called rtas with the "stop-self" token, but can do so anytime now.
>
> o Back in dlpar_offline_cpu(), we prod the next offline CPU Y. CPU Y
> promptly wakes up and calls RTAS with the "stop-self" token.
>
> o Before RTAS can stop CPU Y, CPU X also calls RTAS with the
> "stop-self" token.
>
> The problem occurs due to the short number of tries (25) in
> pseries_cpu_die() which covers 90% of the cases. For the remaining 10%
> of the cases, it was observed that we would need to loop for a few
> hundred iterations before the target CPU calls rtas with "stop-self"
> token.Experiments show that each try takes roughly ~25us.
>
> In this patch we fix the problem by increasing the number of tries
> from 25 to 4000 (which roughly corresponds to 100ms) before bailing
> out and declaring that we have failed to observe the target CPU call
> rtas-stop-self. This provides sufficient serialization to ensure that
> there no concurrent rtas calls with "stop-self" token.
>
>
>
> Reported-by: Michael Bringmann <mwb@xxxxxxxxxxxxxxxxxx>
> Signed-off-by: Gautham R. Shenoy <ego@xxxxxxxxxxxxxxxxxx>
> ---
> arch/powerpc/platforms/pseries/hotplug-cpu.c | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index 2f8e621..c913c44 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -214,14 +214,25 @@ static void pseries_cpu_die(unsigned int cpu)
> msleep(1);
> }
> } else if (get_preferred_offline_state(cpu) == CPU_STATE_OFFLINE) {
> + int max_tries = 4000; /* Roughly corresponds to 100ms */
> + u64 tb_before = mftb();
>
> - for (tries = 0; tries < 25; tries++) {
> + for (tries = 0; tries < max_tries; tries++) {
> cpu_status = smp_query_cpu_stopped(pcpu);
> if (cpu_status == QCSS_STOPPED ||
> cpu_status == QCSS_HARDWARE_ERROR)
> break;
> cpu_relax();
> }
> +
> + if (tries == max_tries) {
> + u64 time_elapsed_us = div_u64(mftb() - tb_before,
> + tb_ticks_per_usec);
> +
> + pr_warn("Offlined CPU %d isn't stopped by RTAS after %llu us\n",
> + cpu, time_elapsed_us);
> + WARN_ON(1);
> + }
> }
>
> if (cpu_status != 0) {

I posted a similar patch last year, but I wasn't able to arrive at a
root cause analysis like you did:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2017-February/153734.html

One thing I realized after I posted the patch was that in my case, the
CPU was crashing inside RTAS. From the NIP and LR in the trace above it
looks like it's crashing in RTAS in your case as well.

Michael Ellerman had two comments on my patch:

1. Regardless of the underlying bug, the kernel shouldn't crash so we
need a patch making it more resilient to this failure.

2. The wait loop should use udelay() so that the loop will actually take
a set amount of wall time, rather than just cycles.

Regarding 1. if the problem is that the kernel is causing RTAS to crash
because it calls it in a way that's unsupported, then I don't see how we
can make the kernel more resilient. We have to make the kernel respect
RTAS' restrictions (or alternatively, poke RTAS devs to make RTAS fail
gracefuly in these conditions).

Regarding 2. I implemented a new version of my patch (posted below) but
I was never able to test it because I couldn't access a system where the
problem was reproducible anymore.

There's also a race between the CPU driving the unplug and the CPU being
unplugged which I think is not easy for the CPU being unplugged to win,
which makes the busy loop in pseries_cpu_die() a bit fragile. I describe
the race in the patch description.

My solution to make the race less tight is to make the CPU driving the
unplug to only start the busy loop only after the CPU being unplugged is
in the CPU_STATE_OFFLINE state. At that point, we know that it either is
about to call RTAS or it already has.

Do you think this makes sense? If you do, would you mind testing my
patch? You can change the timeouts and delays if you want. To be honest
they're just guesses on my part...