Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

From: Gautham R Shenoy
Date: Wed Jan 09 2019 - 01:08:35 EST


Hello Thiago,

Wish you a happy 2019!

On Sat, Dec 08, 2018 at 12:40:52AM -0200, Thiago Jung Bauermann wrote:
>
> Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx> writes:
> > On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
> >> Sure. I will test the patch and report back.
> >
> > I added the following debug patch on top of your patch, and after an
> > hour's run, the system crashed. Appending the log at the end.
>
> Thank you very much for testing! Your debug patch was very helpful as
> well.
>
> > I suppose we still need to increase the number of tries since we wait
> > only for 2.5ms looping before giving up.
>
> Do you think it would have helped? In the debug output you posted I
> would have expected the following message to show up if the loop
> finished too early, and it didn't:
>
> "Querying DEAD? cpu %i (%i) shows %i\n"
>
> So I don't think increasing the loop length would have made a
> difference. In fact, the call to smp_query_cpu_stopped() always
> succeeded on the first iteration.

I did some testing during the holidays. Here are the observations:

1) With just your patch (without any additional debug patch), if I run
DLPAR on /off operations on a system that has SMT=off, I am able to
see a crash involving RTAS stack corruption within an hour's time.

2) With the debug patch (appended below) which has additional debug to
capture the callers of stop-self, start-cpu, set-power-levels, the
system is able to perform DLPAR on/off operations on a system with
SMT=off for three days. And then, it crashed with the dead CPU showing
a "Bad kernel stack pointer". From this log, I can clearly
see that there were no concurrent calls to stop-self, start-cpu,
set-power-levels. The only concurrent RTAS calls were the dying CPU
calling "stop-self", and the CPU running the DLPAR operation calling
"query-cpu-stopped-state". The crash signature is appended below as
well.

3) Modifying your patch to remove the udelay and increase the loop
count from 25 to 1000 doesn't improve the situation. We are still able
to see the crash.

4) With my patch, even without any additional debug, I was able to
observe the system run the tests successfully for over a week (I
started the tests before the Christmas weekend, and forgot to turn it
off!)

It appears that there is a narrow race window involving rtas-stop-self
and rtas-query-cpu-stopped-state calls that can be observed with your
patch. Adding any printk's seems to reduce the probability of hitting
this race window. It might be worth the while to check with RTAS
folks, if they suspect something here.

The Crash log with this debug patch is as follows

[DEBUG] CPU 32 waiting for CPU 130 to enter rtas
cpu 130 (hwid 130) Ready to die...
[DEBUG] CPU 32 noticed CPU 130 enter rtas: tries=0, time=539
[DEBUG] CPU 32 waiting for CPU 131 to enter rtas
cpu 131 (hwid 131) Ready to die...
[DEBUG] CPU 32 noticed CPU 131 enter rtas: tries=0, time=137
[DEBUG] CPU 32 waiting for CPU 132 to enter rtas
cpu 132 (hwid 132) Ready to die...
[DEBUG] CPU 32 noticed CPU 132 enter rtas: tries=0, time=1238
[DEBUG] CPU 32 waiting for CPU 133 to enter rtas
cpu 133 (hwid 133) Ready to die...
[DEBUG] CPU 32 noticed CPU 133 enter rtas: tries=1, time=1259
[DEBUG] CPU 32 waiting for CPU 134 to enter rtas
cpu 134 (hwid 134) Ready to die...
[DEBUG] CPU 32 noticed CPU 134 enter rtas: tries=0, time=1141
[DEBUG] CPU 32 waiting for CPU 135 to enter rtas
cpu 135 (hwid 135) Ready to die...
[DEBUG] CPU 32 noticed CPU 135 enter rtas: tries=0, time=636
cpu 120 (hwid 120) Ready to die...
[DEBUG] CPU 32 waiting for CPU 120 to enter rtas
[DEBUG] CPU 32 noticed CPU 120 enter rtas: tries=0, time=53
[DEBUG] CPU 32 waiting for CPU 121 to enter rtas
cpu 121 (hwid 121) Ready to die...
Bad kernel stack pointer 1fafb400 at 1fafb4b0
Bad kernel stack pointer 1fafb4b0 at 1ec15270
Oops: Bad kernel stack pointer, sig: 6 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in:
CPU: 121 PID: 0 Comm: swapper/121 Not tainted 4.20.0-thiago-original-with-debug+ #57
NIP: 000000001ec15270 LR: 000000001ec1526c CTR: 000000001ecd26c4
REGS: c00000001e577d30 TRAP: 0300 Not tainted (4.20.0-thiago-original-with-debug+)
MSR: 8000000000001000 <SF,ME> CR: 48000000 XER: 00000020
CFAR: 000000001ecd27a8 DAR: ffffffffffff8108 DSISR: 40000000 IRQMASK: 8000000000009033
GPR00: 000000001ec1526c 000000001fafb4b0 0000000000000000 0000000000000000
GPR04: 0000000001a40968 00000000000000e0 000000000000dfe8 0000000000000018
GPR08: 000000001f82ae00 000000001ed025d4 000000001f827850 0000000000000000
GPR12: 0000000001b6d998 c00000001eb4e080 c0000003a1bdbf90 000000001eea3020
GPR16: 0000000000000000 c0000006fdc45100 c00000000004c8b0 c0000000013d5300
GPR20: c0000006fdc45300 0000000000000008 c0000000019d2cf8 c0000000013d6888
GPR24: 0000000000000079 c0000000013d688c 0000000000000002 c0000000013d688c
GPR28: c0000000019cecf0 00000000000003c8 0000000000000000 000000001fafb4b0
NIP [000000001ec15270] 0x1ec15270
LR [000000001ec1526c] 0x1ec1526c
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace dd774c4912df6c89 ]---


The debug patch over your patch is as follows:

-----------------------x8-------------------------------------------