Re: [PATCH] pseries/hotplug: Add more delay in pseries_cpu_die while waiting for rtas-stop

From: Michael Bringmann
Date: Mon Dec 10 2018 - 15:16:11 EST


I have asked Scott Mayes to take a look at one of these crashes from
the phyp side. I will let you know if he finds anything notable.

Michael

On 12/07/2018 08:40 PM, Thiago Jung Bauermann wrote:
>
> Gautham R Shenoy <ego@xxxxxxxxxxxxxxxxxx> writes:
>> On Fri, Dec 07, 2018 at 04:13:11PM +0530, Gautham R Shenoy wrote:
>>> Sure. I will test the patch and report back.
>>
>> I added the following debug patch on top of your patch, and after an
>> hour's run, the system crashed. Appending the log at the end.
>
> Thank you very much for testing! Your debug patch was very helpful as
> well.
>
>> I suppose we still need to increase the number of tries since we wait
>> only for 2.5ms looping before giving up.
>
> Do you think it would have helped? In the debug output you posted I
> would have expected the following message to show up if the loop
> finished too early, and it didn't:
>
> "Querying DEAD? cpu %i (%i) shows %i\n"
>
> So I don't think increasing the loop length would have made a
> difference. In fact, the call to smp_query_cpu_stopped() always
> succeeded on the first iteration.
>
> I think there is something else going on which we don't fully understand
> yet. From your other email:
>
>> I agree that the Kernel has to respect RTAS's restriction. The PAPR
>> v2.8.1, Requirement R1-7.2.3-8 under section 7.2.3 says the following:
>>
>> "The stop-self service needs to be serialized with calls to the
>> stop-self, start-cpu, and set-power-level services. The OS must
>> be able to call RTAS services on other processors while the
>> processor is stopped or being stopped"
>>
>> Thus the onus is on the OS to ensure that there are no concurrent rtas
>> calls with "stop-self" token.
>
> As you say perhaps there's another call to stop-self, start-cpu or
> set-power-level being made concurrently. I don't currently see how more
> than one stop-self or start-cpu call could be in flight at the same time
> given that there are a number of locks being grabbed during CPU hotplug
> and unplug. OTOH the CPU that actually calls stop-self doesn't seem to
> grab any locks itself so it's a possibility.
>
> As for set-power-level, it's only used in the case of PCI hotplug from
> what I can see, and that isn't part of the picture in this case, right?
>
> We could address this problem directly by adding another lock separate
> from rtas.lock to serialize just these calls. The challenge is with
> stop-self, because the CPU calling it will never return to release the
> lock. Is it possible to grab a lock (or down a semaphore) in the CPU
> calling stop-self and then release the lock (or up the semaphore) in the
> CPU running pseries_cpu_die()?
>
>>> There's also a race between the CPU driving the unplug and the CPU
>>> being unplugged which I think is not easy for the CPU being
>>> unplugged to win, which makes the busy loop in pseries_cpu_die() a
>>> bit fragile. I describe the race in the patch description.
>>>
>>> My solution to make the race less tight is to make the CPU driving
>>> the unplug to only start the busy loop only after the CPU being
>>> unplugged is in the CPU_STATE_OFFLINE state. At that point, we know
>>> that it either is about to call RTAS or it already has.
>>
>> Ah, yes this is good optimization. Though, I think we ought to
>> unconditionally wait until the target CPU has woken up from CEDE and
>> changed its state to CPU_STATE_OFFLINE. After if PROD failed, then we
>> would have caught it in dlpar_offline_cpu() itself.
>
> I recently saw a QEMU-implemented hcall (H_LOGICAL_CI_LOAD) return
> success when it had been given an invalid memory address to load from,
> so my confidence in the error reporting of hcalls is a bit shaken. :-)
>
> In that case the CPU would wait forever for the CPU state to change. If
> you believe 100 ms is too short a timeout, we could make it 500 ms or
> even 1s. What do you think?
>
>> cpu 112 (hwid 112) Ready to die...
>> [DEBUG] Waited for CPU 112 to enter rtas: tries=0, time=65
>> cpu 113 (hwid 113) Ready to die...
>> [DEBUG] Waited for CPU 113 to enter rtas: tries=0, time=1139
>> cpu 114 (hwid 114) Ready to die...
>> [DEBUG] Waited for CPU 114 to enter rtas: tries=0, time=1036
>> cpu 115 (hwid 115) Ready to die...
>> [DEBUG] Waited for CPU 115 to enter rtas: tries=0, time=133
>> cpu 116 (hwid 116) Ready to die...
>> [DEBUG] Waited for CPU 116 to enter rtas: tries=0, time=1231
>> cpu 117 (hwid 117) Ready to die...
>> [DEBUG] Waited for CPU 117 to enter rtas: tries=0, time=1231
>> cpu 118 (hwid 118) Ready to die...
>> [DEBUG] Waited for CPU 118 to enter rtas: tries=0, time=1231
>> cpu 119 (hwid 119) Ready to die...
>> [DEBUG] Waited for CPU 119 to enter rtas: tries=0, time=1131
>> cpu 104 (hwid 104) Ready to die...
>> [DEBUG] Waited for CPU 104 to enter rtas: tries=0, time=40
>
> Interesting, so 1.2 ms can pass before the dying CPU actually gets close
> to making the stop-self call. And even in those cases the retry loop is
> succeeding on the first try! So this shows that changing the code to
> wait for the CPU_STATE_OFFLINE state is worth it.
>
>> ******* RTAS CALL BUFFER CORRUPTION *******
>> 393: rtas32_call_buff_ptr=
>> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>> 0000 0060 0000 0060 0000 0060 0000 0060 [...`...`...`...`]
>> 0000 0060 0800 E07F ACA7 0000 0000 00C0 [...`............]
>> 2500 0000 0000 0000 0000 0000 0000 0000 [%...............]
>> 0000 0000 0000 0000 0000 0000 306E 7572 [............0nur]
>> 4800 0008 .... .... .... .... .... .... [H...........0nur]
>> 394: rtas64_map_buff_ptr=
>> 0000 0000 5046 5743 0000 0000 4F44 4500 [....PFWC....ODE.]
>> 0000 0000 6000 0000 0000 0000 0000 0069 [....`..........i]
>> 0000 0000 0000 0000 0000 0000 0000 0000 [................]
>> 0000 0000 0000 0005 0000 0000 0000 0001 [................]
>> 0000 0000 1A00 0000 0000 0000 0000 0000 [................]
>> 0000 0000 8018 6398 0000 0000 0300 00C0 [......c.........]
>> 0000 0000 .... .... .... .... .... .... [......c.........]
>
> Ah! I never saw this error message. So indeed the kernel is causing RTAS
> to blow up. Perhaps it would be useful to instrument more RTAS calls
> (especially start-cpu and set-power-level) to see if it's one of them
> that is being called at the time this corruption happens.
>
>> cpu 105 (hwid 105) Ready to die...
>> Bad kernel stack pointer 1fafb6c0 at 0
>> Oops: Bad kernel stack pointer, sig: 6 [#1]
>> LE SMP NR_CPUS=2048 NUMA pSeries
>> Modules linked in:
>> CPU: 105 PID: 0 Comm: swapper/105 Not tainted 4.20.0-rc5-thiago+ #45
>> NIP: 0000000000000000 LR: 0000000000000000 CTR: 00000000007829c8
>> REGS: c00000001e63bd30 TRAP: 0700 Not tainted (4.20.0-rc5-thiago+)
>> MSR: 8000000000081000 <SF,ME> CR: 28000004 XER: 00000010
>> CFAR: 000000001ec153f0 IRQMASK: 8000000000009033
>> GPR00: 0000000000000000 000000001fafb6c0 000000001ec236a0 0000000000000040
>> GPR04: 00000000000000c0 0000000000000080 00046c4fb4842557 00000000000000cd
>> GPR08: 000000000001f400 000000001ed035dc 0000000000000000 0000000000000000
>> GPR12: 0000000000000000 c00000001eb5e480 c0000003a1b53f90 000000001eea3e20
>> GPR16: 0000000000000000 c0000006fd845100 c00000000004c8b0 c0000000013d5300
>> GPR20: c0000006fd845300 0000000000000008 c0000000019d2cf8 c0000000013d6888
>> GPR24: 0000000000000069 c0000000013d688c 0000000000000002 c0000000013d688c
>> GPR28: c0000000019cecf0 0000000000000348 0000000000000000 0000000000000000
>> NIP [0000000000000000] (null)
>> LR [0000000000000000] (null)
>> Call Trace:
>> Instruction dump:
>> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
>> XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX 60000000 60000000 60000000 60000000
>> ---[ end trace 1aa3b4936949457e ]---
>
> Ok, so at about the time CPU 105 makes the stop-self call there is this
> RTAS call buffer corruption and this bad kernel stack pointer in CPU 105.
> We need to understand better what is causing this.
>
>> Bad kernel stack pointer 1fafb4b0 at 1ec15004
>> rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
>> rcu: 88-...!: (0 ticks this GP) idle=2ce/1/0x4000000000000000 softirq=28076/28076 fqs=78
>> rcu: (detected by 72, t=10866 jiffies, g=180529, q=2526)
>> Sending NMI from CPU 72 to CPUs 88:
>> CPU 88 didn't respond to backtrace IPI, inspecting paca.
>> irq_soft_mask: 0x01 in_mce: 0 in_nmi: 0 current: 22978 (drmgr)
>> Back trace of paca->saved_r1 (0xc0000006f94ab750) (possibly stale):
>> Call Trace:
>> [c0000006f94ab750] [c0000006f94ab790] 0xc0000006f94ab790 (unreliable)
>> [c0000006f94ab930] [c0000000000373f8] va_rtas_call_unlocked+0xc8/0xe0
>> [c0000006f94ab950] [c000000000037a98] rtas_call+0x98/0x200
>> [c0000006f94ab9a0] [c0000000000d7d28] smp_query_cpu_stopped+0x58/0xe0
>> [c0000006f94aba20] [c0000000000d9dbc] pseries_cpu_die+0x1ec/0x240
>> [c0000006f94abad0] [c00000000004f284] __cpu_die+0x44/0x60
>> [c0000006f94abaf0] [c0000000000d8e10] dlpar_cpu_remove+0x160/0x340
>> [c0000006f94abbc0] [c0000000000d9184] dlpar_cpu_release+0x74/0x100
>> [c0000006f94abc10] [c000000000025a74] arch_cpu_release+0x44/0x70
>> [c0000006f94abc30] [c0000000009bd1bc] cpu_release_store+0x4c/0x80
>> [c0000006f94abc60] [c0000000009ae000] dev_attr_store+0x40/0x70
>> [c0000006f94abc80] [c000000000495810] sysfs_kf_write+0x70/0xb0
>> [c0000006f94abca0] [c00000000049453c] kernfs_fop_write+0x17c/0x250
>> [c0000006f94abcf0] [c0000000003ccb6c] __vfs_write+0x4c/0x1f0
>> [c0000006f94abd80] [c0000000003ccf74] vfs_write+0xd4/0x240
>> [c0000006f94abdd0] [c0000000003cd338] ksys_write+0x68/0x110
>> [c0000006f94abe20] [c00000000000b288] system_call+0x5c/0x70
>
> So CPU 88 is the one driving the hot unplug and waiting for CPU 105 to
> die. But it is stuck inside RTAS. Perhaps because of the call buffer
> corruption?
>
>> rcu: rcu_sched kthread starved for 10709 jiffies! g180529 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=72
>> rcu: RCU grace-period kthread stack dump:
>> rcu_sched I 0 11 2 0x00000808
>> Call Trace:
>> [c0000000061ab840] [c0000003a4a84800] 0xc0000003a4a84800 (unreliable)
>> [c0000000061aba20] [c00000000001e24c] __switch_to+0x2dc/0x430
>> [c0000000061aba80] [c000000000e5fb94] __schedule+0x3d4/0xa20
>> [c0000000061abb50] [c000000000e6022c] schedule+0x4c/0xc0
>> [c0000000061abb80] [c000000000e64ffc] schedule_timeout+0x1dc/0x4e0
>> [c0000000061abc80] [c0000000001af40c] rcu_gp_kthread+0xc3c/0x11f0
>> [c0000000061abdb0] [c00000000013c7c8] kthread+0x168/0x1b0
>> [c0000000061abe20] [c00000000000b658] ret_from_kernel_thread+0x5c/0x64
>
> I don't know what to make of CPU 72. :-) Perhaps it's the one making
> the other "rogue" RTAS call interfering with stop-self in CPU 105?
>
> It must be some RTAS call made with rtas_call_unlocked, because CPU 88
> is holding the RTAS lock.
>
> -- Thiago Jung Bauermann
> IBM Linux Technology Center
>

--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
mwb@xxxxxxxxxxxxxxxxxx