Re: Problem with commit 31e77c93e432 "sched/fair: Update blocked load when newly idle"

From: Vincent Guittot
Date: Wed Mar 28 2018 - 10:02:12 EST


Hi,

On 28 March 2018 at 12:37, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> Hi,
>
> On 03/24/2018 01:47 PM, Heiner Kallweit wrote:
>>
>> Am 24.03.2018 um 07:46 schrieb Vincent Guittot:
>>>
>>> Hi Heiner,
>>>
>>> Le Friday 23 Mar 2018 Ã 22:28:09 (+0100), Heiner Kallweit a Ãcrit :
>>>>
>>>> Recently I started to get the following problems with linux-next:
>>>>
>>>> - When working via Putty/SSH on the system the console frequently
>>>> freezes
>>>> for few seconds. Sometimes only opening a second console makes the
>>>> first one react again.
>>>>
>>>> - I get "INFO: rcu_sched detected stalls on CPUs/tasks:" warnings as
>>>> described in [1].
>>>>
>
> I can't catch this issue on my Juno r0 (arm64 big.Little).
>
> root@juno:~# uname -r
> 4.16.0-rc4-00198-g31e77c93e432
>
> I'm using openssh-client and openssh-server though.

I think that I have finally been able to reproduce it on my hikey
(octo cortex-A53) after unplugging 6 cores and waiting for almost 2
hours
This seems to happen only on dual core system as I haven't faced that
before on the hikey which I have used for my tests

[ 191.365730] CPU2: shutdown
[ 191.368482] psci: CPU2 killed.
[ 195.601017] CPU3: shutdown
[ 195.603767] psci: CPU3 killed.
[ 199.037500] CPU4: shutdown
[ 199.040251] psci: CPU4 killed.
[ 201.813237] CPU5: shutdown
[ 201.815996] psci: CPU5 killed.
[ 204.624902] CPU6: shutdown
[ 204.627646] psci: CPU6 killed.
[ 207.652478] CPU7: shutdown
[ 207.655204] psci: CPU7 killed.
[ 6017.160463] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 6017.166151] 1-...!: (4 GPs behind) idle=e20/0/0 softirq=10820/10864 fqs=0
[ 6017.173113] (detected by 0, t=20705 jiffies, g=1389, c=1388, q=27)
[ 6017.179386] Task dump for CPU 1:
[ 6017.182612] swapper/1 R running task 0 0 1 0x00000000
[ 6017.189666] Call trace:
[ 6017.192120] __switch_to+0x8c/0xd0
[ 6017.195524] cpuidle_enter_state+0x64/0x360
[ 6017.199706] cpuidle_enter+0x18/0x20
[ 6017.203282] call_cpuidle+0x18/0x30
[ 6017.206771] do_idle+0x1a4/0x1e0
[ 6017.209999] cpu_startup_entry+0x20/0x28
[ 6017.213923] secondary_start_kernel+0x188/0x1c8
[ 6017.218457] rcu_preempt kthread starved for 20705 jiffies! g1389
c1388 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
[ 6017.228985] rcu_preempt I 0 8 2 0x00000000
[ 6017.234474] Call trace:
[ 6017.236918] __switch_to+0x8c/0xd0
[ 6017.240322] __schedule+0x1b8/0x730
[ 6017.243810] schedule+0x38/0xa0
[ 6017.246952] schedule_timeout+0x194/0x428
[ 6017.250964] rcu_gp_kthread+0x4d4/0x780
[ 6017.254802] kthread+0xfc/0x128
[ 6017.257942] ret_from_fork+0x10/0x18
[ 6066.541736] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 6066.547428] 1-...!: (5 GPs behind) idle=e28/0/0 softirq=10820/10864 fqs=0
[ 6066.554392] (detected by 0, t=12345 jiffies, g=1390, c=1389, q=48)
[ 6066.560666] Task dump for CPU 1:
[ 6066.563893] swapper/1 R running task 0 0 1 0x00000000
[ 6066.570948] Call trace:
[ 6066.573404] __switch_to+0x8c/0xd0
[ 6066.576809] cpuidle_enter_state+0x64/0x360
[ 6066.580992] cpuidle_enter+0x18/0x20
[ 6066.584568] call_cpuidle+0x18/0x30
[ 6066.588056] do_idle+0x1a4/0x1e0
[ 6066.591284] cpu_startup_entry+0x20/0x28
[ 6066.595208] secondary_start_kernel+0x188/0x1c8
[ 6066.599742] rcu_preempt kthread starved for 12345 jiffies! g1390
c1389 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=0
[ 6066.610270] rcu_preempt I 0 8 2 0x00000000
[ 6066.615758] Call trace:
[ 6066.618203] __switch_to+0x8c/0xd0
[ 6066.621607] __schedule+0x1b8/0x730
[ 6066.625095] schedule+0x38/0xa0
[ 6066.628236] schedule_timeout+0x194/0x428
[ 6066.632249] rcu_gp_kthread+0x4d4/0x780
[ 6066.636087] kthread+0xfc/0x128

>
>>>> Bisecting the issue resulted in:
>>>>
>>>> 31e77c93e432dec79c7d90b888bbfc3652592741 is the first bad commit
>>>> commit 31e77c93e432dec79c7d90b888bbfc3652592741
>>>> Author: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>>>> Date: Wed Feb 14 16:26:46 2018 +0100
>>>>
>>>> sched/fair: Update blocked load when newly idle
>>>>
>>>> When NEWLY_IDLE load balance is not triggered, we might need to
>>>> update the
>>>> blocked load anyway. We can kick an ilb so an idle CPU will take
>>>> care of
>>>> updating blocked load or we can try to update them locally before
>>>> entering
>>>> idle. In the latter case, we reuse part of the nohz_idle_balance.
>>>>
>>>> After reversing this commit at least the issue with the freezing console
>>>> is gone. The second one appeared only sporadically, I still have to see
>>>> whether it pops up again.
>
>
> [...]
>