Re: cpu stopper threads and load balancing leads to deadlock
From: Peter Zijlstra
Date: Thu May 03 2018 - 08:28:23 EST
On Thu, May 03, 2018 at 02:12:22PM +0200, Mike Galbraith wrote:
> [ 124.216939] =============================
> [ 124.216939] WARNING: suspicious RCU usage
> [ 124.216941] 4.17.0.g66d489e-tip-default #82 Tainted: G E
> [ 124.216941] -----------------------------
> [ 124.216943] kernel/sched/core.c:1614 suspicious rcu_dereference_check() usage!
> [ 124.216944]
> other info that might help us debug this:
>
> [ 124.216945]
> RCU used illegally from offline CPU!
> rcu_scheduler_active = 2, debug_locks = 0
> [ 124.216946] 4 locks held by swapper/2/0:
> [ 124.216947] #0: 000000001f9fa447 (stop_cpus_mutex){+.+.}, at: stop_machine_from_inactive_cpu+0x86/0x130
> [ 124.216953] #1: 000000004cb07b3b (&stopper->lock){..-.}, at: cpu_stop_queue_work+0x2d/0x80
> [ 124.216958] #2: 00000000d3a46b90 (&p->pi_lock){-.-.}, at: try_to_wake_up+0x2d/0x5f0
> [ 124.216964] #3: 00000000f360767b (rcu_read_lock){....}, at: rcu_read_lock+0x0/0x80
> [ 124.216969]
> stack backtrace:
> [ 124.216971] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G E 4.17.0.g66d489e-tip-default #82
> [ 124.216972] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
> [ 124.216973] Call Trace:
> [ 124.216977] dump_stack+0x78/0xb3
> [ 124.216979] ttwu_stat+0x121/0x130
> [ 124.216983] try_to_wake_up+0x2c2/0x5f0
> [ 124.216988] ? cpu_stop_park+0x30/0x30
> [ 124.216990] cpu_stop_queue_work+0x7c/0x80
> [ 124.216993] queue_stop_cpus_work+0x61/0xb0
> [ 124.216997] stop_machine_from_inactive_cpu+0xd3/0x130
> [ 124.216999] ? mtrr_restore+0x80/0x80
> [ 124.217005] mtrr_ap_init+0x62/0x70
> [ 124.217008] identify_secondary_cpu+0x18/0x80
> [ 124.217011] smp_store_cpu_info+0x44/0x50
> [ 124.217014] start_secondary+0x9a/0x1e0
> [ 124.217017] secondary_startup_64+0xa5/0xb0
Hurm.. I don't see how this is 'new'. We moved the wakeup out from under
the stopper lock, but that should not affect the RCU state.
The warning is of course valid: stop_machine_from_inactive_cpu()
explicitly runs on an 'offline' CPU. The patch didn't change this.
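To spell out that reasoning, here is a toy userspace model in C. It is a sketch, not kernel code, and every name in it (toy_cpu, toy_stopper, toy_rcu_access_ok(), toy_wake_under_lock(), toy_wake_after_unlock()) is hypothetical. It assumes the lockdep "RCU used illegally from offline CPU!" check can be reduced to a single "is this CPU online as far as RCU is concerned" flag. Under that assumption, doing the wakeup while holding the stopper lock or after dropping it gives the same verdict for a CPU that is still coming up through start_secondary().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: NOT kernel code, all names are hypothetical. */
struct toy_cpu {
    bool rcu_online;   /* has RCU been told this CPU is online? */
};

struct toy_stopper {
    bool locked;       /* models stopper->lock being held */
};

/* Assumed simplification of the debug check: an RCU read-side
 * access is only legal if the current CPU is online for RCU. */
static bool toy_rcu_access_ok(const struct toy_cpu *cpu)
{
    return cpu->rcu_online;
}

/* Old ordering: the wakeup (which touches RCU) runs under the lock. */
static bool toy_wake_under_lock(struct toy_cpu *cpu, struct toy_stopper *s)
{
    bool ok;

    assert(!s->locked);          /* lock must not already be held */
    s->locked = true;
    ok = toy_rcu_access_ok(cpu); /* wakeup path dereferences RCU */
    s->locked = false;
    return ok;
}

/* New ordering: queue under the lock, wake up after dropping it. */
static bool toy_wake_after_unlock(struct toy_cpu *cpu, struct toy_stopper *s)
{
    assert(!s->locked);
    s->locked = true;
    /* ... queue the stop work here ... */
    s->locked = false;
    return toy_rcu_access_ok(cpu); /* same RCU access, lock dropped */
}
```

For a booting CPU (rcu_online == false) both functions return false, i.e. "suspicious" either way; for an online CPU both return true. The lock placement never enters into the verdict, which is the point being made above.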