Re: [PATCH] workqueue: fix rebind bound workers warning
From: Wanpeng Li
Date: Mon May 09 2016 - 03:28:57 EST
Sorry for the quick ping, Tejun; I just hope this can make the upcoming
merge window. :-)
2016-05-05 9:41 GMT+08:00 Wanpeng Li <kernellwp@xxxxxxxxx>:
> From: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 16 at kernel/workqueue.c:4559 rebind_workers+0x1c0/0x1d0
> Modules linked in:
> CPU: 0 PID: 16 Comm: cpuhp/0 Not tainted 4.6.0-rc4+ #31
> Hardware name: IBM IBM System x3550 M4 Server -[7914IUW]-/00Y8603, BIOS -[D7E128FUS-1.40]- 07/23/2013
> 0000000000000000 ffff881037babb58 ffffffff8139d885 0000000000000010
> 0000000000000000 0000000000000000 0000000000000000 ffff881037babba8
> ffffffff8108505d ffff881037ba0000 000011cf3e7d6e60 0000000000000046
> Call Trace:
> dump_stack+0x89/0xd4
> __warn+0xfd/0x120
> warn_slowpath_null+0x1d/0x20
> rebind_workers+0x1c0/0x1d0
> workqueue_cpu_up_callback+0xf5/0x1d0
> notifier_call_chain+0x64/0x90
> ? trace_hardirqs_on_caller+0xf2/0x220
> ? notify_prepare+0x80/0x80
> __raw_notifier_call_chain+0xe/0x10
> __cpu_notify+0x35/0x50
> notify_down_prepare+0x5e/0x80
> ? notify_prepare+0x80/0x80
> cpuhp_invoke_callback+0x73/0x330
> ? __schedule+0x33e/0x8a0
> cpuhp_down_callbacks+0x51/0xc0
> cpuhp_thread_fun+0xc1/0xf0
> smpboot_thread_fn+0x159/0x2a0
> ? smpboot_create_threads+0x80/0x80
> kthread+0xef/0x110
> ? wait_for_completion+0xf0/0x120
> ? schedule_tail+0x35/0xf0
> ret_from_fork+0x22/0x50
> ? __init_kthread_worker+0x70/0x70
> ---[ end trace eb12ae47d2382d8f ]---
> notify_down_prepare: attempt to take down CPU 0 failed
>
> This bug can be reproduced with the config below and nohz_full= set to all CPUs:
>
> CONFIG_BOOTPARAM_HOTPLUG_CPU0=y
> CONFIG_DEBUG_HOTPLUG_CPU0=y
> CONFIG_NO_HZ_FULL=y
>
> The boot CPU handles housekeeping duty (unbound timers, workqueues,
> timekeeping, ...) on behalf of full dynticks CPUs. It must remain
> online when nohz full is enabled. Each notifier_block has a priority,
> and the relevant callbacks are ordered as follows:
>
> workqueue_cpu_up > tick_nohz_cpu_down > workqueue_cpu_down
>
> So the tick_nohz_cpu_down callback fails during down-prepare of CPU 0,
> and the notifier_blocks after tick_nohz_cpu_down are never called, which
> means the workers are never actually unbound. The hotplug state machine
> then falls back to undo and brings CPU 0 online again. Workers are
> rebound unconditionally even though they were never unbound, which
> triggers the warning in the process.
>
> This patch fixes it by checking for !DISASSOCIATED to avoid rebinding
> workers that are still bound.
>
> Cc: Tejun Heo <tj@xxxxxxxxxx>
> Cc: Lai Jiangshan <jiangshanlai@xxxxxxxxx>
> Suggested-by: Lai Jiangshan <jiangshanlai@xxxxxxxxx>
> Signed-off-by: Wanpeng Li <wanpeng.li@xxxxxxxxxxx>
> ---
> kernel/workqueue.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..cc18920 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -4525,6 +4525,12 @@ static void rebind_workers(struct worker_pool *pool)
>  						  pool->attrs->cpumask) < 0);
>
>  	spin_lock_irq(&pool->lock);
> +
> +	if (!(pool->flags & POOL_DISASSOCIATED)) {
> +		spin_unlock_irq(&pool->lock);
> +		return;
> +	}
> +
>  	pool->flags &= ~POOL_DISASSOCIATED;
>
>  	for_each_pool_worker(worker, pool) {
> --
> 1.9.1
>
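
The failure sequence described in the quoted changelog can be sketched as a
small userspace model. This is only an illustration, not kernel code: every
toy_* name is hypothetical, the whole notifier chain is reduced to two
down-prepare steps, and the pool state is reduced to a single boolean standing
in for POOL_DISASSOCIATED. The counter marks the point where, in this
simplified model, the kernel would emit the warning from the quoted trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct worker_pool; only the flag matters here. */
struct toy_pool {
	bool disassociated;	/* models POOL_DISASSOCIATED */
	int bogus_rebinds;	/* rebinds attempted on a pool that was never unbound */
};

enum { STEP_OK = 0, STEP_FAIL = -1 };

/* Models tick_nohz_cpu_down: refuses to take the housekeeping CPU down. */
static int toy_tick_nohz_cpu_down(bool housekeeping_cpu)
{
	return housekeeping_cpu ? STEP_FAIL : STEP_OK;
}

/* Models workqueue_cpu_down: this is the step that unbinds the workers. */
static int toy_workqueue_cpu_down(struct toy_pool *pool)
{
	pool->disassociated = true;
	return STEP_OK;
}

/* Models rebind_workers, with the check from the patch made optional. */
static void toy_rebind_workers(struct toy_pool *pool, bool guarded)
{
	if (!pool->disassociated) {
		if (guarded)
			return;			/* patched: nothing to rebind */
		pool->bogus_rebinds++;		/* unpatched: warning would fire */
	}
	pool->disassociated = false;
}

/*
 * Models the down-prepare walk. The nohz step runs before the workqueue
 * step, so a failure there leaves the pool bound, and the rollback then
 * calls the rebind path anyway.
 */
static int toy_cpu_down(struct toy_pool *pool, bool housekeeping_cpu,
			bool guarded)
{
	assert(pool != NULL);

	if (toy_tick_nohz_cpu_down(housekeeping_cpu) != STEP_OK) {
		toy_rebind_workers(pool, guarded);	/* undo path */
		return STEP_FAIL;
	}
	return toy_workqueue_cpu_down(pool);
}
```

Calling toy_cpu_down() on a zero-initialized pool with housekeeping_cpu set
and guarded cleared counts one bogus rebind; with guarded set the rollback
returns early and the count stays at zero, which is the behaviour the patch
aims for.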