Re: [PATCH] sched/core: Fix kick offline cpu to do nohz idle load balance

From: Peter Zijlstra
Date: Mon Oct 10 2016 - 08:20:26 EST


On Mon, Oct 10, 2016 at 04:34:48PM +0800, Wanpeng Li wrote:
> > If there is a need to kick the idle load balancer, an ILB will be selected
> > to perform nohz idle load balance. However, if the selected ILB is in the
> > process of going offline, smp_sched_reschedule(), which generates a sched
> > IPI, will splat as above.
> >
> > CPU0                            CPU1
> >
> > find_new_ilb()
> >                                 set_rq_offline()
> > smp_sched_reschedule() Oops
> >                                 nohz_balance_exit_idle()
> >
> > This patch fixes it by exiting nohz idle balance before setting the CPU
> > offline.
>
> CPU0                              CPU1
>
> find_new_ilb()
>                                   nohz_balance_exit_idle()
>                                   set_rq_offline()
> smp_sched_reschedule()
>
> It seems that the patch still can't avoid this race, so any proposal
> is greatly appreciated. :)


Not sure how this can happen. scheduler_tick() -> trigger_load_balance()
-> nohz_balancer_kick() is called with IRQs disabled, which also implies
an RCU-sched read-side section.

And hotplug explicitly includes an rcu_sync_sched().

That would mean find_new_ilb() is 'broken' in that it considers !active
CPUs, which is not immediately obvious.
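
To make the find_new_ilb() concern concrete, here is a minimal userspace
sketch, not kernel code. It models the CPU masks as plain bitmasks;
NR_CPUS_SKETCH, mask_first(), pick_ilb_idle_only() and pick_ilb_active()
are hypothetical names made up for illustration, and whether filtering on
the active mask is the right fix is an assumption, not something
established in this thread.

```c
#include <stdint.h>

#define NR_CPUS_SKETCH 8	/* illustrative CPU count, not a kernel constant */

/*
 * Return the lowest CPU number set in mask, or NR_CPUS_SKETCH if the
 * mask is empty. This only models the idea of a "first CPU in mask"
 * lookup.
 */
static int mask_first(uint32_t mask)
{
	for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++)
		if (mask & (UINT32_C(1) << cpu))
			return cpu;

	return NR_CPUS_SKETCH;
}

/*
 * Pick an ILB from the nohz-idle mask alone. If a CPU that is going
 * offline is still set in idle_mask, it can be returned here.
 */
static int pick_ilb_idle_only(uint32_t idle_mask)
{
	return mask_first(idle_mask);
}

/*
 * Hypothetical variant: only consider CPUs that are both nohz idle
 * and still active, so a !active CPU is never picked.
 */
static int pick_ilb_active(uint32_t idle_mask, uint32_t active_mask)
{
	return mask_first(idle_mask & active_mask);
}
```

With idle_mask = 0x06 (CPU1 and CPU2 nohz idle) and active_mask = 0xfd
(CPU1 !active), pick_ilb_idle_only() returns 1, the CPU that is going
offline, while pick_ilb_active() returns 2.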