Re: [PATCH 1/2] cpuset: fix cpuset_cpus_allowed_fallback() don'tupdate tsk->rt.nr_cpus_allowed

From: Paul E. McKenney
Date: Sun May 15 2011 - 17:34:20 EST


On Fri, May 13, 2011 at 02:42:48PM +0800, Yong Zhang wrote:
> On Fri, May 13, 2011 at 1:48 PM, KOSAKI Motohiro
> <kosaki.motohiro@xxxxxxxxxxxxxx> wrote:
> > Hi
> >
> >> On Mon, 2011-05-02 at 19:55 +0900, KOSAKI Motohiro wrote:
> >>>
> >>> The rule is, we have to update tsk->rt.nr_cpus_allowed too if we change
> >>> tsk->cpus_allowed. Otherwise RT scheduler may confuse.
> >>>
> >>> This patch fixes it.
> >>>
> >>> btw, system_state checking is very important. current boot sequence is
> >>> (1) smp_init
> >>> (ie secondary cpus up and created cpu bound kthreads). (2)
> >>> sched_init_smp().
> >>> Then following bad scenario can be happen,
> >>>
> >>> (1) cpuup call notifier(CPU_UP_PREPARE)
> >>> (2) A cpu notifier consumer create FIFO kthread
> >>> (3) It call kthread_bind()
> >>>    ... but, now secondary cpu haven't ONLINE
> >>
> >> isn't
> >
> > thanks, correction.
> >
> >>
> >>> (3) schedule() makes fallback and cpuset_cpus_allowed_fallback
> >>>     change task->cpus_allowed
> >>
> >> I'm failing to see how this is happening, surely that kthread isn't
> >> actually running that early?
> >
> > If my understand correctly, current call graph is below.
> >
> > kernel_init()
> >        smp_init();
> >                cpu_up()
> >                        ... cpu hotplug notification
> >                                kthread_create()
> >        sched_init_smp();
> >
> >
> > So, cpu hotplug event is happen before sched_init_smp(). The old rule is,
> > all kthreads of using cpu-up notification have to use kthread_bind(). It
> > protected from sched load balance.
> >
> > but, now cpuset_cpus_allowed_fallback() forcedly change kthread's cpumask.
> > Why is this works? the point are two.
> >
> > - rcu_cpu_kthread_should_stop() call set_cpus_allowed_ptr() again
> > periodically.
> >  then, it can reset cpumask if cpuset_cpus_allowed_fallback() change it.
> >  my debug print obseve following cpumask change occur at boot time.
> >     1) kthread_bind: bind cpu1
> >     2) cpuset_cpus_allowed_fallback: bind possible cpu
> >     3) rcu_cpu_kthread_should_stop: rebind cpu1
> > - while tsk->rt.nr_cpus_allowed == 1, sched load balancer never be crash.
>
> Seems rcu_spawn_one_cpu_kthread() call wake_up_process() directly,
> which is under hotplug event CPU_UP_PREPARE. Maybe it should be
> under CPU_ONLINE.

Sorry, but this does not work. The kthread must be running by the time
the CPU appears, otherwise RCU grace periods in CPU_ONLINE notifiers
will never complete.

This did turn out to be a scheduler bug -- see Cheng Xu's recent patch.
(chengxu@xxxxxxxxxxxxxxxxxx)

Thanx, Paul

> Thanks,
> Yong
>
> >
> >
> >>
> >>> (4) find_lowest_rq() touch local_cpu_mask if task->rt.nr_cpus_allowed !=
> >>> 1,
> >>>     but it haven't been initialized.
> >>>
> >>> RCU folks plan to introduce such FIFO kthread and our testing hitted the
> >>> above issue. Then this patch also protect it.
> >>
> >> I'm fairly sure it doesn't, normal cpu-hotplug doesn't poke at
> >> system_state.
> >
> > If my understand correctly. it's pure scheduler issue. because
> >
> > - rcuc keep the old rule (ie an early spawned kthread have to call
> > kthread_bind)
> > - cpuset_cpus_allowed_fallback() is called from scheduler internal
> > - crash is happen in find_lowest_rq(). (following line)
> >
> >
> > static int find_lowest_rq(struct task_struct *task)
> > {
> >  (snip)
> >        if (cpumask_test_cpu(cpu, lowest_mask))   // HERE
> >
> >
> > IOW, I think cpuset_cpus_allowed_fallback() should NOT be changed
> > tsk->cpus_allowed until to finish sched_init_smp().
> >
> > Do you have an any alternative idea for this?
> >
> >
> >>> Signed-off-by: KOSAKI Motohiro<kosaki.motohiro@xxxxxxxxxxxxxx>
> >>> Cc: Oleg Nesterov<oleg@xxxxxxxxxx>
> >>> Cc: Peter Zijlstra<a.p.zijlstra@xxxxxxxxx>
> >>> Cc: Ingo Molnar<mingo@xxxxxxx>
> >>> ---
> >>>  include/linux/cpuset.h |    1 +
> >>>  kernel/cpuset.c        |    1 +
> >>>  kernel/sched.c         |    4 ++++
> >>>  3 files changed, 6 insertions(+), 0 deletions(-)
> >>>
> >>> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> >>> index f20eb8f..42dcbdc 100644
> >>> --- a/include/linux/cpuset.h
> >>> +++ b/include/linux/cpuset.h
> >>> @@ -147,6 +147,7 @@ static inline void cpuset_cpus_allowed(struct
> >>> task_struct *p,
> >>>  static inline int cpuset_cpus_allowed_fallback(struct task_struct *p)
> >>>  {
> >>>        cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
> >>> +       p->rt.nr_cpus_allowed = cpumask_weight(&p->cpus_allowed);
> >>>        return cpumask_any(cpu_active_mask);
> >>>  }
> >>>
> >>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> >>> index 1ceeb04..6e5bbe8 100644
> >>> --- a/kernel/cpuset.c
> >>> +++ b/kernel/cpuset.c
> >>> @@ -2220,6 +2220,7 @@ int cpuset_cpus_allowed_fallback(struct task_struct
> >>> *tsk)
> >>>                cpumask_copy(&tsk->cpus_allowed, cpu_possible_mask);
> >>>                cpu = cpumask_any(cpu_active_mask);
> >>>        }
> >>> +       tsk->rt.nr_cpus_allowed = cpumask_weight(&tsk->cpus_allowed);
> >>>
> >>>        return cpu;
> >>>  }
> >>
> >> I don't really see the point of doing this separately from your second
> >> patch, please fold them.
> >
> > Ok. Will do.
> >
> >>
> >>> diff --git a/kernel/sched.c b/kernel/sched.c
> >>> index fd4625f..bfcd219 100644
> >>> --- a/kernel/sched.c
> >>> +++ b/kernel/sched.c
> >>> @@ -2352,6 +2352,10 @@ static int select_fallback_rq(int cpu, struct
> >>> task_struct *p)
> >>>        if (dest_cpu<  nr_cpu_ids)
> >>>                return dest_cpu;
> >>>
> >>> +       /* Don't worry. It's temporary mismatch. */
> >>> +       if (system_state<  SYSTEM_RUNNING)
> >>> +               return cpu;
> >>> +
> >>>        /* No more Mr. Nice Guy. */
> >>>        dest_cpu = cpuset_cpus_allowed_fallback(p);
> >>>        /*
> >>
> >> Like explained, I don't believe this actually fixes your problem (its
> >> also disgusting).
> >
> > If anybody have an alternative idea, I have no reason to refuse it.
> >
> > Thanks.
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> >
>
>
>
> --
> Only stand for myself
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/