Re: Endless soft-lockups for compiling workload since next-20200519

From: Frederic Weisbecker
Date: Thu May 21 2020 - 08:41:19 EST


On Thu, May 21, 2020 at 01:00:27PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 12:49:37PM +0200, Peter Zijlstra wrote:
> > On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> >
> > > This:
> > >
> > > > smp_call_function_single_async() { smp_call_function_single_async() {
> > > > // verified csd->flags != CSD_LOCK // verified csd->flags != CSD_LOCK
> > > > csd->flags = CSD_LOCK csd->flags = CSD_LOCK
> > >
> > > concurrent smp_call_function_single_async() using the same csd is what
> > > I'm looking at as well.
> >
> > So something like this ought to cure the fundamental problem and make
> > smp_call_function_single_async() more user friendly, but also more
> > expensive.
> >
> > The problem is that while the ILB case is easy to fix, I can't seem to
> > find an equally nice solution for the ttwu_remote_queue() case; that
> > would basically require sticking the wake_csd in task_struct, I'll also
> > post that.
> >
> > So it's either this:
>
> Or this:
>
> ---
> include/linux/sched.h | 4 ++++
> kernel/sched/core.c | 7 ++++---
> kernel/sched/fair.c | 2 +-
> kernel/sched/sched.h | 1 -
> 4 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f38d62c4632c..136ee400b568 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
> struct uclamp_se uclamp[UCLAMP_CNT];
> #endif
>
> +#ifdef CONFIG_SMP
> + call_single_data_t wake_csd;
> +#endif
> +
> #ifdef CONFIG_PREEMPT_NOTIFIERS
> /* List of struct preempt_notifier: */
> struct hlist_head preempt_notifiers;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b286469e26e..a7129652e89b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2320,7 +2320,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>
> if (llist_add(&p->wake_entry, &rq->wake_list)) {
> if (!set_nr_if_polling(rq->idle))
> - smp_call_function_single_async(cpu, &rq->wake_csd);
> + smp_call_function_single_async(cpu, &p->wake_csd);
> else
> trace_sched_wake_idle_without_ipi(cpu);
> }
> @@ -2921,6 +2921,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> #endif
> #if defined(CONFIG_SMP)
> p->on_cpu = 0;
> + p->wake_csd = (struct __call_single_data) {
> + .func = wake_csd_func,
> + };
> #endif
> init_task_preempt_count(p);
> #ifdef CONFIG_SMP
> @@ -6723,8 +6726,6 @@ void __init sched_init(void)
> rq->avg_idle = 2*sysctl_sched_migration_cost;
> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>
> - rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
> -
> INIT_LIST_HEAD(&rq->cfs_tasks);
>
> rq_attach_root(rq, &def_root_domain);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01f94cf52783..b6d8a7b991f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
> * is idle. And the softirq performing nohz idle load balance
> * will be run before returning from the IPI.
> */
> - smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> + smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);

My fear here is that if a previous call from the the same CPU but to another
target is still pending, the new one will be spuriously ignored.

Namely this could happen:

CPU 0 CPU 1
----- -----
local_irq_disable() or VMEXIT
kick_ilb() {
smp_call_function_single_async(CPU 1,
&this_rq()->nohz_csd);
}

kick_ilb() {
smp_call_function_single_async(CPU 2,
&this_rq()->nohz_csd) {
// IPI to CPU 2 ignored
if (csd->flags == CSD_LOCK)
return -EBUSY;
}
}

local_irq_enable();




But I believe we can still keep the remote csd if nohz_flags() are
strictly only set before the IPI and strictly only cleared from it.

And I still don't understand why trigger_load_balance() raise the
softirq without setting the current CPU as ilb. run_rebalance_domains()
thus ignores it most of the time in the end or it spuriously clear the
nohz_flags set by an IPI sender. Or there is something I misunderstood
there.

(Haven't checked the wake up case yet).