Re: [PATCH 1/1] sched: cfs_rq h_load might not update due to irq disable

From: Peter Zijlstra
Date: Thu Nov 21 2019 - 07:38:09 EST


On Thu, Nov 21, 2019 at 04:30:09PM +0800, YT Chang wrote:
> Symptom:
>
> Two CPUs might do idle balance at the same time.
> One CPU does idle balance and pulls some tasks.
> However, before it picks the next task, ALL tasks are pulled back by the other CPU.
> That results in an infinite loop on both CPUs.

Can you easily reproduce this?

> =========================================
> code flow:
>
> in pick_next_task_fair()
>
> again:
>
>     if nr_running == 0
>         goto idle
>     pick next task
>     return
>
> idle:
>     idle_balance
>     /* pull some tasks from the other CPU;
>      * however, the other CPU is also doing idle balance
>      * and pulls these tasks back */
>
>     goto again
>
> =========================================
> ALL tasks get pulled back when task_h_load() is incorrect
> and too low.

Clearly you're not running a PREEMPT kernel, otherwise the break in
detach_tasks() would've saved you, right?
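
For illustration, a toy user-space model of that early exit, assuming the break meant here is the one taken after the first detached task during newly idle balance on a preemptible kernel. This is not the kernel code: struct toy_env, toy_detach_tasks() and the load units are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the balancing environment; not struct lb_env. */
struct toy_env {
    bool newly_idle;      /* balance triggered because this CPU just went idle */
    bool preempt_kernel;  /* models a kernel built with preemption enabled */
    int  imbalance;       /* load still to be moved, in arbitrary units */
};

/*
 * Detach tasks from a busy runqueue that holds nr_tasks tasks, each of
 * which is accounted as task_load units. Returns the number detached.
 */
static int toy_detach_tasks(struct toy_env *env, int nr_tasks, int task_load)
{
    int detached = 0;

    assert(task_load > 0);

    while (nr_tasks > 0 && env->imbalance > 0) {
        nr_tasks--;
        detached++;
        env->imbalance -= task_load;

        /* Preemptible kernel, newly idle balance: stop after one task. */
        if (env->preempt_kernel && env->newly_idle)
            break;
    }
    return detached;
}
```

With an underestimated per-task load (say 1 unit against an imbalance of 100), the non-preemptible variant of this model drains all four tasks from the runqueue, while the preemptible variant takes only one.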

> static unsigned long task_h_load(struct task_struct *p)
> {
>     struct cfs_rq *cfs_rq = task_cfs_rq(p);
>
>     update_cfs_rq_h_load(cfs_rq);
>     return div64_ul(p->se.avg.load_avg_contrib * cfs_rq->h_load,
>                     cfs_rq->runnable_load_avg + 1);
> }
>
> The cfs_rq->h_load is incorrect and might be too small.
> The original idea of cfs_rq::last_h_load_update is to not
> update cfs_rq::h_load more than once per jiffy.
> When the two CPUs pull tasks from each other in pick_next_task_fair(),
> IRQs are disabled, so jiffies is not updated.
> (Other CPUs wait for the runqueue locks held by the two CPUs,
> so ALL CPUs have IRQs disabled.)

This cannot be true; the loop drops rq->lock, so other CPUs
should have an opportunity to acquire the lock and make progress.
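
For reference, the once-per-jiffy caching under discussion can be modelled in user space roughly as below. This is only a sketch: struct toy_cfs_rq and both helpers are hypothetical simplifications, and the fresh_h_load parameter stands in for whatever a real recomputation would produce.

```c
/* Hypothetical, simplified stand-in for the fields discussed; not struct cfs_rq. */
struct toy_cfs_rq {
    unsigned long h_load;              /* cached hierarchical load */
    unsigned long last_h_load_update;  /* tick at which h_load was last computed */
    unsigned long runnable_load_avg;
};

/* Recompute h_load at most once per tick; fresh_h_load is what a recompute would yield. */
static void toy_update_h_load(struct toy_cfs_rq *cfs_rq, unsigned long now,
                              unsigned long fresh_h_load)
{
    if (cfs_rq->last_h_load_update == now)
        return;                        /* same tick: keep the cached value */

    cfs_rq->h_load = fresh_h_load;
    cfs_rq->last_h_load_update = now;
}

/* Same shape as the quoted task_h_load(): contrib * h_load / (runnable_load_avg + 1). */
static unsigned long toy_task_h_load(const struct toy_cfs_rq *cfs_rq,
                                     unsigned long load_avg_contrib)
{
    return load_avg_contrib * cfs_rq->h_load / (cfs_rq->runnable_load_avg + 1);
}
```

In this model a second update within the same tick is ignored, so a stale h_load keeps feeding the per-task value until the tick counter moves on. Whether the tick can really stall for long enough is exactly the point in question.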

> Solution:
> cfs_rq h_load might not be updated while IRQs are disabled;
> use sched_clock instead of jiffies.
>
> Signed-off-by: YT Chang <yt.chang@xxxxxxxxxxxx>
> ---
> kernel/sched/fair.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e..231c53f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7578,9 +7578,11 @@ static void update_cfs_rq_h_load(struct cfs_rq *cfs_rq)
>  {
>      struct rq *rq = rq_of(cfs_rq);
>      struct sched_entity *se = cfs_rq->tg->se[cpu_of(rq)];
> -    unsigned long now = jiffies;
> +    u64 now = sched_clock_cpu(cpu_of(rq));
>      unsigned long load;
>
> +    now = now * HZ >> 30;
> +
>      if (cfs_rq->last_h_load_update == now)
>          return;
>

This is disgusting and wrong. That is not the correct relation between
sched_clock() and jiffies.
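
One mismatch that is easy to check, assuming sched_clock_cpu() counts nanoseconds: shifting right by 30 divides by 2^30 = 1073741824, not by the 10^9 nanoseconds in a second, so the result runs about 7% slow relative to real ticks. The user-space sketch below compares the two; the function names and TOY_NSEC_PER_SEC are made up for the example, and the tick rate is passed in as a parameter instead of using the kernel's HZ.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NSEC_PER_SEC 1000000000ULL  /* nanoseconds per second */

/* The conversion from the patch: treats 2^30 ns as one second. */
static uint64_t ticks_by_shift(uint64_t ns, uint64_t hz)
{
    return (ns * hz) >> 30;
}

/*
 * Divide by the real tick period in nanoseconds.
 * Exact only when hz divides 10^9 evenly (e.g. 100, 250, 1000).
 */
static uint64_t ticks_by_division(uint64_t ns, uint64_t hz)
{
    assert(hz > 0 && hz <= TOY_NSEC_PER_SEC);
    return ns / (TOY_NSEC_PER_SEC / hz);
}
```

For 1000 seconds (10^12 ns) at 250 ticks per second, the division gives 250000 ticks while the shift gives 232830, so the shifted value is not a jiffy count in any meaningful sense.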