Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

From: Vincent Guittot
Date: Fri Jun 25 2021 - 04:50:32 EST


On Fri, 18 Jun 2021 at 18:14, Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx> wrote:
>
>
>
> On 6/18/21 3:28 AM, Vincent Guittot wrote:
>
> >>
> >> The current logic is that when a CPU becomes idle, next_balance
> >> occurs very shortly (usually in the next jiffy), as
> >> get_sd_balance_interval returns the next_balance in the next jiffy
> >> if the CPU is idle. However, in reality, I saw most CPUs are 95%
> >> busy on average for my workload and a task will wake up on an idle
> >> CPU shortly. So having frequent idle balancing towards briefly idle
> >> CPUs is counterproductive: it simply increases overhead and does
> >> not improve performance.
> >
> > Just to make sure that I understand your problem correctly, is it:
> > - that we have an ilb happening on the idle CPU, which consumes cycles
>
> That's right. The cycles are consumed heavily in update_blocked_averages()
> when cgroup is enabled.

But they are normally consumed on an idle CPU and the ILB checks
need_resched() before running load balance for the next idle CPU.
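
To make that concrete, here is a toy userspace model of such a loop.
It is not the kernel code: toy_ilb_pass(), the per-CPU cost array and
resched_at_us are made up for illustration. The CPU running the ILB
re-checks for a pending reschedule before handling each idle CPU:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not kernel code. cost_us[i] is a hypothetical time spent
 * updating blocked load and balancing on behalf of idle CPU i.
 * resched_at_us is the time at which a task wakes up on the CPU that
 * runs the ILB. Returns how many idle CPUs were handled before the
 * pass stopped.
 */
static size_t toy_ilb_pass(const unsigned int *cost_us, size_t nr_idle,
			   unsigned int resched_at_us)
{
	unsigned int now_us = 0;
	size_t done = 0;

	assert(cost_us != NULL || nr_idle == 0);

	for (size_t i = 0; i < nr_idle; i++) {
		/* Stand-in for the need_resched() check before each CPU. */
		if (now_us >= resched_at_us)
			break;
		now_us += cost_us[i];
		done++;
	}
	return done;
}
```

With four idle CPUs costing 30us each and a wakeup at 50us, this model
handles two CPUs and stops, so the wakeup waits for at most one
per-CPU step rather than for the whole pass.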

Does it mean that your problem is coming from update_blocked_averages()
spending a long time with rq_lock_irqsave and increasing the wakeup
latency of your short running task?

>
> > - or that the ilb will pull a task onto an idle CPU on which a task
> > will shortly wake up, which ends up with 2 tasks competing for the
> > same CPU?
> >
>
> Because for the OLTP workload I'm looking at, we have tasks that sleep
> for a short while and wake again very shortly (i.e. the CPU actually
> is ~95% busy on average), pulling tasks to such a CPU is really not
> helpful to improve overall CPU utilization in the system. So my
> intuition is for such almost fully busy CPU, we should defer load
> balancing to it (see prototype patch 3).

Note that this is the opposite of what you said earlier:
"
Though in our test environment, sysctl_sched_migration_cost was kept
much lower (25000) compared to the default (500000), to encourage
migrations to idle cpu and reduce latency.
"
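
For reference, the effect of that setting can be sketched with a
simplified model of the cache-hot test. toy_task_is_hot() is a made-up
helper and the 100us figure is only an example; the real check has
additional conditions that are left out here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified model: a task that last ran less than migration_cost_ns
 * ago is treated as cache hot, which makes the load balancer less
 * willing to migrate it. Names are illustrative only.
 */
static bool toy_task_is_hot(int64_t ns_since_last_run,
			    int64_t migration_cost_ns)
{
	assert(ns_since_last_run >= 0 && migration_cost_ns >= 0);
	return ns_since_last_run < migration_cost_ns;
}
```

In this model a task that last ran 100us ago is still hot with the
default of 500000 but not with 25000, so the lower value makes it
eligible for migration much sooner.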

But it will be quite hard to find a value that fits everybody's
requirements, and some will have use cases for which they want to pull
tasks even if the CPU is 95% busy. You can have 2ms of idle time while
having a utilization above 95%, and an ILB inside a core or at LLC is
somewhat cheap and would take advantage of those 2ms.
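
As a back-of-the-envelope check of those numbers, with a made-up helper
and illustrative busy times rather than measurements:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Utilization in percent over one busy + idle period, rounded down.
 * toy_util_pct() is a hypothetical helper for this example only.
 */
static unsigned int toy_util_pct(uint64_t busy_us, uint64_t idle_us)
{
	assert(busy_us + idle_us > 0);
	return (unsigned int)(busy_us * 100 / (busy_us + idle_us));
}
```

A CPU that runs for 38ms and then idles for 2ms is at 95%, and one that
runs for 98ms before the same 2ms idle window is at 98%. In both cases
the idle window is long compared with a balance inside a core or LLC.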

>
> Tim