Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ

From: Joel Fernandes
Date: Fri Jan 22 2021 - 14:12:58 EST

Next message: Nick Desaulniers: "Re: [PATCH v4 8/8] mm: Mark anonymous struct field of 'struct vm_fault' as 'const'"
Previous message: Joe Perches: "Re: [PATCH] checkpatch: add warning for avoiding .L prefix symbols in assembly files"
In reply to: Vincent Guittot: "Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ"
Next in thread: Qais Yousef: "Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Vincent,

Thanks for reply. Please see the replies below:

On Fri, Jan 22, 2021 at 05:56:22PM +0100, Vincent Guittot wrote:
> On Fri, 22 Jan 2021 at 16:46, Joel Fernandes (Google)
> <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On an octacore ARM64 device running ChromeOS Linux kernel v5.4, I found
> > that there are a lot of calls to update_blocked_averages(). This causes
> > the schedule loop to slow down to taking upto 500 micro seconds at
> > times (due to newidle load balance). I have also seen this manifest in
> > the periodic balancer.
> >
> > Closer look shows that the problem is caused by the following
> > ingredients:
> > 1. If the system has a lot of inactive CGroups (thanks Dietmar for
> > suggesting to inspect /proc/sched_debug for this), this can make
> > __update_blocked_fair() take a long time.
>
> Inactive cgroups are removed from the list so they should not impact
> the duration

I meant blocked CGroups. According to this code, a cfs_rq can be partially
decayed and not have any tasks running on it but its load needs to be
decayed, correct? That's what I meant by 'inactive'. I can reword it to
'blocked'.

* There can be a lot of idle CPU cgroups. Don't let fully
* decayed cfs_rqs linger on the list.
*/
if (cfs_rq_is_decayed(cfs_rq))
list_del_leaf_cfs_rq(cfs_rq);

> > 2. The device has a lot of CPUs in a cluster which causes schedutil in a
> > shared frequency domain configuration to be slower than usual. (the load
>
> What do you mean exactly by it causes schedutil to be slower than usual ?

sugov_next_freq_shared() is order number of CPUs in the a cluster. This
system is a 6+2 system with 6 CPUs in a cluster. schedutil shared policy
frequency update needs to go through utilization of other CPUs in the
cluster. I believe this could be adding to the problem but is not really
needed to optimize if we can rate limit the calls to update_blocked_averages
to begin with.

> > average updates also try to update the frequency in schedutil).
> >
> > 3. The CPU is running at a low frequency causing the scheduler/schedutil
> > code paths to take longer than when running at a high CPU frequency.
>
> Low frequency usually means low utilization so it should happen that much.

It happens a lot as can be seen with schbench. It is super easy to reproduce.

schedule() can result in new idle balance with the CFS pick call happening
often. Here is a function graph trace. The tracer shows
update_blocked_averages taking a lot of time.

sugov:0-2454 [002] 2657.992570: funcgraph_entry: | load_balance() {
sugov:0-2454 [002] 2657.992577: funcgraph_entry: | update_group_capacity() {
sugov:0-2454 [002] 2657.992580: funcgraph_entry: 2.656 us | __msecs_to_jiffies();
sugov:0-2454 [002] 2657.992585: funcgraph_entry: 2.447 us | _raw_spin_lock_irqsave();
sugov:0-2454 [002] 2657.992591: funcgraph_entry: 2.552 us | _raw_spin_unlock_irqrestore();
sugov:0-2454 [002] 2657.992595: funcgraph_exit: + 17.448 us | }
sugov:0-2454 [002] 2657.992597: funcgraph_entry: 1.875 us | update_nohz_stats();
sugov:0-2454 [002] 2657.992601: funcgraph_entry: 1.667 us | idle_cpu();
sugov:0-2454 [002] 2657.992605: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992608: funcgraph_entry: + 33.333 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992643: funcgraph_exit: + 38.073 us | }
sugov:0-2454 [002] 2657.992645: funcgraph_entry: 1.770 us | idle_cpu();
sugov:0-2454 [002] 2657.992649: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992651: funcgraph_entry: + 41.823 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992694: funcgraph_exit: + 45.729 us | }
sugov:0-2454 [002] 2657.992696: funcgraph_entry: 1.823 us | idle_cpu();
sugov:0-2454 [002] 2657.992700: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992702: funcgraph_entry: + 35.312 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992740: funcgraph_exit: + 39.792 us | }
sugov:0-2454 [002] 2657.992742: funcgraph_entry: 1.771 us | idle_cpu();
sugov:0-2454 [002] 2657.992746: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992748: funcgraph_entry: + 33.438 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992783: funcgraph_exit: + 37.500 us | }
sugov:0-2454 [002] 2657.992785: funcgraph_entry: 1.771 us | idle_cpu();
sugov:0-2454 [002] 2657.992790: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992792: funcgraph_entry: + 45.521 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992839: funcgraph_exit: + 49.323 us | }
sugov:0-2454 [002] 2657.992842: funcgraph_entry: 1.823 us | idle_cpu();
sugov:0-2454 [002] 2657.992847: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992850: funcgraph_entry: + 67.187 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992919: funcgraph_exit: + 72.031 us | }
sugov:0-2454 [002] 2657.992921: funcgraph_entry: 2.760 us | idle_cpu();
sugov:0-2454 [002] 2657.992926: funcgraph_entry: | update_nohz_stats() {
sugov:0-2454 [002] 2657.992928: funcgraph_entry: + 61.146 us | update_blocked_averages();
sugov:0-2454 [002] 2657.992992: funcgraph_exit: + 65.886 us | }
sugov:0-2454 [002] 2657.992994: funcgraph_entry: 1.771 us | idle_cpu();
sugov:0-2454 [002] 2657.992998: funcgraph_exit: ! 430.209 us | }
sugov:0-2454 [002] 2657.993006: bprint: trace_long: wtf: lb: 432916
sugov:0-2454 [002] 2657.993017: bprint: trace_long: wtf: newidle_balance: 501458

> > The fix is simply rate limit the calls to update_blocked_averages to 20
> > times per second. It appears that updating the blocked average less
> > often is sufficient. Currently I see about 200 calls per second
>
> Would be good to explain why updating less often is sufficient ?

I don't know this code that well, intuitively it seems to me updating blocked
averages at such a high rate seems pointless. But I defer to your expertise
on that. Why do you feel an update is needed at least HZ times per second?
What about system with HZ=1000 or 300, that seems to be an insane rate of
updating (not to mention all the complexity of going through the leaf cgroup
list and doing the frequency updates).

> > sometimes, which seems overkill.
> >
> > schbench shows a clear improvement with the change:
>
> Have you got more details about your test setup ?
> which platform ?
> which kernel ?

I mentioned in the commit message it is a v5.4 kernel.

The platform is Snapdragon 7c. The platform is:
https://www.qualcomm.com/products/snapdragon-7c-compute-platform

>
> >
> > Without patch:
> > ~/schbench -t 2 -m 2 -r 5
> > Latency percentiles (usec) runtime 5 (s) (212 total samples)
> > 50.0th: 210 (106 samples)
> > 75.0th: 619 (53 samples)
> > 90.0th: 665 (32 samples)
> > 95.0th: 703 (11 samples)
> > *99.0th: 12656 (8 samples)
> > 99.5th: 12784 (1 samples)
> > 99.9th: 13424 (1 samples)
> > min=15, max=13424
> >
> > With patch:
> > ~/schbench -t 2 -m 2 -r 5
> > Latency percentiles (usec) runtime 5 (s) (214 total samples)
> > 50.0th: 188 (108 samples)
> > 75.0th: 238 (53 samples)
> > 90.0th: 623 (32 samples)
> > 95.0th: 657 (12 samples)
> > *99.0th: 717 (7 samples)
> > 99.5th: 725 (2 samples)
> > 99.9th: 725 (0 samples)
> >
> > Cc: Paul McKenney <paulmck@xxxxxxxxxx>
> > Cc: Frederic Weisbecker <fweisbec@xxxxxxxxx>
> > Suggested-by: Dietmar Eggeman <dietmar.eggemann@xxxxxxx>
> > Co-developed-by: Qais Yousef <qais.yousef@xxxxxxx>
> > Signed-off-by: Qais Yousef <qais.yousef@xxxxxxx>
> > Signed-off-by: Joel Fernandes (Google) <joel@xxxxxxxxxxxxxxxxx>
> >
> > ---
> > kernel/sched/fair.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 04a3ce20da67..fe2dc0024db5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8381,7 +8381,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
> > if (!cpumask_test_cpu(cpu, nohz.idle_cpus_mask))
> > return false;
> >
> > - if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick))
> > + if (!force && !time_after(jiffies, rq->last_blocked_load_update_tick + (HZ/20)))
>
> This condition is there to make sure to update blocked load at most
> once a tick in order to filter newly idle case otherwise the rate
> limit is already done by load balance interval

What prevents newidle_balance from hitting this code? Are you suggesting rate
limit that as well?

> This hard coded (HZ/20) looks really like an ugly hack

Yes, it was not a well researched number. If you have a suggestion for better
fix, let me know.

thanks,

- Joel

> > return true;
> >
> > update_blocked_averages(cpu);
> > --
> > 2.30.0.280.ga3ce27912f-goog
> >

Next message: Nick Desaulniers: "Re: [PATCH v4 8/8] mm: Mark anonymous struct field of 'struct vm_fault' as 'const'"
Previous message: Joe Perches: "Re: [PATCH] checkpatch: add warning for avoiding .L prefix symbols in assembly files"
In reply to: Vincent Guittot: "Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ"
Next in thread: Qais Yousef: "Re: [PATCH] sched/fair: Rate limit calls to update_blocked_averages() for NOHZ"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]