Re: [RFC PATCH 3/4] sched/fair: Calculate the scan depth for idle balance based on system utilization
From: Chen Yu
Date: Fri Jun 23 2023 - 10:33:47 EST
Hi Peter,
On 2023-06-21 at 13:17:21 +0200, Peter Zijlstra wrote:
> On Tue, Jun 13, 2023 at 12:18:57AM +0800, Chen Yu wrote:
> > When a CPU is about to enter idle, it invokes newidle_balance() to pull
> > some tasks from other runqueues. Although there is a per-domain
> > max_newidle_lb_cost to throttle the newidle_balance(), it would be
> > good to further limit the scan based on overall system utilization.
> > The reason is that there is no limit on how many CPUs may launch
> > newidle_balance() simultaneously. Since each newidle_balance() has to
> > traverse all the CPUs to calculate the statistics one by one, the
> > total time cost of newidle_balance() could be O(n^2). This is not
> > good for performance or power saving.
>
> Another possible solution is to keep struct sg_lb_stats in
> sd->child->shared (below the NUMA domains) and put a lock around it.
>
> Then have update_sd_lb_stats() do something like:
>
> 	struct sg_lb_stats *sgs = &sds->sgs;
>
> 	if (raw_spin_trylock(&sds->sg_lock)) {
> 		struct sg_lb_stats tmp;
>
> 		... collect tmp
>
> 		sds->sgs = tmp;
> 		raw_spin_unlock(&sds->sg_lock);
> 	}
>
> 	... use sgs
>
> Then you know you've always got a 'recent' copy but avoid the concurrent
> updates.
Thanks for taking a look and giving the suggestion! Yes, this is a good
idea: by doing this we can further limit the number of CPUs doing the
update in parallel, and allow a newidle CPU to reuse the data collected
by others for idle load balance. The lock allows only one CPU in that
domain to iterate over the whole group, so the bottleneck might depend
on how fast the CPU that grabs the lock can finish collecting the tmp
sgs data. For the MC domain this should not take too much time, and for
higher domains between MC and NUMA it depends on how many CPUs there
are in that domain. I'll create a prototype based on your suggestion
and measure the test data.
thanks,
Chenyu