Re: [PATCH v3] sched/fair: filter out overloaded cpus in SIS

From: Chen Yu
Date: Mon May 09 2022 - 11:21:25 EST


On Sun, May 8, 2022 at 1:50 AM Abel Wu <wuyun.abel@xxxxxxxxxxxxx> wrote:
>
> Hi Chen,
>
> On 5/8/22 12:09 AM, Chen Yu Wrote:
[cut]
> >> @@ -81,8 +81,20 @@ struct sched_domain_shared {
> >> atomic_t ref;
> >> atomic_t nr_busy_cpus;
> >> int has_idle_cores;
> >> +
> >> + /*
> >> + * Tracking of the overloaded cpus can be heavy, so start
> >> + * a new cacheline to avoid false sharing.
> >> + */
> > Although we put the following items into different cache line compared to
> > above ones, is it possible that there is still cache false sharing if
> > CPU1 is reading nr_overloaded_cpus while
> > CPU2 is updating overloaded_cpus?
>
> I think it's not false sharing, it's just cache contention. But yes,
> it is still possible if the two items mixed with others (by compiler)
> in one cacheline, which seems out of our control..
>
My understanding is that, since nr_overloaded_cpus starts a new cache
line, overloaded_cpus is very likely to land in that same cache line.
The read of nr_overloaded_cpus, which is done mainly by SIS, only escapes
cache false sharing if writes to the overloaded_cpus mask are infrequent
(maybe the tick-based update is infrequent enough). A naive thought:
could the overloaded_cpus mask and nr_overloaded_cpus be put into two
separate cache lines?
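
To make the layout concern concrete, here is a minimal userspace sketch,
not kernel code. It assumes 64-byte cache lines, replaces atomic_t with a
plain int, and uses the GNU aligned attribute in place of
____cacheline_aligned; the struct names sds_v3 and sds_split and the
helper functions are made up for illustration.

```c
#include <assert.h>   /* static_assert */
#include <stddef.h>   /* offsetof, size_t */

#define CACHELINE 64  /* assumption: 64-byte cache lines */
#define cacheline_aligned __attribute__((aligned(CACHELINE)))

/* Layout as in the patch: only the counter is cache-line aligned. */
struct sds_v3 {
	int ref;
	int nr_busy_cpus;
	int has_idle_cores;
	int nr_overloaded_cpus cacheline_aligned;
	unsigned long overloaded_cpus[];	/* must be last */
};

/* Hypothetical alternative: the mask also starts its own cache line. */
struct sds_split {
	int ref;
	int nr_busy_cpus;
	int has_idle_cores;
	int nr_overloaded_cpus cacheline_aligned;
	unsigned long overloaded_cpus[] cacheline_aligned;
};

static_assert(offsetof(struct sds_v3, nr_overloaded_cpus) % CACHELINE == 0,
	      "counter should start a cache line");

/* Index of the cache line that a given struct offset falls into. */
size_t cacheline_index(size_t offset)
{
	return offset / CACHELINE;
}

/* Returns 1 if the counter and the first mask word share a cache line. */
int v3_shares_line(void)
{
	return cacheline_index(offsetof(struct sds_v3, nr_overloaded_cpus)) ==
	       cacheline_index(offsetof(struct sds_v3, overloaded_cpus));
}

int split_shares_line(void)
{
	return cacheline_index(offsetof(struct sds_split, nr_overloaded_cpus)) ==
	       cacheline_index(offsetof(struct sds_split, overloaded_cpus));
}
```

With the patch's layout the counter and the start of the mask sit in the
same line; aligning the mask as well separates them, at the cost of
padding after the counter.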
> >> + atomic_t nr_overloaded_cpus ____cacheline_aligned;
> > ____cacheline_aligned seems to put nr_overloaded_cpus into data section, which
> > seems to be unnecessary. Would ____cacheline_internodealigned_in_smp
> > be more lightweight?
>
> I didn't see the difference between the two macros, it would be
> appreciated if you can shed some light.
>
Sorry, I mistook ____cacheline_aligned for __cacheline_aligned; only the
latter is put into a data section. Please ignore my previous comment.
> >> + unsigned long overloaded_cpus[]; /* Must be last */
> >> };
> >>
[cut]
> >> + /*
> >> + * It's unlikely to find an idle cpu if the system is under
> >> + * heavy pressure, so skip searching to save a few cycles
> >> + * and relieve cache traffic.
> >> + */
> >> + if (weight - nro < (nr >> 4) && !has_idle_core)
> >> + return -1;
> > In [1] we used util_avg to check if the domain is overloaded and quit
> > earlier, since util_avg would be more stable and contains historic data.
> > But I think nr_running in your patch could be used as a complementary
> > metric and added to update_idle_cpu_scan() in [1] IMO.
> >> +
> >> cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);
> >> + if (nro > 1)
> >> + cpumask_andnot(cpus, cpus, sdo_mask(sds));
> > If I understand correctly, this is the core of the optimization: SIS
> > filters out the busy cores. I wonder if it is possible to save historic
> > h_nr_running/idle_h_nr_running and use the average value (like the
> > calculation of avg_scan_cost)?
>
> Yes, I have already been working on that for several days, along
> with some improvements to load balance (group_has_spare).
> Ideally we can finally get rid of the cache issues.
>
OK, could you please also Cc me on the next version? I'd like to give
it a try.
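
For reference, the averaging I had in mind for h_nr_running could look
roughly like the sketch below. It is a userspace illustration only: it
uses a 1/8-weight running average, which as far as I recall is the shape
used for avg_scan_cost, and the function name update_avg_sketch and the
1024 fixed-point scale are assumptions of mine, not something from the
patch.

```c
#include <stdint.h>

#define NR_SCALE 1024	/* assumption: fixed-point scale for nr_running samples */

/*
 * Hypothetical running average: each call moves the average one eighth
 * of the way toward the new sample, so older samples decay geometrically.
 */
void update_avg_sketch(int64_t *avg, int64_t sample)
{
	int64_t diff = sample - *avg;

	*avg += diff / 8;
}
```

Feeding it h_nr_running * NR_SCALE on each update would give a value
that reacts to sustained load but ignores a single short spike.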

--
Thanks,
Chenyu