Re: [PATCH] sched/fair: Track target domain's avg_scan_cost in select_idle_cpu

From: Vincent Guittot
Date: Tue Jan 11 2022 - 11:30:34 EST


On Tue, 11 Jan 2022 at 09:32, Barry Song <21cnbao@xxxxxxxxx> wrote:
>
> On Tue, Jan 11, 2022 at 8:59 PM Vincent Guittot
> <vincent.guittot@xxxxxxxxxx> wrote:
> >
> > On Thu, 6 Jan 2022 at 10:10, Yicong Yang <yangyicong@xxxxxxxxxx> wrote:
> > >
> > > On 2022/1/4 22:18, Vincent Guittot wrote:
> > > > On Thu, 23 Dec 2021 at 09:23, Yicong Yang <yangyicong@xxxxxxxxxx> wrote:
> > > >>
> > > >> On 2021/12/22 18:47, Vincent Guittot wrote:
> > > >>> On Sat, 11 Dec 2021 at 11:43, Yicong Yang <yangyicong@xxxxxxxxxxxxx> wrote:
> > > >>>>
> > > >>>> We regulate the LLC domain scan in select_idle_cpu() by comparing
> > > >>>> the average scan cost of this_sd against the average idle time of
> > > >>>> this_rq. This is correct when the domain to scan is the LLC domain
> > > >>>> of this cpu. But when the domain to scan is different from this
> > > >>>> LLC domain, we'll have an inaccurate estimation of the scan cost
> > > >>>> on the target domain as this_sd->avg_scan_cost contains contributions
> > > >>>> of scanning other domains besides the target domain.
> > > >>>>
> > > >>>> Track the avg_scan_cost of the target domain to make the estimation
> > > >>>> more accurate.
> > > >>>>
> > > >>>> Signed-off-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
> > > >>>> ---
> > > >>>> kernel/sched/fair.c | 4 ++--
> > > >>>> 1 file changed, 2 insertions(+), 2 deletions(-)
> > > >>>>
> > > >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > >>>> index 6e476f6d9435..6301740d98cb 100644
> > > >>>> --- a/kernel/sched/fair.c
> > > >>>> +++ b/kernel/sched/fair.c
> > > >>>> @@ -6267,7 +6267,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > >>>> }
> > > >>>>
> > > >>>> avg_idle = this_rq->wake_avg_idle;
> > > >>>> - avg_cost = this_sd->avg_scan_cost + 1;
> > > >>>> + avg_cost = sd->avg_scan_cost + 1;
> > > >>>>
> > > >>>> span_avg = sd->span_weight * avg_idle;
> > > >>>> if (span_avg > 4*avg_cost)
> > > >>>> @@ -6305,7 +6305,7 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > > >>>> */
> > > >>>> this_rq->wake_avg_idle -= min(this_rq->wake_avg_idle, time);
> > > >>>>
> > > >>>> - update_avg(&this_sd->avg_scan_cost, time);
> > > >>>> + update_avg(&sd->avg_scan_cost, time);
> > > >>>
> > > >>> But then you can have several cpus updating the same value simultaneously
> > > >>>
> > > >>
> > > >> yes. sd->avg_scan_cost should includes the contributions of all the cpus scanned the sd.
> > > >>
> > > >> We regulated the scanning nr based on two things:
> > > >> - avg_idle: to indicate how much time we can have for this time scanning
> > > >> - avg_cost: to indicate how much time we'll spend for scanning the target domain based
> > > >> on the history cost
> > > >>
> > > >> Previously sd->avg_scan_cost may not reflect the cost as it count the scanning cost
> > > >> on the domain of the scanner cpu, which may not be the domain the cpu scanned.
> > > >> For example, cpu 0 on llc A scanned llc B and llc C, we'll count the cost of scanning B
> > > >> and C on llc A's avg_scan_cost and we'll use this to estimate the cost for scanning
> > > >> llc A, which is not accurate.
> > > >
> > > > I mean that you can now have several CPUs that will read, modify,
> > > > write sd->avg_scan_cost simultaneously without any protection
> > > >
> > >
> > > uh I misunderstood. not sure I've missed something, but looks like we also have this problem
> > > when updating &this_sd->avg_scan_cost?
> >
> > No because this_sd->avg_scan_cost is only used by the local cpu so you
> > don't have several cpus trying to update it simultaneously.
>
> Though this is a problem, does the original idea in this patch make some sense?
> CPUx might scan its own LLC and it might scan other LLCs. the scan cost should
> be depending on how busy the target LLC is?
>
> it seems improper to have one single scan cost for all targets/LLCs?

I'm not sure that this would make any real difference. We use local
cpu to prevent costly protection and cache line bouncing between cpus.
There are several shortfalls in the current way to estimate how many
cores we should scan.

>
> >
> > >
> > > >>
> > > >>>> }
> > > >>>>
> > > >>>> return idle_cpu;
> > > >>>> --
> > > >>>> 2.33.0
> > > >>>>
> > > >>> .
> > > >>>
> > > > .
> > > >
>
> Thanks
> Barry