RE: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee are already in same LLC

From: Song Bao Hua (Barry Song)
Date: Wed May 26 2021 - 17:38:32 EST




> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@xxxxxxxxxxxxx]
> Sent: Thursday, May 27, 2021 12:16 AM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: vincent.guittot@xxxxxxxxxx; mingo@xxxxxxxxxx; dietmar.eggemann@xxxxxxx;
> rostedt@xxxxxxxxxxx; bsegall@xxxxxxxxxx; mgorman@xxxxxxx;
> valentin.schneider@xxxxxxx; juri.lelli@xxxxxxxxxx; bristot@xxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; guodong.xu@xxxxxxxxxx; yangyicong
> <yangyicong@xxxxxxxxxx>; tangchengchang <tangchengchang@xxxxxxxxxx>;
> Linuxarm <linuxarm@xxxxxxxxxx>
> Subject: Re: [PATCH] sched: fair: don't depend on wake_wide if waker and wakee
> are already in same LLC
>
>
> $subject is weird; sched/fair: is the right tag, and then start with a
> capital letter.
>
> On Wed, May 26, 2021 at 09:10:57PM +1200, Barry Song wrote:
> > when waker and wakee are already in the same LLC, it is pointless to worry
> > about the competition caused by pulling wakee to waker's LLC domain.
>
> But there's more than LLC.

I suppose the other concerns are the idle state and the load of the
waker's CPU and the wakee's prev_cpu. Even with wake_wide() bypassed,
wake_affine() can still select the wakee's prev_cpu rather than pull
the wakee to the waker. So bypassing wake_wide() does not mean we
will always pull:

static int wake_affine(struct sched_domain *sd, struct task_struct *p,
		       int this_cpu, int prev_cpu, int sync)
{
	int target = nr_cpumask_bits;

	if (sched_feat(WA_IDLE))
		target = wake_affine_idle(this_cpu, prev_cpu, sync);

	if (sched_feat(WA_WEIGHT) && target == nr_cpumask_bits)
		target = wake_affine_weight(sd, p, this_cpu, prev_cpu, sync);

	if (target == nr_cpumask_bits)
		return prev_cpu;

	..
	return target;
}

Furthermore, select_idle_sibling() can also pick wakee's prev_cpu
if it is idle:

static int select_idle_sibling(struct task_struct *p, int prev, int target)
{
	...

	/*
	 * If the previous CPU is cache affine and idle, don't be stupid:
	 */
	if (prev != target && cpus_share_cache(prev, target) &&
	    (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
	    asym_fits_capacity(task_util, prev))
		return prev;
	...
}
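
To make the point concrete, here is a rough user-space sketch of the
two fallbacks above. It is only a model under my own simplifying
assumptions: struct cpu_model, NO_TARGET, model_wake_affine() and
model_select_idle_sibling() are hypothetical names, the sync hint and
capacity checks are left out, and "load" is a plain number rather than
the kernel's effective load.

```c
#include <stdbool.h>

/* Hypothetical sentinel standing in for nr_cpumask_bits ("no decision"). */
#define NO_TARGET (-1)

/* Hypothetical, heavily simplified per-CPU state; not a kernel structure. */
struct cpu_model {
	bool idle;		/* stands in for available_idle_cpu() */
	unsigned long load;	/* stands in for the load compared by WA_WEIGHT */
};

/*
 * Simplified model of wake_affine(): an idle step, then a load step,
 * and prev_cpu as the fallback when neither step picks a target.
 */
int model_wake_affine(const struct cpu_model *cpus, int this_cpu, int prev_cpu)
{
	int target = NO_TARGET;

	/* Idle step (assumption: this_cpu and prev_cpu share the LLC). */
	if (cpus[this_cpu].idle)
		target = cpus[prev_cpu].idle ? prev_cpu : this_cpu;
	else if (cpus[prev_cpu].idle)
		target = prev_cpu;

	/* Load step: only pull to the waker if its CPU is less loaded. */
	if (target == NO_TARGET && cpus[this_cpu].load < cpus[prev_cpu].load)
		target = this_cpu;

	/* Nothing decided: stay on the wakee's previous CPU. */
	if (target == NO_TARGET)
		return prev_cpu;

	return target;
}

/*
 * Simplified model of the prev check in select_idle_sibling(): even
 * when target is the waker's CPU, an idle cache-affine prev wins.
 */
int model_select_idle_sibling(const struct cpu_model *cpus, int prev,
			      int target, bool share_cache)
{
	if (prev != target && share_cache && cpus[prev].idle)
		return prev;

	return target;
}
```

In this model, a busy waker CPU with a higher load than prev_cpu never
gets the wakee, whether or not wake_wide() was consulted first.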

Apart from those, could you please give me a hint about what other
concerns you have?

>
> > Signed-off-by: Barry Song <song.bao.hua@xxxxxxxxxxxxx>
> > ---
> > kernel/sched/fair.c | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3248e24a90b0..cfb1bd47acc3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6795,7 +6795,15 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu,
> int wake_flags)
> > new_cpu = prev_cpu;
> > }
> >
> > - want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, p->cpus_ptr);
> > + /*
> > + * we use wake_wide to make smarter pull and avoid cruel
> > + * competition because of jam-packed tasks in waker's LLC
> > + * domain. But if waker and wakee have been already in
> > + * same LLC domain, it seems it is pointless to depend
> > + * on wake_wide
> > + */
> > + want_affine = (cpus_share_cache(cpu, prev_cpu) || !wake_wide(p)) &&
> > + cpumask_test_cpu(cpu, p->cpus_ptr);
> > }
>
> And no supportive numbers...

Sorry for the confusion.

I actually posted some supporting numbers in the thread below, from
which this patch was derived:
https://lore.kernel.org/lkml/bbc339cef87e4009b6d56ee37e202daf@xxxxxxxxxxxxx/

When I tried to give Dietmar some pgbench data in that thread,
I found that on kunpeng920, with the software running in one die/NUMA
node whose 24 cores share an LLC, disabling wake_wide() gave the best
pgbench result.

                 llc_as_factor          don't_use_wake_wide
Hmean     1      10869.27 (   0.00%)    10723.08 *  -1.34%*
Hmean     8      19580.59 (   0.00%)    19469.34 *  -0.57%*
Hmean     12     29643.56 (   0.00%)    29520.16 *  -0.42%*
Hmean     24     43194.47 (   0.00%)    43774.78 *   1.34%*
Hmean     32     40163.23 (   0.00%)    40742.93 *   1.44%*
Hmean     48     42249.29 (   0.00%)    48329.00 *  14.39%*
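
For anyone checking the figures, the percentages in the second column
are the change relative to the llc_as_factor column. A minimal sketch
of that arithmetic (relative_change_pct() is just a hypothetical helper
name, not part of mmtests):

```c
/*
 * Relative change of "value" against "baseline", in percent.
 * Positive means the second kernel scored higher than the baseline.
 */
double relative_change_pct(double baseline, double value)
{
	return (value - baseline) / baseline * 100.0;
}
```

For the Hmean 48 row, relative_change_pct(42249.29, 48329.00) gives
roughly 14.39, which matches the table.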

The test was run with mmtests (https://github.com/gormanm/mmtests)
using:
  ./run-mmtests.sh --config ./configs/config-db-pgbench-timed-ro-medium test_tag

The commit "sched: Implement smarter wake-affine logic"
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=62470419
says that pgbench benefits from wake_wide(), but I have actually
seen the opposite result when waker and wakee are already
in one LLC.
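
To show how narrow the behavioural change in the patch is, here is a
small user-space sketch of the old and new want_affine conditions as
plain boolean functions. The function names and the three boolean
inputs are hypothetical stand-ins for cpus_share_cache(cpu, prev_cpu),
wake_wide(p) and cpumask_test_cpu(cpu, p->cpus_ptr); nothing here is
kernel code.

```c
#include <stdbool.h>

/* Condition before the patch: the LLC relationship is not considered. */
bool want_affine_old(bool share_cache, bool wide, bool allowed)
{
	(void)share_cache;
	return !wide && allowed;
}

/* Condition after the patch: sharing the LLC bypasses wake_wide(). */
bool want_affine_new(bool share_cache, bool wide, bool allowed)
{
	return (share_cache || !wide) && allowed;
}

/* Count the input combinations for which the two conditions disagree. */
int count_differences(void)
{
	int diff = 0;

	for (int s = 0; s < 2; s++)
		for (int w = 0; w < 2; w++)
			for (int a = 0; a < 2; a++)
				if (want_affine_old(s, w, a) !=
				    want_affine_new(s, w, a))
					diff++;

	return diff;
}
```

The two conditions disagree for exactly one of the eight input
combinations: waker and wakee share the LLC, wake_wide() returns true,
and the waker's CPU is in the wakee's allowed mask. That is the only
case in which the patch changes want_affine.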

I am not sure whether this is specific to kunpeng920; perhaps
I need to run the same test on some x86 machines.

Thanks
Barry