Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()

From: Peter Zijlstra
Date: Thu Jun 01 2023 - 07:13:42 EST

Next message: Christoph Hellwig: "Re: [PATCH v5 RESEND 07/17] arc: mm: Convert to GENERIC_IOREMAP"
Previous message: Christoph Hellwig: "Re: [PATCH v5 RESEND 14/17] mm/ioremap: Consider IOREMAP space in generic ioremap"
In reply to: K Prateek Nayak: "Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()"
Next in thread: Peter Zijlstra: "Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()"

On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> Sharing some initial benchmark results with the patch below.
>
> tl;dr
>
> - Hackbench starts off well but performance drops as the number of groups
> increases.
>
> - schbench (old), tbench, netperf see improvement but there is a band of
> outlier results when system is fully loaded or slightly overloaded.
>
> - Stream and ycsb-mongodb don't mind the extra search.
>
> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
> very happy.

Figures :/ Every time something like this is changed someone gets to be
sad..

> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
> running in NPS1 mode. Following is the simplified machine topology:

Right, Zen3 has 8 cores / LLC, so 64 cores per node gives 8 LLCs per node.
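
For reference, the arithmetic behind that count as a minimal sketch. The
helper names are hypothetical, and 2 threads per core is an assumption read
off the 64C/128T figure quoted above; in NPS1 each socket is one node.

```c
#include <assert.h>

/* Hypothetical helper: number of LLC domains inside one NUMA node. */
static int llcs_per_node(int cores_per_node, int cores_per_llc)
{
	/* Only meaningful when the node divides evenly into LLCs. */
	assert(cores_per_llc > 0 && cores_per_node % cores_per_llc == 0);
	return cores_per_node / cores_per_llc;
}

/* Hypothetical helper: logical CPUs covered by one LLC domain. */
static int cpus_per_llc(int cores_per_llc, int threads_per_core)
{
	return cores_per_llc * threads_per_core;
}
```

With 64 cores per node and 8 cores per LLC, llcs_per_node(64, 8) gives 8, so
a local LLC miss leaves up to 7 remote LLCs to look at within the node.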

> ~~~~~~~~~~~~~~~~~~~~~~~
> ~ SPECjbb - Multi-JVM ~
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS1
>
> - Default Scheduler Tunables
>
>   kernel              max-jOPS            critical-jOPS
>   tip                 100.00%             100.00%
>   peter-next-level     94.45% (-5.55%)     98.25% (-1.75%)
>
> - Modified Scheduler Tunables
>
>   kernel              max-jOPS            critical-jOPS
>   tip                 100.00%             100.00%
>   peter-next-level    100.00% (0.00%)     102.41% (2.41%)

I'm slightly confused, either the default or the tuned is better. Given
it's counting ops, I'm thinking higher is more better, so isn't this an
improvement in the tuned case?

> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
>   Pinning   Scaling   tip       peter-next-level
>   1 CCD     1         100.00%   100.30% (%diff:  0.30%)
>   2 CCD     2         100.00%   100.17% (%diff:  0.17%)
>   4 CCD     4         100.00%    99.60% (%diff: -0.40%)
>   8 CCD     8         100.00%    92.05% (%diff: -7.95%) *

Right, so that's a definite loss.

> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> cases but I've not tried tinkering with it yet. I'll continue
> testing on other NPS modes which would decrease the search scope.
> I'll also try running the same bunch of workloads on an even larger
> 4th Generation EPYC server to see if the behavior there is similar.

> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > +	struct sched_domain *parent = sd->parent;
> > +	struct sched_group *sg;
> > +
> > +	/* Make sure to not cross nodes. */
> > +	if (!parent || parent->flags & SD_NUMA)
> > +		return -1;
> > +
> > +	sg = parent->groups;
> > +	do {
> > +		int cpu = cpumask_first(sched_group_span(sg));
> > +		struct sched_domain *sd_child;
> > +
> > +		sd_child = per_cpu(sd_llc, cpu);
> > +		if (sd_child != sd) {
> > +			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);

Given how SIS_UTIL is inside select_idle_cpu() it should already be
effective here, no?

> > +			if ((unsigned)i < nr_cpumask_bits)
> > +				return i;
> > +		}
> > +
> > +		sg = sg->next;
> > +	} while (sg != parent->groups);
> > +
> > +	return -1;
> > +}

This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
isn't too much of a bother; so perhaps something like so?

(on top of tip/sched/core from just a few hours ago, as I had to 'fix'
this patch and force pushed the thing)

And yeah, random hacks and heuristics here :/ Does there happen to be
additional topology that could aid us here? Does the CCD fabric itself
have a distance metric we can use?

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e0a249e0a8..f1d6ed973410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *parent = sd->parent;
 	struct sched_group *sg;
+	int nr = 4;

 	/* Make sure to not cross nodes. */
 	if (!parent || parent->flags & SD_NUMA)
@@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 						test_idle_cores(cpu), cpu);
 			if ((unsigned)i < nr_cpumask_bits)
 				return i;
+
+			if (!--nr)
+				return -1;
 		}

 		sg = sg->next;
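
To make the effect of the 'nr' budget concrete, here is a rough user-space
model of the loop, not kernel code. struct llc_model and scan_remote_llcs()
are hypothetical stand-ins: each LLC is reduced to "has an idle CPU or not",
and the ring walk starts at the local LLC the way the group list walk above
starts at parent->groups. It only illustrates which LLCs get visited before
the scan gives up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one LLC domain. */
struct llc_model {
	bool has_idle;	/* true if an idle CPU would be found here */
	int idle_cpu;	/* CPU id reported when has_idle is true */
};

/*
 * Walk the ring of LLCs starting at 'local', skip the local LLC itself,
 * and stop after 'budget' remote LLCs came up empty. Returns the idle
 * CPU id, or -1 if nothing was found within the budget. '*scanned'
 * reports how many remote LLCs were actually looked at.
 */
static int scan_remote_llcs(const struct llc_model *llcs, size_t nr_llcs,
			    size_t local, int budget, int *scanned)
{
	size_t i = local;

	assert(nr_llcs > 0 && local < nr_llcs && budget > 0);
	*scanned = 0;

	do {
		if (i != local) {
			(*scanned)++;
			if (llcs[i].has_idle)
				return llcs[i].idle_cpu;

			/* Mirrors 'if (!--nr) return -1;' in the patch. */
			if (!--budget)
				return -1;
		}

		i = (i + 1) % nr_llcs;
	} while (i != local);

	return -1;
}
```

With 8 LLCs, local LLC 0 and the only idle CPU in LLC 5, a budget of 4 visits
LLCs 1 through 4 and returns -1, while a budget of 8 would find the CPU on
the fifth remote LLC. That is the trade-off here: a bounded search cost
against missing idle CPUs that sit further around the ring.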
