Re: [tip: sched/core] sched/fair: Multi-LLC select_idle_sibling()
From: Peter Zijlstra
Date: Thu Jun 01 2023 - 07:13:42 EST
On Thu, Jun 01, 2023 at 03:03:39PM +0530, K Prateek Nayak wrote:
> Hello Peter,
>
> Sharing some initial benchmark results with the patch below.
>
> tl;dr
>
> - Hackbench starts off well but performance drops as the number of groups
> increases.
>
> - schbench (old), tbench, netperf see improvement but there is a band of
> outlier results when the system is fully loaded or slightly overloaded.
>
> - Stream and ycsb-mongodb don't mind the extra search.
>
> - SPECjbb (with default scheduler tunables) and DeathStarBench are not
> very happy.

Figures :/ Every time something like this is changed someone gets to be
sad..

> Tests were run on a dual socket 3rd Generation EPYC server (2 x 64C/128T)
> running in NPS1 mode. Following is the simplified machine topology:

Right, Zen3 8 cores / LLC, 64 cores total give 8 LLC per node.

> ~~~~~~~~~~~~~~~~~~~~~~~
> ~ SPECjbb - Multi-JVM ~
> ~~~~~~~~~~~~~~~~~~~~~~~
>
> o NPS1
>
> - Default Scheduler Tunables
>
> kernel max-jOPS critical-jOPS
> tip 100.00% 100.00%
> peter-next-level 94.45% (-5.55%) 98.25% (-1.75%)
>
> - Modified Scheduler Tunables
>
> kernel max-jOPS critical-jOPS
> tip 100.00% 100.00%
> peter-next-level 100.00% (0.00%) 102.41% (2.41%)

I'm slightly confused, either the default or the tuned is better. Given
it's counting ops, I'm thinking higher is more better, so isn't this an
improvement in the tuned case?

> ~~~~~~~~~~~~~~~~~~
> ~ DeathStarBench ~
> ~~~~~~~~~~~~~~~~~~
>
> Pinning Scaling tip peter-next-level
> 1 CCD 1 100.00% 100.30% (%diff: 0.30%)
> 2 CCD 2 100.00% 100.17% (%diff: 0.17%)
> 4 CCD 4 100.00% 99.60% (%diff: -0.40%)
> 8 CCD 8 100.00% 92.05% (%diff: -7.95%) *

Right, so that's a definite loss.

> I wonder if extending SIS_UTIL for SIS_NODE would help some of these
> cases but I've not tried tinkering with it yet. I'll continue
> testing on other NPS modes which would decrease the search scope.
> I'll also try running the same bunch of workloads on an even larger
> 4th Generation EPYC server to see if the behavior there is similar.
> > /*
> > + * For the multiple-LLC per node case, make sure to try the other LLC's if the
> > + * local LLC comes up empty.
> > + */
> > +static int
> > +select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
> > +{
> > +	struct sched_domain *parent = sd->parent;
> > +	struct sched_group *sg;
> > +
> > +	/* Make sure to not cross nodes. */
> > +	if (!parent || parent->flags & SD_NUMA)
> > +		return -1;
> > +
> > +	sg = parent->groups;
> > +	do {
> > +		int cpu = cpumask_first(sched_group_span(sg));
> > +		struct sched_domain *sd_child;
> > +
> > +		sd_child = per_cpu(sd_llc, cpu);
> > +		if (sd_child != sd) {
> > +			int i = select_idle_cpu(p, sd_child, test_idle_cores(cpu), cpu);

Given how SIS_UTIL is inside select_idle_cpu() it should already be
effective here, no?

> > +			if ((unsigned)i < nr_cpumask_bits)
> > +				return i;
> > +		}
> > +
> > +		sg = sg->next;
> > +	} while (sg != parent->groups);
> > +
> > +	return -1;
> > +}
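
To make the SIS_UTIL remark above concrete, here is a small userspace
model of a budgeted scan. It is not the kernel code. It assumes, going by
my reading of mainline around this time, that select_idle_cpu() takes a
per-LLC nr_idle_scan hint for the LLC of the CPU it is handed and gives up
once that budget is spent. Since select_idle_node() hands it a CPU from
the remote LLC, each remote scan would then be throttled by that LLC's own
hint. All model_* names are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-LLC shared state. */
struct model_llc_shared {
	int nr_idle_scan;	/* assumed: 0 means the LLC looks overloaded */
};

/*
 * Model of a budgeted idle-CPU scan over one LLC.
 * Returns the index of the first idle CPU found within budget, or -1.
 */
static int model_select_idle_cpu(const bool *idle, int nr_cpus,
				 const struct model_llc_shared *shared)
{
	/* +1 because "!--nr" is the stop condition in the loop below. */
	int nr = shared->nr_idle_scan + 1;

	/* No budget at all: do not even start scanning. */
	if (nr == 1)
		return -1;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		if (!--nr)
			return -1;
		if (idle[cpu])
			return cpu;
	}

	return -1;
}
```

With a hint of 0 the model returns -1 without looking at any CPU, and with
a hint of N it inspects at most N CPUs, so an idle CPU further into the
LLC than the budget allows is simply not found.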

This DeathStarBench thing seems to suggest that scanning up to 4 CCDs
isn't too much of a bother; so perhaps something like so?

(on top of tip/sched/core from just a few hours ago, as I had to 'fix'
this patch and force pushed the thing)

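
As a rough illustration of the capped walk suggested here, the sketch
below models one node as a ring of LLCs and stops after a fixed number of
remote LLCs have come up empty. It is a userspace toy under stated
assumptions: 8 LLCs of 16 CPUs each (matching the Zen3 box discussed
above), a plain idle flag per CPU instead of the real idle checks, and
made-up model_* names throughout.

```c
#include <stdbool.h>

#define MODEL_NR_LLC		8	/* assumed: 8 LLCs per node */
#define MODEL_CPUS_PER_LLC	16	/* assumed: 8 cores x 2 SMT threads */

/* Hypothetical stand-in for one LLC domain: an idle flag per CPU. */
struct model_llc {
	bool idle[MODEL_CPUS_PER_LLC];
};

/* Stand-in for the per-LLC scan: first idle CPU in the LLC, or -1. */
static int model_scan_llc(const struct model_llc *llc, int llc_id)
{
	for (int i = 0; i < MODEL_CPUS_PER_LLC; i++) {
		if (llc->idle[i])
			return llc_id * MODEL_CPUS_PER_LLC + i;
	}
	return -1;
}

/*
 * Walk the LLCs of one node in ring order starting at 'local', skip the
 * local LLC, and give up once 'budget' remote LLCs were scanned without
 * success. 'budget' must be >= 1; like "!--nr" in the diff below, a
 * budget of 0 would never hit the stop condition.
 */
static int model_select_idle_node(const struct model_llc *llcs, int local,
				  int budget)
{
	int id = local;

	do {
		if (id != local) {
			int cpu = model_scan_llc(&llcs[id], id);

			if (cpu >= 0)
				return cpu;

			if (!--budget)
				return -1;
		}

		id = (id + 1) % MODEL_NR_LLC;
	} while (id != local);

	return -1;
}
```

With a budget of 4 and the only idle CPU sitting five LLCs away from the
local one, the model returns -1 even though the node is not full. That is
the trade-off of the cap: bounded search cost in exchange for sometimes
missing an idle CPU.
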
And yeah, random hacks and heuristics here :/ Does there happen to be
additional topology that could aid us here? Does the CCD fabric itself
have a distance metric we can use?
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e0a249e0a8..f1d6ed973410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7036,6 +7036,7 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 {
 	struct sched_domain *parent = sd->parent;
 	struct sched_group *sg;
+	int nr = 4;

 	/* Make sure to not cross nodes. */
 	if (!parent || parent->flags & SD_NUMA)
@@ -7050,6 +7051,9 @@ select_idle_node(struct task_struct *p, struct sched_domain *sd, int target)
 						test_idle_cores(cpu), cpu);
 			if ((unsigned)i < nr_cpumask_bits)
 				return i;
+
+			if (!--nr)
+				return -1;
 		}

 		sg = sg->next;