Re: [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters

From: Ricardo Neri

Date: Thu Jun 25 2026 - 20:11:10 EST

On Tue, Jun 23, 2026 at 10:14:57PM -0700, Ricardo Neri wrote:
> On Tue, Jun 23, 2026 at 09:26:57AM +0200, Vincent Guittot wrote:
> > On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> > <ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote:
> > >
> > > Some topologies have scheduling domains that contain CPUs of asymmetric
> > > capacity, grouped into two or more clusters of equal-capacity CPUs
> > > sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
> > > balanced across these clusters.
> > >
> > > Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
> > > load balancer that it should spread load among cluster siblings.
> > >
> > > Checks for capacity in update_sd_pick_busiest(),
> > > sched_balance_find_src_group(), and sched_balance_find_src_rq() prevent
> > > migrations from high- to low-capacity CPUs if the busiest group is not
> > > overloaded.
> > >
> > > CPUs with spare capacity, big or small, have always helped overloaded
> > > groups. Once the overloading condition disappears, misfit load will still
> > > be used to move high-utilization tasks to bigger CPUs if they have spare
> > > capacity.
> > >
> > > Adding the SD_PREFER_SIBLING flag shifts load balancing in shared-LLC
> > > domains from equalizing the number of idle CPUs to equalizing the number
> > > of running tasks. This also enables migrations among clusters from newly-
> > > idle load balance, where the outgoing task is already dequeued but the CPU
> > > has not yet transitioned to idle.
> > >
> > > Reviewed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> > > Tested-by: Christian Loehle <christian.loehle@xxxxxxx>
> > > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@xxxxxxxxxxxxxxx>
> > > ---
> > > Changes in v5:
> > > * Improved inline comments for accuracy.
> > > * Added Tested-by tag from Christian. Thanks!
> > >
> > > Changes in v4:
> > > * Added Reviewed-by tag from Tim. Thanks!
> > >
> > > Changes in v3:
> > > * Updated documentation of SD_PREFER_SIBLING.
> > > * Expanded the patch description to explain the behavior when overloaded
> > > groups are involved.
> > >
> > > Changes in v2:
> > > * Reworded the patch description for clarity.
> > > * Kept parentheses around bitwise operators for clarity.
> > > ---
> > > include/linux/sched/sd_flags.h | 3 ++-
> > > kernel/sched/topology.c | 14 ++++++++++++--
> > > 2 files changed, 14 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > > index 42839cfa2778..f9a46fb8cacf 100644
> > > --- a/include/linux/sched/sd_flags.h
> > > +++ b/include/linux/sched/sd_flags.h
> > > @@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
> > > * Prefer to place tasks in a sibling domain
> > > *
> > > * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
> > > - * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
> > > + * flag, but cleared below domains with SD_ASYM_CPUCAPACITY unless those child
> > > + * domains have clusters of CPUs sharing cache.
> > > *
> > > * NEEDS_GROUPS: Load balancing flag.
> > > */
> > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > index 622e2e01974c..261b407d0936 100644
> > > --- a/kernel/sched/topology.c
> > > +++ b/kernel/sched/topology.c
> > > @@ -1995,8 +1995,18 @@ sd_init(struct sched_domain_topology_level *tl,
> > > /*
> > > * Convert topological properties into behaviour.
> > > */
> > > - /* Don't attempt to spread across CPUs of different capacities. */
> > > - if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > > + /*
> > > + * Don't attempt to spread across CPUs of different capacities.
> > > + *
> > > + * If the child domain has clusters of CPUs sharing L2 cache, keep the
> > > + * flag to spread tasks across clusters of identical capacity. Checks in
> > > + * the load balancer prevent task migrations from high- to low-capacity
> > > + * CPUs unless the source group is overloaded. Migrations to a lower-
> > > + * capacity CPU can happen if a higher-capacity group is overloaded and
> > > + * a lower-capacity CPU has spare capacity.
> > > + */
> > > + if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
> > > + !(sd->child->flags & SD_CLUSTER))
> > > sd->child->flags &= ~SD_PREFER_SIBLING;
> >
> > Last time I looked at this patch I was balanced between your proposal
> > above and simply keeping SD_PREFER_SIBLING for all HMP topologies. As
> > added in the comment:
> > " Checks in
> > * the load balancer prevent task migrations from high- to low-capacity
> > * CPUs unless the source group is overloaded.
> > "
> > So, why should we bother for (SD_ASYM_CPUCAPACITY && !SD_CLUSTER) topology ?
>
> No reason, AFAICS. I just wanted to restrict the change to the target
> topology of this patchset.
>
> But you raise a good point: given the checks in place in the load balancer,
> it should be OK to keep SD_PREFER_SIBLING in all asymmetric topologies. I
> will run a few experiments to confirm.

I ran a few experiments with and without CONFIG_SCHED_CLUSTER. I ran N
threads, where N < nproc, to ensure that sched groups were classified as
has_spare or fully_busy. The threads saturated their CPUs to minimize task
placement decisions at wakeup.

The threads remained on the CPUs with the highest capacity; I observed no
spreading.

I repeated the experiment with EAS enabled and with threads that ramp up
their utilization. EAS kept them on small CPUs, and they were later duly
moved to CPUs of higher capacity as they became misfits.

I will update my patch to keep SD_PREFER_SIBLING regardless of asymmetric
capacity.
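
To make the difference concrete, here is a rough userspace sketch of the
three behaviors discussed in this thread: the current code, v5 of this patch,
and the planned update. It is not the actual patch. The flag bit values and
the child_flags_*() helper names are made up for illustration; the kernel
generates the real flag values from include/linux/sched/sd_flags.h and does
this work inside sd_init().

```c
/*
 * Hypothetical bit values, for illustration only. The real values are
 * generated from include/linux/sched/sd_flags.h.
 */
#define SD_ASYM_CPUCAPACITY	(1u << 0)
#define SD_CLUSTER		(1u << 1)
#define SD_PREFER_SIBLING	(1u << 2)

/*
 * Current behaviour: a parent domain with SD_ASYM_CPUCAPACITY always clears
 * SD_PREFER_SIBLING in its child. Returns the resulting child flags.
 */
unsigned int child_flags_current(unsigned int parent, unsigned int child)
{
	if (parent & SD_ASYM_CPUCAPACITY)
		child &= ~SD_PREFER_SIBLING;

	return child;
}

/*
 * v5 of this patch: as above, but the child keeps SD_PREFER_SIBLING when it
 * has SD_CLUSTER set.
 */
unsigned int child_flags_v5(unsigned int parent, unsigned int child)
{
	if ((parent & SD_ASYM_CPUCAPACITY) && !(child & SD_CLUSTER))
		child &= ~SD_PREFER_SIBLING;

	return child;
}

/*
 * Planned update (assumed shape): asymmetric capacity no longer clears
 * SD_PREFER_SIBLING, so the child flags pass through unchanged.
 */
unsigned int child_flags_planned(unsigned int parent, unsigned int child)
{
	(void)parent;	/* The parent's asymmetry is no longer consulted. */

	return child;
}
```

With an asymmetric parent, a child with SD_CLUSTER | SD_PREFER_SIBLING loses
SD_PREFER_SIBLING only in the current code, while a child without SD_CLUSTER
keeps it only in the planned version.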