Re: [PATCH v5 6/6] sched/topology: Do not clear SD_PREFER_SIBLING in domains with clusters

From: Vincent Guittot

Date: Fri Jun 26 2026 - 10:55:51 EST


On Fri, 26 Jun 2026 at 02:10, Ricardo Neri
<ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote:
>
> On Tue, Jun 23, 2026 at 10:14:57PM -0700, Ricardo Neri wrote:
> > On Tue, Jun 23, 2026 at 09:26:57AM +0200, Vincent Guittot wrote:
> > > On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> > > <ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote:
> > > >
> > > > Some topologies have scheduling domains that contain CPUs of asymmetric
> > > > capacity, grouped into two or more clusters of equal-capacity CPUs
> > > > sharing an L2 cache. When CONFIG_SCHED_CLUSTER is enabled, load must be
> > > > balanced across these clusters.
> > > >
> > > > Do not clear SD_PREFER_SIBLING in the child domains to indicate to the
> > > > load balancer that it should spread load among cluster siblings.
> > > >
> > > > Checks for capacity in update_sd_pick_busiest(),
> > > > sched_balance_find_src_group(), and sched_balance_find_src_rq() prevent
> > > > migrations from high- to low-capacity CPUs if the busiest group is not
> > > > overloaded.
> > > >
> > > > CPUs with spare capacity, big or small, have always helped overloaded
> > > > groups. Once the overloading condition disappears, misfit load will still
> > > > be used to move high-utilization tasks to bigger CPUs if they have spare
> > > > capacity.
> > > >
> > > > Adding the SD_PREFER_SIBLING flag shifts load balancing in shared-LLC
> > > > domains from equalizing the number of idle CPUs to equalizing the number
> > > > of running tasks. This also enables migrations among clusters from newly-
> > > > idle load balance, where the outgoing task is already dequeued but the CPU
> > > > has not yet transitioned to idle.
> > > >
> > > > Reviewed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
> > > > Tested-by: Christian Loehle <christian.loehle@xxxxxxx>
> > > > Signed-off-by: Ricardo Neri <ricardo.neri-calderon@xxxxxxxxxxxxxxx>
> > > > ---
> > > > Changes in v5:
> > > > * Improved inline comments for accuracy.
> > > > * Added Tested-by tag from Christian. Thanks!
> > > >
> > > > Changes in v4:
> > > > * Added Reviewed-by tag from Tim. Thanks!
> > > >
> > > > Changes in v3:
> > > > * Updated documentation of SD_PREFER_SIBLING.
> > > > * Expanded the patch description to explain the behavior when overloaded
> > > > groups are involved.
> > > >
> > > > Changes in v2:
> > > > * Reworded the patch description for clarity.
> > > > * Kept parentheses around bitwise operators for clarity.
> > > > ---
> > > > include/linux/sched/sd_flags.h | 3 ++-
> > > > kernel/sched/topology.c | 14 ++++++++++++--
> > > > 2 files changed, 14 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/include/linux/sched/sd_flags.h b/include/linux/sched/sd_flags.h
> > > > index 42839cfa2778..f9a46fb8cacf 100644
> > > > --- a/include/linux/sched/sd_flags.h
> > > > +++ b/include/linux/sched/sd_flags.h
> > > > @@ -147,7 +147,8 @@ SD_FLAG(SD_ASYM_PACKING, SDF_NEEDS_GROUPS)
> > > > * Prefer to place tasks in a sibling domain
> > > > *
> > > > * Set up until domains start spanning NUMA nodes. Close to being a SHARED_CHILD
> > > > - * flag, but cleared below domains with SD_ASYM_CPUCAPACITY.
> > > > + * flag, but cleared below domains with SD_ASYM_CPUCAPACITY unless those child
> > > > + * domains have clusters of CPUs sharing cache.
> > > > *
> > > > * NEEDS_GROUPS: Load balancing flag.
> > > > */
> > > > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > > > index 622e2e01974c..261b407d0936 100644
> > > > --- a/kernel/sched/topology.c
> > > > +++ b/kernel/sched/topology.c
> > > > @@ -1995,8 +1995,18 @@ sd_init(struct sched_domain_topology_level *tl,
> > > > /*
> > > > * Convert topological properties into behaviour.
> > > > */
> > > > - /* Don't attempt to spread across CPUs of different capacities. */
> > > > - if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child)
> > > > + /*
> > > > + * Don't attempt to spread across CPUs of different capacities.
> > > > + *
> > > > + * If the child domain has clusters of CPUs sharing L2 cache, keep the
> > > > + * flag to spread tasks across clusters of identical capacity. Checks in
> > > > + * the load balancer prevent task migrations from high- to low-capacity
> > > > + * CPUs unless the source group is overloaded. Migrations to a lower-
> > > > + * capacity CPU can happen if a higher-capacity group is overloaded and
> > > > + * a lower-capacity CPU has spare capacity.
> > > > + */
> > > > + if ((sd->flags & SD_ASYM_CPUCAPACITY) && sd->child &&
> > > > + !(sd->child->flags & SD_CLUSTER))
> > > > sd->child->flags &= ~SD_PREFER_SIBLING;
> > >
> > > Last time I looked at this patch I was torn between your proposal
> > > above and simply keeping SD_PREFER_SIBLING for all HMP topologies. As
> > > added in the comment:
> > > "Checks in the load balancer prevent task migrations from high- to
> > > low-capacity CPUs unless the source group is overloaded."
> > > So, why should we special-case the (SD_ASYM_CPUCAPACITY && !SD_CLUSTER) topology?
> >
> > No reason, AFAICS. I just wanted to restrict the change to the target
> > topology of this patchset.
> >
> > But you raise a good point: given the checks in place in the load balancer,
> > it should be OK to keep SD_PREFER_SIBLING in all asymmetric topologies. I
> > will run a few experiments to confirm.
>
> I ran a few experiments with and without CONFIG_SCHED_CLUSTER. I ran N
> threads where N < nproc to ensure that sched groups were classified as
> has_spare or fully_busy. The threads saturated the CPUs to minimize task
> placement decisions at wake up.
>
> I observed that these threads remained on the CPUs with the highest
> capacity; they did not spread to lower-capacity CPUs.
>
> I repeated the experiment with EAS enabled and threads ramping up
> utilization. EAS kept them on small CPUs and later duly moved them to CPUs
> of higher capacity as they became misfits.
>
> I will update my patch to keep SD_PREFER_SIBLING regardless of asymmetric
> capacity.

Thanks
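
For reference, the three variants discussed in this thread can be compared
with a small user-space sketch. This is not kernel code: the flag bit values,
the enum names, and child_flags_after_init() are hypothetical and exist only
for illustration, and the sketch assumes the domain has a child (the real
sd_init() also checks sd->child). It models the condition before the patch,
the v5 condition, and the follow-up of keeping SD_PREFER_SIBLING regardless
of asymmetric capacity.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; the kernel generates the real ones. */
#define SD_ASYM_CPUCAPACITY (1u << 0)
#define SD_CLUSTER          (1u << 1)
#define SD_PREFER_SIBLING   (1u << 2)

/* Hypothetical names for the three variants discussed in the thread. */
enum policy {
	POLICY_BASELINE,    /* before the patch */
	POLICY_V5,          /* v5: keep the flag only if the child has clusters */
	POLICY_KEEP_ALWAYS, /* follow-up: never clear the flag here */
};

/*
 * Return the child's flags after the parent domain has been initialised.
 * Assumes a child domain exists.
 */
static unsigned int child_flags_after_init(enum policy p,
					   unsigned int parent_flags,
					   unsigned int child_flags)
{
	bool asym = (parent_flags & SD_ASYM_CPUCAPACITY) != 0;
	bool clear = false;

	switch (p) {
	case POLICY_BASELINE:
		clear = asym;
		break;
	case POLICY_V5:
		clear = asym && !(child_flags & SD_CLUSTER);
		break;
	case POLICY_KEEP_ALWAYS:
		clear = false;
		break;
	}

	if (clear)
		child_flags &= ~SD_PREFER_SIBLING;

	return child_flags;
}

/* Small self-check covering the asymmetric parent with a cluster child. */
static void policy_self_check(void)
{
	unsigned int child = SD_CLUSTER | SD_PREFER_SIBLING;

	assert(child_flags_after_init(POLICY_BASELINE,
				      SD_ASYM_CPUCAPACITY, child) == SD_CLUSTER);
	assert(child_flags_after_init(POLICY_V5,
				      SD_ASYM_CPUCAPACITY, child) == child);
	assert(child_flags_after_init(POLICY_KEEP_ALWAYS,
				      SD_ASYM_CPUCAPACITY, child) == child);
}
```

The only case where v5 and the follow-up differ is an asymmetric parent whose
child lacks SD_CLUSTER: v5 clears the flag there, the follow-up keeps it.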