Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous

From: Chen, Yu C

Date: Wed Feb 18 2026 - 10:22:51 EST

On 2/18/2026 11:28 AM, K Prateek Nayak wrote:

Hello Tim,

On 2/18/2026 4:42 AM, Tim Chen wrote:

On Tue, 2026-02-17 at 13:39 +0530, K Prateek Nayak wrote:

Hello Chenyu,

[...snip...]

   */
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
DEFINE_PER_CPU(int, sd_llc_size);
-DEFINE_PER_CPU(int, sd_llc_id);
+DEFINE_PER_CPU(int, sd_llc_id) = -1;
DEFINE_PER_CPU(int, sd_share_id);
DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
@@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
      rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
      per_cpu(sd_llc_size, cpu) = size;
-    per_cpu(sd_llc_id, cpu) = id;
      rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
      sd = lowest_flag_domain(cpu, SD_CLUSTER);
@@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
      /* Set up domains for CPUs specified by the cpu_map: */
      for_each_cpu(i, cpu_map) {
-        struct sched_domain_topology_level *tl;
+        struct sched_domain_topology_level *tl, *tl_llc = NULL;
+        int lid;
          sd = NULL;
          for_each_sd_topology(tl) {
+            int flags = 0;
+
+            if (tl->sd_flags)
+                flags = (*tl->sd_flags)();
+
+            if (flags & SD_SHARE_LLC)
+                tl_llc = tl;

nit. This loop breaks out when sched_domain_span(sd) covers the entire
cpu_map, so it might not have reached the topmost SD_SHARE_LLC domain
yet. Is that cause for any concern?

Could you please elaborate a little more on this? If it covers the
entire cpu_map, shouldn't it stop going up to its parent domain?
Do you mean that sd_llc_1 and its parent sd_llc_2 could cover the same
cpu_map, and that tl_llc should be assigned to sd_llc_2 (with sd_llc_1
degenerated)?

I'm not sure if this is technically possible but assume following
topology:

[ LLC: 8-15 ]
[ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]

and the following series of events:

o All CPUs in the LLC are offline to begin with (a maxcpus=1-like scenario).

o CPUs 10-15 are onlined first.

o CPU8 is put in a separate root partition and brought online.
(XXX: I'm not 100% sure if this is possible in this order)

o build_sched_domains() will bail out at the SMT domain since the cpu_map
is covered by tl->mask(), leaving tl_llc = tl_smt.

o llc_id calculation uses the tl_smt->mask() which will not contain
CPUs 10-15 and CPU8 will get a unique LLC id even though there are
other online CPUs in the LLC with a different llc_id (!!!)

Instead, if we traversed to tl_mc, we would have seen all the online
CPUs in the MC and reused the llc_id from them. Might not be an issue on
its own but if this root partition is removed later, CPU8 will continue
to have the unique llc_id even after merging into the same MC domain.
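
To make the sequence above concrete, here is a minimal userspace sketch.
The llc_id calculation itself is in the snipped part of the patch, so the
"reuse an id found in the mask, else allocate a new one" rule below is an
assumption reconstructed from the description above. assign_llc_id(),
cpu_range() and reset_llc_ids() are hypothetical names, and the arrays are
toy stand-ins for the per-CPU sd_llc_id.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 16

/* Toy stand-ins for the per-CPU sd_llc_id and the id allocator. */
static int llc_id[NR_CPUS];
static int next_llc_id;

static void reset_llc_ids(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		llc_id[cpu] = -1;
	next_llc_id = 0;
}

/* Bitmask with bits lo..hi set, one bit per CPU. */
static uint32_t cpu_range(int lo, int hi)
{
	uint32_t mask = 0;

	for (int cpu = lo; cpu <= hi; cpu++)
		mask |= UINT32_C(1) << cpu;
	return mask;
}

/*
 * Assumed rule: reuse the id of any CPU in 'mask' that already has one,
 * otherwise hand out a fresh id. 'mask' plays the role of tl_llc->mask().
 */
static int assign_llc_id(int cpu, uint32_t mask)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	for (int i = 0; i < NR_CPUS; i++) {
		if ((mask & (UINT32_C(1) << i)) && llc_id[i] != -1) {
			llc_id[cpu] = llc_id[i];
			return llc_id[cpu];
		}
	}

	llc_id[cpu] = next_llc_id++;
	return llc_id[cpu];
}
```

With CPUs 10-15 already holding one id, calling assign_llc_id(8,
cpu_range(8, 9)) models stopping at tl_smt and gives CPU8 a fresh id,
while assign_llc_id(8, cpu_range(8, 15)) models reaching tl_mc and reuses
the existing one.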

There is really no reason to reuse the llc_id as far as cache aware scheduling
goes in its v3 revision (see my reply to Madadi on this patch).

I don't mind having some holes in the llc_id space when CPUs are
offlined either, but my major concern would be seeing an inconsistent
state where CPUs in the same MC domain end up with different llc_id
values after a bunch of hotplug activity.

I am thinking that simply rebuilding the LLC ids on every sched domain
rebuild is probably the cleanest solution.

Tim, do you mean resetting all CPUs' LLC ids to -1 whenever there is a
hotplug event in partition_sched_domains_locked(), and rebuilding them
from scratch in build_sched_domains(), so that the LLC id is refreshed
for every CPU? (I discussed this with Vineeth here:
https://lore.kernel.org/all/54e60704-b0f3-44df-9b83-070806b5a00c@xxxxxxxxx/)
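
To illustrate what "reset to -1 and rebuild from scratch" would mean, here
is a rough userspace sketch, not kernel code. It assumes a fixed toy
topology (CPUs 0-7 in one LLC, CPUs 8-15 in another). llc_mask_of() and
rebuild_llc_ids() are hypothetical names, and the real code would run
under the sched domain rebuild locking, which is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 16

/* Toy stand-in for the per-CPU sd_llc_id. */
static int llc_id[NR_CPUS];

/* Hypothetical topology: CPUs 0-7 share one LLC, CPUs 8-15 another. */
static uint32_t llc_mask_of(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu < 8 ? UINT32_C(0x00ff) : UINT32_C(0xff00);
}

/*
 * Drop every id, then walk the online CPUs and give each LLC that has
 * at least one online CPU the next free id. Returns the number of ids
 * handed out, so the ids in use are always 0..n-1 with no holes.
 */
static int rebuild_llc_ids(uint32_t online)
{
	int next_id = 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		llc_id[cpu] = -1;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		uint32_t span;

		/* Skip offline CPUs and CPUs already covered by a sibling. */
		if (!(online & (UINT32_C(1) << cpu)) || llc_id[cpu] != -1)
			continue;

		span = llc_mask_of(cpu) & online;
		for (int i = 0; i < NR_CPUS; i++)
			if (span & (UINT32_C(1) << i))
				llc_id[i] = next_id;
		next_id++;
	}

	return next_id;
}
```

Because every rebuild starts from -1, online CPUs in the same LLC always
end up with the same id regardless of the order in which they were
onlined, and offlining a whole LLC leaves no hole in the id space.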

There could be some races in cpus_share_cache() as llc_id gets
reassigned for some CPUs when they come online/offline. But we also have
similar races in the current mainline code. The worst it can do is cause
some temporarily sub-optimal task placement.

Thoughts?

If you are suggesting populating the sd_llc_id for all the CPUs on
topology rebuild, I'm not entirely against the idea.

On a separate note, if we add a dependency on SCHED_MC for SCHED_CACHE,
we can simply look at cpu_coregroup_mask() and either allocate a new
llc_id / borrow an llc_id in sched_cpu_activate() when a CPU is onlined,
or reassign them in sched_cpu_deactivate() if an entire LLC is offlined.
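
A rough userspace sketch of that hotplug-time alternative follows. It is
only one possible reading: "reassign" is modeled here as releasing the id
of a fully offlined LLC so that the next LLC to come online can take it.
coregroup_mask() is a toy stand-in for cpu_coregroup_mask(), and
cpu_activate() / cpu_deactivate() are hypothetical stand-ins for the work
that would be done from sched_cpu_activate() / sched_cpu_deactivate().

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 16

static int llc_id[NR_CPUS];   /* toy stand-in for per-CPU sd_llc_id */
static uint32_t online;       /* one bit per online CPU */
static uint32_t ids_in_use;   /* one bit per allocated llc_id */

/* Toy stand-in for cpu_coregroup_mask(): two LLCs of eight CPUs each. */
static uint32_t coregroup_mask(int cpu)
{
	return cpu < 8 ? UINT32_C(0x00ff) : UINT32_C(0xff00);
}

static void init_llc_ids(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		llc_id[cpu] = -1;
	online = 0;
	ids_in_use = 0;
}

/* Borrow the id of an online LLC sibling, else take the lowest free id. */
static int cpu_activate(int cpu)
{
	uint32_t peers;
	int id = -1;

	assert(cpu >= 0 && cpu < NR_CPUS);
	peers = coregroup_mask(cpu) & online;

	for (int i = 0; i < NR_CPUS && id == -1; i++)
		if (peers & (UINT32_C(1) << i))
			id = llc_id[i];

	if (id == -1) {
		for (id = 0; id < NR_CPUS; id++)
			if (!(ids_in_use & (UINT32_C(1) << id)))
				break;
		assert(id < NR_CPUS);
		ids_in_use |= UINT32_C(1) << id;
	}

	llc_id[cpu] = id;
	online |= UINT32_C(1) << cpu;
	return id;
}

/* Release the id once the last CPU of the LLC goes offline. */
static void cpu_deactivate(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (!(online & (UINT32_C(1) << cpu)))
		return;

	online &= ~(UINT32_C(1) << cpu);
	if (!(coregroup_mask(cpu) & online))
		ids_in_use &= ~(UINT32_C(1) << llc_id[cpu]);
	llc_id[cpu] = -1;
}
```

In this model an id only changes hands when an LLC is completely empty,
so CPUs that stay online never see their llc_id change underneath them.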

Prateek, may I know if you are thinking of updating each CPU's LLC id
during its own hotplug, rather than updating all per-CPU LLC ids in
build_sched_domains()?

thanks,
Chenyu