Re: [PATCH v3 04/21] sched/cache: Make LLC id continuous
From: K Prateek Nayak
Date: Tue Feb 17 2026 - 03:10:39 EST
Hello Chenyu,
On 2/17/2026 11:37 AM, Chen, Yu C wrote:
> Hi Prateek,
>
> On 2/16/2026 3:44 PM, K Prateek Nayak wrote:
>> Hello Tim, Chenyu,
>>
>> On 2/11/2026 3:48 AM, Tim Chen wrote:
>>> From: Chen Yu <yu.c.chen@xxxxxxxxx>
>>>
>>> Introduce an index mapping between CPUs and their LLCs. This provides
>>> a continuous per LLC index needed for cache-aware load balancing in
>>> later patches.
>>>
>>> The existing per_cpu llc_id usually points to the first CPU of the
>>> LLC domain, which is sparse and unsuitable as an array index. Using
>>> llc_id directly would waste memory.
>>>
>>> With the new mapping, CPUs in the same LLC share a continuous id:
>>>
>>> per_cpu(llc_id, CPU=0...15) = 0
>>> per_cpu(llc_id, CPU=16...31) = 1
>>> per_cpu(llc_id, CPU=32...47) = 2
>>> ...
>>>
>>> Once a CPU has been assigned an llc_id, this ID persists even when
>>> the CPU is taken offline and brought back online, which simplifies
>>> management of the ID.
>>>
>>> Co-developed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>>> Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
>>> Co-developed-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
>>> Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
>>> Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
>>> ---
>>>
>>> Notes:
>>> v2->v3:
>>> Allocate the LLC id according to the topology level data directly, rather
>>> than calculating from the sched domain. This simplifies the code.
>>> (Peter Zijlstra, K Prateek Nayak)
>>>
>>> kernel/sched/topology.c | 47 ++++++++++++++++++++++++++++++++++++++---
>>> 1 file changed, 44 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>>> index cf643a5ddedd..ca46b5cf7f78 100644
>>> --- a/kernel/sched/topology.c
>>> +++ b/kernel/sched/topology.c
>>> @@ -20,6 +20,7 @@ void sched_domains_mutex_unlock(void)
>>> /* Protected by sched_domains_mutex: */
>>> static cpumask_var_t sched_domains_tmpmask;
>>> static cpumask_var_t sched_domains_tmpmask2;
>>> +static int tl_max_llcs;
>>> static int __init sched_debug_setup(char *str)
>>> {
>>> @@ -658,7 +659,7 @@ static void destroy_sched_domains(struct sched_domain *sd)
>>> */
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_llc);
>>> DEFINE_PER_CPU(int, sd_llc_size);
>>> -DEFINE_PER_CPU(int, sd_llc_id);
>>> +DEFINE_PER_CPU(int, sd_llc_id) = -1;
>>> DEFINE_PER_CPU(int, sd_share_id);
>>> DEFINE_PER_CPU(struct sched_domain_shared __rcu *, sd_llc_shared);
>>> DEFINE_PER_CPU(struct sched_domain __rcu *, sd_numa);
>>> @@ -684,7 +685,6 @@ static void update_top_cache_domain(int cpu)
>>> rcu_assign_pointer(per_cpu(sd_llc, cpu), sd);
>>> per_cpu(sd_llc_size, cpu) = size;
>>> - per_cpu(sd_llc_id, cpu) = id;
>>> rcu_assign_pointer(per_cpu(sd_llc_shared, cpu), sds);
>>> sd = lowest_flag_domain(cpu, SD_CLUSTER);
>>> @@ -2567,10 +2567,18 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
>>> /* Set up domains for CPUs specified by the cpu_map: */
>>> for_each_cpu(i, cpu_map) {
>>> - struct sched_domain_topology_level *tl;
>>> + struct sched_domain_topology_level *tl, *tl_llc = NULL;
>>> + int lid;
>>> sd = NULL;
>>> for_each_sd_topology(tl) {
>>> + int flags = 0;
>>> +
>>> + if (tl->sd_flags)
>>> + flags = (*tl->sd_flags)();
>>> +
>>> + if (flags & SD_SHARE_LLC)
>>> + tl_llc = tl;
>>
>> nit. This loop breaks out when sched_domain_span(sd) covers the entire
>> cpu_map and it might have not reached the topmost SD_SHARE_LLC domain
>> yet. Is that cause for any concern?
>>
>
> Could you please elaborate a little more on this? If it covers the
> entire cpu_map, shouldn't it stop going up to its parent domain?
> Do you mean that sd_llc_1 and its parent sd_llc_2 could cover the same
> cpu_map, and we should let tl_llc be assigned to sd_llc_2 (with sd_llc_1
> being degenerated)?
I'm not sure if this is technically possible, but assume the following
topology:
[ LLC: 8-15 ]
[ SMT: 8,9 ][ SMT: 10,11 ] ... [ SMT: 14,15 ]
and the following series of events:
o All CPUs in the LLC are offline to begin with (a maxcpus=1 like scenario).
o CPUs 10-15 are onlined first.
o CPU8 is put in a separate root partition and brought online.
(XXX: I'm not 100% sure if this is possible in this order)
o build_sched_domains() will bail out at the SMT domain since the cpu_map
  is covered by tl->mask(), leaving tl_llc = tl_smt.
o The llc_id calculation uses tl_smt->mask(), which will not contain
  CPUs 10-15, so CPU8 will get a unique LLC id even though there are
  other online CPUs in the LLC with a different llc_id (!!!)
Instead, if we traversed up to tl_mc, we would have seen all the online
CPUs in the MC domain and reused their llc_id. This might not be an issue
on its own, but if the root partition is removed later, CPU8 will keep
its unique llc_id even after merging back into the same MC domain.
[..snip..]
>>
>> It doesn't compact tl_max_llcs, but it should promote reuse of llc_id
>> if all CPUs of an LLC go offline. I know it is a ridiculous scenario,
>> but it is possible nonetheless.
>>
>> I'll let Peter and Valentin be the judge of additional space and
>> complexity needed for these bits :-)
>>
>
> Smart approach! Dynamically reallocating the llc_id should be feasible,
> as it releases the llc_id when the last CPU of that LLC is offlined. My
> only concern is data synchronization issues arising from the reuse of
> llc_id during load balancing; I'll audit the logic to check for any race
> conditions. Alternatively, what if we introduce a tl->static_mask? It
> would be similar to tl->mask, but CPUs would not be removed from
> static_mask when they are offlined. This way, we can always find and
> reuse the llc_id of CPUs in that LLC (even if all CPUs in the LLC have
> been offlined at some point, provided they were once online), and we
> would thus maintain a static llc_id.
That is possible, but it would require a larger audit across arch/ to add
support for it. It might be less complex to handle in the generic layer,
but again I'll let Peter and Valentin comment on this part :-)
>
> Anyway, let me do some testing on your proposal as well as the
> static_mask idea, and I'll reply to this thread later. Thanks for the
> insights!
Thanks a ton! Much appreciated.
--
Thanks and Regards,
Prateek