Re: [PATCH 00/19] Cache Aware Scheduling

From: Madadi Vineeth Reddy

Date: Wed Oct 15 2025 - 14:27:31 EST

On 15/10/25 11:08, Chen, Yu C wrote:
> On 10/15/2025 5:48 AM, Tim Chen wrote:
>> On Tue, 2025-10-14 at 17:43 +0530, Madadi Vineeth Reddy wrote:
>>> Hi Tim,
>>> Thanks for the patch.
>>>
>>> On 11/10/25 23:54, Tim Chen wrote:
>
> [snip]
>
>>>> [Genoa details]
>>>> [ChaCha20-xiangshan]
>>>> ChaCha20-xiangshan is a simple benchmark using a static build of an
>>>> 8-thread Verilator of XiangShan(RISC-V). The README file can be
>>>> found here[2]. The score depends on how aggressively the user sets
>>>> /sys/kernel/debug/sched/llc_aggr_tolerance. Using the default values,
>>>> not much difference is observed. When setting
>>>> /sys/kernel/debug/sched/llc_aggr_tolerance to 100, a 44% improvement is
>>>> observed.
>>>>
>>>> baseline:
>>>> Host time spent: 50,868ms
>>>>
>>>> sched_cache:
>>>> Host time spent: 28,349ms
>>>>
>>>> The time has been reduced by 44%.
>>>
>>> Milan showed no improvement across all benchmarks, which could be due to the
>>> CCX topology (8 CCXs × 8 CPUs) where the LLC domain is too small for this
>>> optimization to be effective. Moreover there could be overhead due to additional
>>> computations.
>>>
>>> The ChaCha20-xiangshan improvement on Genoa with llc_aggr_tolerance set to 100 seems
>>> to be due to the relatively low thread count. Please provide the numbers
>>> with default values too. I would also like to see numbers for varying loads.
>>
>> I'll ask Chen Yu who did the Xiangshan experiments if he has those numbers.
>>
>
> Madadi, do you mean the performance score number or active thread number
> when llc_aggr_tolerance is set to 1(default)?
> The score is around with sched_cache and llc_aggr_tolerance set to 1.
> The active number is 128 per process, and there are 8 processes when
> launching the benchmark. I suppose the 128 comes from the number
> of online CPUs. Please let me know if you need more data.
>
> Cced Yangyu who's the author of this benchmark.

I mean the benchmark result with the default value of llc_aggr_tolerance on Genoa,
compared to the baseline. Knowing the number of threads also helps to understand
the impact.
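
To make the request concrete, here is a minimal sketch of the two numbers in question: the relative reduction in host time against the baseline, and the thread count of a process read from /proc/<pid>/task (the same information as the ls | grep command quoted below). The helper names are hypothetical and only for illustration; the host times are the ones quoted above for llc_aggr_tolerance set to 100, since the default-value result is not available in this thread.

```python
import os


def relative_reduction(baseline_ms: float, patched_ms: float) -> float:
    """Fraction by which the run time dropped relative to the baseline."""
    return (baseline_ms - patched_ms) / baseline_ms


def count_threads(pid: int) -> int:
    """Count entries under /proc/<pid>/task, one per thread (Linux only)."""
    return sum(1 for name in os.listdir(f"/proc/{pid}/task") if name.isdigit())


if __name__ == "__main__":
    # Host times quoted earlier in the thread (baseline vs. sched_cache).
    print(f"host time reduction: {relative_reduction(50868, 28349):.1%}")
    # Thread count of this interpreter; substitute a benchmark PID as needed.
    print(f"threads: {count_threads(os.getpid())}")
```

Running the same comparison for the default llc_aggr_tolerance run would give the number I am asking for.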

>
> ls -l /proc/14460/task/ | grep -c '^d'
> 128
>
>>>
>>> On Power 10 and Power 11, the LLC size is 4 threads, which is even smaller. I am not
>>> expecting improvements here but will run some workloads and share the data.
>>>
>>> I have not gone through the entire series yet, but how is a situation handled where,
>>> say on a two-node NUMA system, a task's preferred LLC is on the wrong NUMA node for
>>> its memory? Which takes precedence?
>>
>> We take the preferred NUMA node into consideration, but we do not force the task to
>> go to the preferred node.
>>
>> I remember that initially we limited the consideration to only LLCs in the
>> preferred node. But we encountered regressions in hackbench and schbench,
>> because when the preferred node did not have any occupancy, the preferred LLC
>> was set to -1 (no preference), which resulted in extra task migrations.
>> Also, the preferred node for hackbench and schbench was volatile
>> as they have a small memory footprint. Chen Yu, please chime in if there
>> were other reasons you remember.
>>
>
> Since the preferred NUMA node is per task, while the preferred LLC
> is per process, scanning only the current task's preferred node
> would lead to cross-node migration. This is because the process's
> preferred LLC may not reside within the current task's preferred
> node. Such a scenario could leave curr_m_a_occ at 0, and any LLC
> with an occupancy > 0 would then trigger a preferred LLC switch.

Understood. Thanks for the context.
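
To check my understanding, here is a toy model of the scenario described above. It is a sketch under my own assumptions and not the code from the series: the function name, the dictionaries, and the selection rule (switch when some scanned LLC has higher occupancy than curr_m_a_occ) are hypothetical; only the name curr_m_a_occ comes from this thread.

```python
def pick_preferred_llc(occupancy, llc_to_node, current_llc, scan_node=None):
    """Toy model of the per-process preferred-LLC choice (not kernel code).

    occupancy:   dict of LLC id -> occupancy of the process on that LLC
    llc_to_node: dict of LLC id -> NUMA node id
    current_llc: the process's current preferred LLC
    scan_node:   if given, scan only LLCs in this node (a task's preferred node)
    """
    scanned = [llc for llc in occupancy
               if scan_node is None or llc_to_node[llc] == scan_node]
    # Occupancy of the current preferred LLC as seen by this scan. It stays 0
    # when that LLC lies outside the scanned node.
    curr_m_a_occ = occupancy[current_llc] if current_llc in scanned else 0
    best = max(scanned, key=lambda llc: occupancy[llc], default=None)
    if best is not None and occupancy[best] > curr_m_a_occ:
        return best  # preferred LLC switches
    return current_llc


if __name__ == "__main__":
    occupancy = {0: 90, 1: 5, 2: 10, 3: 0}
    llc_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
    # Scanning all LLCs keeps the heavily used LLC 0.
    print(pick_preferred_llc(occupancy, llc_to_node, current_llc=0))
    # Scanning only node 1 never sees LLC 0, so any occupancy > 0 wins.
    print(pick_preferred_llc(occupancy, llc_to_node, current_llc=0, scan_node=1))
```

In this model the restricted scan moves the preferred LLC from node 0 to node 1 even though LLC 0 has far higher occupancy, which matches the cross-node migration described above.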

>
>> We'll need to revisit this part of the code to take care of such
>> corner cases. I think ideally we should move tasks to the least loaded LLC
>> in the preferred node (even if no LLCs have occupancy in the preferred node),
>> as long as the preferred NUMA node doesn't change too often.
>>
>>
>
> Then we might need to introduce a new member in mm_struct to store the old
> occupancy, curr_m_a_occ, so that we can reliably compare the old and new
> occupancy - to avoid the 0 value of curr_m_a_occ.
>
>>>
>>> Also, what about the workloads that don't share data like stress-ng?
>>>
>
> Stream is a single process stressing memory without any shared
> data; we did not observe any difference on stream. We can launch more
> tests on stress-ng.
>

That would be helpful.

Thanks,
Madadi Vineeth Reddy

> thanks,
> Chenyu
>
>> We can test those. Ideally the controls to prevent over aggregation to preferred LLC
>> would keep stress-ng happy.
>>
>>> It will
>>> be good to make sure that most other workloads don't suffer. As mentioned,
>>> a per-process knob for llc_aggr_tolerance could help.
>>
>> Agreed. We are planning to add a per-process knob in the next version. One thought is to use
>> prctl. Any other suggestions are welcome.
>>
>