Re: [Patch v4 00/22] Cache aware scheduling
From: Qais Yousef
Date: Wed Apr 15 2026 - 20:32:44 EST
On 04/01/26 14:52, Tim Chen wrote:
> This patch series introduces infrastructure for cache-aware load
> balancing, with the goal of co-locating tasks that share data within
> the same Last Level Cache (LLC) domain. By improving cache locality,
> the scheduler can reduce cache bouncing and cache misses, ultimately
> improving data access efficiency. The design builds on the initial
> prototype from Peter [1].
>
> This initial implementation treats threads within the same process
> as entities that are likely to share data. During load balancing, the
> scheduler attempts to aggregate such threads onto the same LLC domain
> whenever possible.
>
> Most of the feedback received on v3 has been addressed. Some aspects
> could be enhanced later after the basic cache-aware portion has landed:
>
> There were discussions around grouping tasks using mechanisms other
> than process membership. While we agree that more flexible grouping
> is desirable, this series intentionally focuses on establishing basic
> process-based grouping first, with alternative grouping mechanisms to
> be explored in a follow-on series.
>
> There was also discussion in v3 suggesting that the task wakeup path should
> be used to perform cache-aware scheduling. According to previous test
> results, performing task aggregation in the wakeup path introduced task
> migration bouncing, primarily because the wakeup path does not have
> up-to-date LLC load information. That led to over-aggregation that needed
> to be corrected later during load balancing. The load balancing path was
> therefore chosen as the conservative place to perform task aggregation. The
> task wakeup path will be investigated as a future enhancement.
I posted the schedqos announcement yesterday, which I think (hope) would be
the right way to address these concerns about tagging tasks.
https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
It would be trivial to add an experimental branch introducing a new QoS
flavour, say NUMA_SENSITIVE, etc. I am still trying to come up with a generic
description that addresses a number of use cases (see Execution Profiles in
README.md), not just this particular NUMA-sensitive one. But the experimental
branch should help iterate and drive the kernel development towards the
wakeup path + push lb, instead of using load balance, which I really doubt
will work well in practice: it is slow to react, and you're relying on
overcommitting the system by default by making every task of every process
data dependent and requiring it to be co-located. I think in practice admins
will care about keeping specific applications within a single LLC, and if
they are willing to spend the effort, they can tag the specific tasks of a
specific application.
We are trying to make sure we have one coherent story for defining these
types of QoS requirements, and to delegate these decisions/policies to
userspace. The current line of thinking is that wakeup + push lb should be
generally good enough to address the various placement requirements. I
understood you believe the same. If not, it would be good to know so we can
think about how to generalize further.
Also, QoS IMHO should be viewed as a scarce resource. For best-effort
delivery (which is the best we can do in reality; this is not a hard
real-time system), it is easier to provide a good best effort when the
average noise level is low, ie: when few tasks are required to be kept within
the same LLC. If we overcommit often, we will crumble often. So IMHO the key
is to delegate tagging to userspace and make it take responsibility for
handling potential overcommit: deciding which workload is really important to
tag, and which one can be let go of or moved to another machine to get the
desired perf/latencies.
For the kernel interface to tag tasks and set a cookie, I plan to rebase and
repost [1], as I need it for the rampup multiplier to help counter DVFS
related latencies and slow migration on HMP systems. The idea was for it to
be generic and flexible enough to add whatever is needed; in this case I
think we need to add a QoS that tells the scheduler these tasks are data
co-dependent, with a unique cookie, which implies that they need to stay
within the same LLC. It could be extended to help keep these tasks within the
same L2 or L1 if that makes sense (ie: they are small and can be packed on
the same CPU).
I am not opposed to merging this first, but I think the load-balance based
approach is wrong, and if merged it must be removed later in favour of the
wakeup + push lb one with userspace-based tagging.
Based on what Peter said, the wakeup path is trivial to add and push lb is
almost ready, and I hope we now have the tools to auto-tag processes/tasks,
so we could potentially try to work on this approach first instead.
[1] https://lore.kernel.org/lkml/20240820163512.1096301-11-qyousef@xxxxxxxxxxx/