Re: [PATCH v3 00/21] Cache Aware Scheduling
From: Peter Zijlstra
Date: Thu Feb 19 2026 - 09:42:16 EST
On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> On 02/10/26 14:18, Tim Chen wrote:
> > This patch series introduces infrastructure for cache-aware load
> > balancing, with the goal of co-locating tasks that share data within
> > the same Last Level Cache (LLC) domain. By improving cache locality,
> > the scheduler can reduce cache bouncing and cache misses, ultimately
> > improving data access efficiency. The design builds on the initial
> > prototype from Peter [1].
> >
> > This initial implementation treats threads within the same process as
> > entities that are likely to share data. During load balancing, the
>
> This is a very aggressive assumption. From what I've seen, only a few tasks
> truly share data. Lumping everything in a process together is an easy way to
> classify them, but I think we can do better.
Not without more information. And that is something we can always add
later. But like you well know, it is an uphill battle to get programs to
explain/annotate themselves.
The alternative is sampling things using the PMU to see which process is
trying to access which data, but that too is non-trivial, not to mention
it will get people really upset about consuming PMU resources.
Starting things with a simple assumption is fine. This can always be
extended. Gotta start somewhere and all that. It currently groups things
by mm_struct, but it would be fairly straightforward to allow userspace
to group tasks manually.
> > scheduler attempts to aggregate such threads onto the same LLC domain
> > whenever possible.
>
> I admit I have yet to look fully at the series. But I must ask, why are you
> deferring to load balance and not looking at the wake up path? LB should be
> for corrections. When the wake up path is making the wrong decision all the
> time, LB (which is super slow to react) is too late to start grouping tasks,
> no? What am I missing?
There used to be wakeup steering, but I'm not sure that still exists in
this version (still need to read beyond the first few patches). It isn't
hard to add.
But I think Tim and Chen have mostly been looking at 'enterprise'
workloads.
> In my head Core Scheduling is already doing what we want. We just need to
> extend it to be a bit more relaxed (best effort rather than completely strict
> for security reasons today). This will be a lot more flexible and will allow
> tasks to be co-located from the get-go. And it will defer the responsibility of
> tagging to userspace. If they do better or worse, it's on them :) It seems you
> already hit a corner case where the grouping was a bad idea and doing some
> magic with thread numbers to alleviate it.
No, Core scheduling does completely the wrong thing. Core scheduling is
set up to do co-scheduling, because that's what was required for that
whole speculation trainwreck. And that is very much not what you want or
need here.
You simply want a preference to co-locate things that use the same data.
Which really is a completely different thing.
> FWIW I have come across cases in the mobile world where co-locating on a
> cluster or a 'big' core with a big L2 cache can benefit a small group of
> tasks. So the concept is generally beneficial as cache hierarchies are not
> symmetrical in more systems now. Even on symmetrical systems, a case can be
> made that two small data-dependent tasks can benefit from packing on a
> single CPU.
Sure, we all know this. pipe-bench is a prime example: it flies if you
co-locate the two tasks on the same CPU. It tanks if you pull them apart
(except onto SMT siblings, those are mostly good too).
> I know this changes the direction being made here; but I strongly believe the
> right way is to extend wake up path rather than lump it solely in LB (IIUC).
You're really going to need both, and LB really is the more complicated
part. On a busy/loaded system, LB will completely wreck things for you
if it doesn't play ball.