Re: [PATCH v3 00/21] Cache Aware Scheduling
From: Qais Yousef
Date: Thu Feb 19 2026 - 14:49:16 EST
On 02/19/26 15:41, Peter Zijlstra wrote:
> On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > On 02/10/26 14:18, Tim Chen wrote:
> > > This patch series introduces infrastructure for cache-aware load
> > > balancing, with the goal of co-locating tasks that share data within
> > > the same Last Level Cache (LLC) domain. By improving cache locality,
> > > the scheduler can reduce cache bouncing and cache misses, ultimately
> > > improving data access efficiency. The design builds on the initial
> > > prototype from Peter [1].
> > >
> > > This initial implementation treats threads within the same process as
> > > entities that are likely to share data. During load balancing, the
> >
> > This is a very aggressive assumption. From what I've seen, only few tasks truly
> > share data. Lumping everything in a process together is an easy way to
> > classify, but I think we can do better.
>
> Not without more information. And that is something we can always add
> later. But like you well know, it is an uphill battle to get programs to
> explain/annotate themselves.
Yes. I think we should be able to build a daemon that profiles a workload
on a machine and produces a recommendation of tasks that have data
co-dependency.
Note I'm strongly against programs specifying this themselves. We need to
provide a service that helps with the correct tagging - ie: it should be an
admin-only operation.
>
> The alternative is sampling things using the PMU, see which process is
> trying to access which data, but that too is non-trivial, not to mention
> it will get people really upset for consuming PMU resources.
I was hoping we could tell with perf which data structures are shared between
tasks?
I am thinking this is not something that needs to run continuously, but
something discovered one time on a machine, or once per update. The profiling
can be done once (on demand), I believe.
Still, if someone really wants to tag all the tasks of a process to stay
together, that's fine if that's what they want.
>
> Starting things with a simple assumption is fine. This can always be
> extended. Gotta start somewhere and all that. It currently groups things
> by mm_struct, but it would be fairly straight forward to allow userspace
> to group tasks manually.
>
> > > scheduler attempts to aggregate such threads onto the same LLC domain
> > > whenever possible.
> >
> > I admit yet to look fully at the series. But I must ask, why are you deferring
> > to load balance and not looking at wake up path? LB should be for corrections.
> > When wake up path is doing wrong decision all the time, LB (which is super slow
> > to react) is too late to start grouping tasks? What am I missing?
>
> There used to be wakeup steering, but I'm not sure that still exists in
> this version (still need to read beyond the first few patches). It isn't
> hard to add.
>
> But I think Tim and Chen have mostly been looking at 'enterprise'
> workloads.
>
> > In my head Core Scheduling is already doing what we want. We just need to
> > extend it to be a bit more relaxed (best effort rather than completely strict
> > for security reasons today). This will be a lot more flexible and will allow
> > tasks to be co-located from the get-go. And it will defer the responsibility of
> > tagging to userspace. If they do better or worse, it's on them :) It seems you
> > already hit a corner case where the grouping was a bad idea and doing some
> > magic with thread numbers to alleviate it.
>
> No, Core scheduling does completely the wrong thing. Core scheduling is
> set up to do co-scheduling, because that's what was required for that
> whole speculation trainwreck. And that is very much not what you want or
> need here.
>
> You simply want a preference to co-locate things that use the same data.
> Which really is a completely different thing.
Hmm. Isn't the infra the same? We have a group of tasks tagged with a cookie
that need to be co-located. Core scheduling is strict in keeping them on the
same physical core, but couldn't the concept be extended to co-locate on the
LLC or closest cache?
>
> > FWIW I have come across cases on mobile world were co-locating on a cluster or
> > a 'big' core with big L2 cache can benefit a small group of tasks. So the
> > concept is generally beneficial as cache hierarchies are not symmetrical in
> > more systems now. Even on symmetrical systems, there can be cases made where
> > two small data dependent task can benefit from packing on a single CPU.
>
> Sure, we all know this. pipe-bench is a prime example, it flies if you
> co-locate them on the same CPU. It tanks if you pull them apart (except
> SMT siblings, those are mostly good too).
+1
>
> > I know this changes the direction being made here; but I strongly believe the
> > right way is to extend wake up path rather than lump it solely in LB (IIUC).
>
> You're really going to need both, and LB really is the more complicated
> part. On a busy/loaded system, LB will completely wreck things for you
> if it doesn't play ball.
Yes, I wasn't advocating for the wakeup path only, of course. I haven't read
all the details yet, but I saw no wakeup handling.
And generally, as I think I have been indicating here and there, we do need to
unify the wakeup and LB decision trees. With push LB this unification becomes
a piece of cake if the wakeup path already handles the case. The current LB
is a big beast, and will be slow to react on many systems.