Re: [PATCH v3 00/21] Cache Aware Scheduling
From: Tim Chen
Date: Thu Feb 19 2026 - 16:47:49 EST
On Thu, 2026-02-19 at 19:48 +0000, Qais Yousef wrote:
> On 02/19/26 15:41, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > On 02/10/26 14:18, Tim Chen wrote:
> > > > This patch series introduces infrastructure for cache-aware load
> > > > balancing, with the goal of co-locating tasks that share data within
> > > > the same Last Level Cache (LLC) domain. By improving cache locality,
> > > > the scheduler can reduce cache bouncing and cache misses, ultimately
> > > > improving data access efficiency. The design builds on the initial
> > > > prototype from Peter [1].
> > > >
> > > > This initial implementation treats threads within the same process as
> > > > entities that are likely to share data. During load balancing, the
> > >
> > > This is a very aggressive assumption. From what I've seen, only a few tasks
> > > truly share data. Lumping everything in a process together is an easy way to
> > > classify, but I think we can do better.
> >
> > Not without more information. And that is something we can always add
> > later. But like you well know, it is an uphill battle to get programs to
> > explain/annotate themselves.
>
> Yes. I think we should be able to build a daemon that profiles a workload on
> a machine and comes up with a recommendation of tasks that have data
> co-dependency.
>
> Note I am strongly against programs specifying this themselves. We need to
> provide a service that helps with the correct tagging - i.e., it should be an
> admin-only operation.
>
> >
> > The alternative is sampling things using the PMU, seeing which process is
> > trying to access which data, but that too is non-trivial, not to mention
> > it will get people really upset about consuming PMU resources.
>
> I was hoping we could tell with perf which data structures are shared between
> tasks?
>
> I am thinking this is not something that needs to run continuously, but
> something discovered as a one-off on a machine, or once per update. The
> profiling can be done once (on demand) I believe.
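For what it's worth, perf can already get partway there today. A one-off
profile with perf c2c shows which cache lines bounce between CPUs and which
processes are touching them, which could seed that kind of recommendation.
Roughly (the exact flags below are just an illustration):

  # system-wide capture of cache-to-cache transfers for 10 seconds
  perf c2c record -a -- sleep 10
  # report the contended cache lines and the PIDs touching them
  perf c2c report --stdio
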
>
> Still, if someone really wants to tag all the tasks of a process to stay
> together, I think this is fine if that's what they want.
I can envision tagging tasks with the same cookie, analogous to what we are
doing for core scheduling. Or grouping tasks by tagging a cgroup.
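
For reference, core scheduling already exposes its tagging side through
prctl(), and a cache-affinity tag could look very similar from userspace.
A minimal sketch using the existing core-scheduling interface (a
cache-flavored prctl command would be new; only PR_SCHED_CORE exists today):

  #include <sys/prctl.h>

  /*
   * Give the calling thread group a fresh core-scheduling cookie.
   * A cache-affinity variant could plausibly mirror this interface,
   * but no such prctl command exists in this series.
   */
  static int tag_thread_group(void)
  {
          return prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                       PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0);
  }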
>
> >
> > Starting things with a simple assumption is fine. This can always be
> > extended. Gotta start somewhere and all that. It currently groups things
> > by mm_struct, but it would be fairly straightforward to allow userspace
> > to group tasks manually.
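To make the current grouping rule concrete: the key really is just the mm.
A rough sketch of the idea, with an illustrative helper name, not the actual
code in this series:

  #include <linux/sched.h>

  /*
   * Illustrative only: threads are assumed to share data iff they share
   * an mm_struct.  Kernel threads have no mm and are never grouped.
   * tasks_share_data() is a made-up name, not a helper in this series.
   */
  static inline bool tasks_share_data(struct task_struct *a,
                                      struct task_struct *b)
  {
          return a->mm && a->mm == b->mm;
  }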
> >
> > > > scheduler attempts to aggregate such threads onto the same LLC domain
> > > > whenever possible.
> > >
> > > I admit I have yet to look fully at the series. But I must ask: why are you
> > > deferring to load balancing and not looking at the wakeup path? LB should be
> > > for corrections. When the wakeup path makes the wrong decision all the time,
> > > isn't LB (which is super slow to react) too late to start grouping tasks?
> > > What am I missing?
> >
> > There used to be wakeup steering, but I'm not sure that still exists in
> > this version (still need to read beyond the first few patches). It isn't
> > hard to add.
> >
> > But I think Tim and Chen have mostly been looking at 'enterprise'
> > workloads.
> >
> > > In my head Core Scheduling is already doing what we want. We just need to
> > > extend it to be a bit more relaxed (best effort rather than completely strict
> > > for security reasons today). This will be a lot more flexible and will allow
> > > tasks to be co-located from the get-go. And it will defer the responsibility of
> > > tagging to userspace. If they do better or worse, it's on them :) It seems you
> > > already hit a corner case where the grouping was a bad idea, and you are doing
> > > some magic with thread counts to alleviate it.
> >
> > No, Core scheduling does completely the wrong thing. Core scheduling is
> > set up to do co-scheduling, because that's what was required for that
> > whole speculation trainwreck. And that is very much not what you want or
> > need here.
> >
> > You simply want a preference to co-locate things that use the same data.
> > Which really is a completely different thing.
>
> Hmm. Isn't the infra the same? We have a group of tasks tagged with a cookie
> that needs to be co-located. Core scheduling is strict about keeping them on
> the same physical core, but couldn't the concept be extended to co-locate on
> an LLC or the closest cache?
>
In my understanding, core scheduling doesn't try to place tasks with the
same cookie on the same core; rather, it guarantees that only tasks with the
same cookie can safely be scheduled together on the SMT siblings of a core.
However, we could certainly use a similar cookie mechanism to indicate that
tasks should be scheduled close to each other cache-wise.
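
To make that concrete, what I have in mind is a soft placement preference at
wakeup rather than core scheduling's hard exclusion rule. A sketch with
made-up names (cache_cookie and preferred_cpu are not fields in this series):

  /*
   * Illustrative only: steer a waking task toward the LLC where other
   * tasks with the same cookie recently ran.  Unlike core scheduling's
   * cookie, this is a placement hint, not an exclusion rule.
   * p->cache_cookie and its preferred_cpu field are hypothetical.
   */
  static int prefer_cookie_cpu(struct task_struct *p, int prev_cpu)
  {
          int hint = READ_ONCE(p->cache_cookie->preferred_cpu);

          /* no hint, or prev_cpu already shares the hinted LLC */
          if (hint < 0 || cpus_share_cache(hint, prev_cpu))
                  return prev_cpu;

          return hint;
  }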
> >
> > > FWIW I have come across cases in the mobile world where co-locating on a
> > > cluster or a 'big' core with a big L2 cache can benefit a small group of
> > > tasks. So the concept is generally beneficial, as cache hierarchies are not
> > > symmetrical on many systems now. Even on symmetrical systems, a case can be
> > > made where two small data-dependent tasks can benefit from packing on a
> > > single CPU.
> >
> > Sure, we all know this. pipe-bench is a prime example: it flies if you
> > co-locate the tasks on the same CPU. It tanks if you pull them apart (except
> > onto SMT siblings, those are mostly good too).
>
> +1
>
> >
> > > I know this changes the direction being taken here, but I strongly believe
> > > the right way is to extend the wakeup path rather than lump it solely into
> > > LB (IIUC).
> >
> > You're really going to need both, and LB really is the more complicated
> > part. On a busy/loaded system, LB will completely wreck things for you
> > if it doesn't play ball.
>
> Yes, I wasn't advocating for the wakeup path only, of course. But I haven't
> read all the details, and I saw no wakeup handling done.
>
> And generally, as I think I have been indicating here and there, we do need
> to unify the wakeup and LB decision trees. With push load balancing this
> unification becomes a piece of cake if the wakeup path already handles the
> case. The current LB is a big beast, and it will be slow to react on many
> systems.
I think as long as we have up-to-date load information at the time of the
push in push load balancing, so that we don't cause over-aggregation and too
much load imbalance, it will be viable to do such aggregation at wakeup.
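
Roughly the kind of guard I mean, with made-up helpers (llc_load() and
llc_capacity() are not real functions):

  /*
   * Illustrative only: allow wakeup-time aggregation onto the group's
   * preferred LLC only while that LLC has spare capacity, so we do not
   * over-aggregate and create an imbalance that the push load balancer
   * must immediately undo.  The 20% headroom threshold is arbitrary.
   */
  static bool aggregation_ok(int dst_cpu, unsigned long task_load)
  {
          unsigned long load = llc_load(dst_cpu);
          unsigned long cap  = llc_capacity(dst_cpu);

          return load + task_load < cap - cap / 5;
  }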
Tim