Re: [PATCH v3 00/21] Cache Aware Scheduling

From: Qais Yousef

Date: Thu Feb 19 2026 - 22:25:37 EST


On 02/19/26 23:07, Chen, Yu C wrote:
> Hi Peter, Qais,
>
> On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
> > On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
> > > On 02/10/26 14:18, Tim Chen wrote:
>
> [ ... ]
>
> > >
> > > I admit I have yet to look fully at the series. But I must ask, why are
> > > you deferring to load balance and not looking at the wake up path? LB
> > > should be for corrections. When the wake up path is making the wrong
> > > decision all the time, LB (which is super slow to react) is too late to
> > > start grouping tasks? What am I missing?
> >
> > There used to be wakeup steering, but I'm not sure that still exists in
> > this version (still need to read beyond the first few patches). It isn't
> > hard to add.
> >
>
> Please let me explain a little more about why we did this in the
> load balance path. Yes, the original version implemented cache-aware
> scheduling only in the wakeup path. According to our testing, this appeared
> to cause some task bouncing issues across LLCs. This was due to conflicts
> with the legacy load balancer, which tries to spread tasks across different
> LLCs.
> So, as Peter said, the load balancer needs to be taken care of anyway.
> Later, we kept only the cache-aware logic in the load balancer, and the
> test results

Yes, we need both. My concern is that the wake up path was originally meant to
keep tasks placed correctly, as most tasks wake up and sleep often and this is
the common case. If the decision tree is not unified, we will have problems.
And this is not a problem specific to doing placement based on memory
dependency. We need to extend the wake up path to do placement based on
latency. Placement based on energy (EAS) has the same problem too. It disabled
LB altogether, which is a problem we are trying to fix, if you saw the other
discussion about overutilized handling. The load balancer can destroy energy
balance easily and has no notion of how to distribute based on energy. This is
a recurring theme for any new task placement decision that is not purely based
on load. The LB will wreak havoc.

> became much more stable, so we kept it as is. The wakeup path more or less
> aggregates the wakees (threads within the same process) within the LLC in
> the wakeup fast path, so we have not changed it for now.

How expensive is it to use the new push lb, which unifies the decision with
the wake up path, to detect these bad task placements and steer them back to
the right LLC? I think if we can construct the trigger right, we can simplify
the load balancer to keep tagged tasks within the same LLC much more easily.
In my view this bad task placement is just a new type of misfit, where a task
has strayed from its group for whatever reason at wake up, and it is not
sleeping and waking up again to be placed back with its clan - assuming the
conditions have changed to warrant the move - which the wake up path should
handle anyway.

FWIW, I have been experimenting with using push lb to keep regular LB off and
relying solely on it to manage the important corner cases (including the
overloaded one) - and seeing *very* promising results. But the systems I work
with are small compared to yours.

But essentially, if we can construct the system so that the wakeup path (via
the regular sleep/wakeup cycle and push lb) keeps the system relatively
balanced, and delay regular LB for when we need a large intervention, we can
simplify the problem space significantly IMHO. If the LB had to kick in, then
the delays of not finding enough bandwidth to run are larger than the delays
of not sharing the hottest LLC. IOW, keep the regular LB as-is for true load
balancing and handle the small exceptions via the natural sleep/wakeup cycle
or push lb.

>
> Let me copy the changelog from the previous patch version:
>
> "
> In previous versions, aggregation of tasks was done in the
> wake up path, without making load balancing paths aware of
> LLC (Last-Level-Cache) preference. This led to the following
> problems:
>
> 1) Aggregation of tasks during wake up led to load imbalance
> between LLCs
> 2) Load balancing tried to even out the load between LLCs
> 3) Wake up tasks aggregation happened at a faster rate and
> load balancing moved tasks in opposite directions, leading
> to continuous and excessive task migrations and regressions
> in benchmarks like schbench.

Note this is an artefact of tagging all tasks belonging to the process as
co-dependent. So somehow this is a case of shooting oneself in the foot,
because processes with a large number of tasks will create large imbalances
and will start to require special handling. I guess the question is: were they
really that packed, which means the steering logic needed to relax a little
bit and say hey, this is an overcommit, I must spill to the other LLCs? Or was
it really okay to pack them all in one LLC, and LB was overzealous to kick in
and needed to be made aware that the new case is not really a problem that
requires its intervention?

>
> In this version, load balancing is made cache-aware. The main
> idea of cache-aware load balancing consists of two parts:

I think this might work under the conditions you care about, but it will be
hard to generalize. I might need to go and read more though.

Note I am mainly concerned because the wake up path can't stay based purely on
load forever and needs to be able to make smarter decisions (latency being the
most important one on the horizon). And they will all hit this problem. I
think we need to find a good recipe for how to handle these problems in
general. I don't think we can extend the LB to be energy aware, latency aware,
cache aware etc. without hitting a lot of hurdles. And it is too slow to
react.

>
> 1) Identify tasks that prefer to run on their hottest LLC and
> move them there.
> 2) Prevent generic load balancing from moving a task out of
> its hottest LLC.

Isn't this 2nd part the fix to the wake up problem you faced? Part 1 should
naturally be happening at wake up, and for random long-running strayed tasks,
I believe push lb is an easier way to manage them.