Re: [PATCH v3 00/21] Cache Aware Scheduling
From: Chen, Yu C
Date: Fri Feb 20 2026 - 21:49:07 EST
On 2/20/2026 11:25 AM, Qais Yousef wrote:
On 02/19/26 23:07, Chen, Yu C wrote:
Hi Peter, Qais,
On 2/19/2026 10:41 PM, Peter Zijlstra wrote:
On Thu, Feb 19, 2026 at 02:08:28PM +0000, Qais Yousef wrote:
On 02/10/26 14:18, Tim Chen wrote:
[ ... ]
became much more stable, so we kept it as is. The wakeup path more or less
aggregates the wakees (threads within the same process) within the LLC in the
wakeup fast path, so we have not changed it for now.
How expensive is it to use the new push lb, which unifies the decision with the
wakeup path, to detect this bad task placement and steer the tasks back to the
right LLC? I think if we can construct the trigger right, we can make it much
easier for load balance to keep tagged tasks within the same LLC. In my view
this bad task placement is just a new type of misfit, where a task has strayed
from its group for whatever reason at wake up and is not sleeping and waking
up again to be placed back with its clan - assuming the conditions have changed
to warrant the move - which the wakeup path should handle anyway.
FWIW, I have been experimenting with using push lb to keep regular LB off and
relying solely on it to manage the important corner cases (including the
overloaded one) - and am seeing *very* promising results. But the systems
I work with are small compared to yours.
But essentially, if we can construct the system so that the wakeup path (via the
regular sleep/wakeup cycle and push lb) keeps the system relatively balanced,
and delay regular LB for when we need a large intervention, we can simplify
the problem space significantly IMHO. If the LB has to kick in, then the delays
from not finding enough bandwidth to run are larger than the delays from not
sharing the hottest LLC. IOW, keep the regular LB as-is for true load balance
and handle the small exceptions via the natural sleep/wakeup cycle or push lb.
Leveraging push-lb for cache-aware task placement is interesting,
and we considered it at LPC when Vincent and Prateek presented it.
It could be an enhancement to the basic cache-aware scheduling, IMO.
Tim has mentioned that in
https://lore.kernel.org/all/4514b6aef56d0ae144ebd56df9211c6599744633.camel@xxxxxxxxxxxxxxx/
a bouncing issue needs to be resolved if task wakeup and push-lb are
leveraged for cache-aware scheduling. Both paths are very fast, so for
cache-aware scheduling it is possible that multiple invocations of
select_idle_sibling() will find the same LLC suitable. Then multiple wakees
are woken up on that LLC, causing over-aggregation. Later, when
over-aggregation is detected, several tasks are migrated out of the LLC,
which makes the LLC eligible again - and the pattern repeats back and forth.
Let me copy the changelog from the previous patch version:
"
In previous versions, aggregation of tasks was done in the
wakeup path, without making the load-balancing paths aware of
LLC (Last-Level-Cache) preference. This led to the following
problems:
1) Aggregation of tasks during wake up led to load imbalance
between LLCs
2) Load balancing tried to even out the load between LLCs
3) Wakeup task aggregation happened at a faster rate, and
load balancing moved tasks in the opposite direction, leading
to continuous and excessive task migrations and regressions
in benchmarks like schbench.
Note this is an artefact of tagging all tasks belonging to the process as
co-dependent. So somehow this is a case of shooting oneself in the foot,
because processes with a large number of tasks will create large imbalances and
will start to require special handling. I guess the question is: were they
really packed so tightly that the steering logic needed to relax a little and
say, hey, this is overcommit, I must spill to the other LLCs? Or was it really
okay to pack them all in one LLC, and the LB was overzealous to kick in and
needed to be made aware that the new case is not really a problem requiring its
intervention?
In this version, load balancing is made cache-aware. The main
idea of cache-aware load balancing consists of two parts:
I think this might work under the conditions you care about, but it will be
hard to generalize. I might need to go and read more, though.
Note I am mainly concerned because the wakeup path can't stay based purely on
load forever and needs to be able to make smarter decisions (latency being the
most important one on the horizon). And they will all hit this problem. I think
we need to find a good recipe for how to handle these problems in general.
I don't think we can extend the LB to be energy aware, latency aware, cache
aware, etc. without hitting a lot of hurdles. And it is too slow to react.
1) Identify tasks that prefer to run on their hottest LLC and
move them there.
2) Prevent generic load balancing from moving a task out of
its hottest LLC.
Isn't this 2nd part the fix for the wakeup problem you faced? Part 1 should
naturally happen at wake up. And for random long-running strayed tasks,
I believe push lb is an easier way to manage them.
This is doable, and some logic needs to be added in wakeup/push lb to
avoid the bouncing issue mentioned above. Considering both where to do it
(task wakeup, push lb, or generic lb) and the task tagging, I was thinking that
creating threads within one process is a special case of tagging:
if the user chooses to create threads rather than fork new processes,
isn't there a higher potential for data sharing among those threads? However,
we agree that fine-grained tagging is necessary. How about this: if the
user explicitly tags tasks into a single group, the kernel can perform
aggressive task aggregation - for instance, in the wakeup/fair-push path - and
let the user accept the corresponding risks. For the default model, generic
load balancing can perform per-process task aggregation at a slower pace to
reduce the risk of false decisions and over-aggregation. We intended to discuss
this in a separate thread, though.
Thanks,
Chenyu