Re: [PATCH v3 00/21] Cache Aware Scheduling
From: Qais Yousef
Date: Mon Feb 23 2026 - 21:49:40 EST
On 02/20/26 10:43, Peter Zijlstra wrote:
> On Fri, Feb 20, 2026 at 03:29:41AM +0000, Qais Yousef wrote:
>
> > What's the reason wake up doesn't have the latest info? Is this a limitation of
> > these large systems where stats updates are too expensive to do? Is it not
> > fixable at all?
>
> Scalability is indeed the main problem. The periodic load-balancer, by
> virtue of being 'slow' has two advantages:
>
> - the cost of aggregating the numbers is amortized by the relative low
> frequency of aggregation
>
> - it can work with averages; it is less concerned with immediate
> spikes.
>
> This obviously has the exact inverse set of problems in that it is not
> able to deal with immediate/short term issues.
Yes. And if we are to focus on providing better task placement based on QoS
(which is what I think this essentially is), we have a constant problem of two
paths producing incompatible results. Which is why I am trying to stress the
importance of the wake up path. I understand that for this initial drop we
don't have a way to provide specific hints for tasks, but this is why we always
end up with these difficult choices - which I think we don't have to make.
More on this at the bottom.
>
>
> Anyway, we're already at the point where EAS wakeup path is getting far
> too expensive for the current set of hardware. While we started with a
> handful of asymmetric CPUs, we're now pushing 32 CPUs or so.
Is this 32 perf domains? Expensive for what workloads? Folks can still use the
performance governor and plug it into a wall if they want ;-)
>
> (Look at Intel Nova Lake speculation online, that's supposedly going to
> get us 2 dies of 8P+16E with another 4 bonus weaklings on the south
> bridge or something, for a grand total of 52 asymmetric CPUs of 3 kinds)
Not sure if my experience matters for whatever this is supposed to be used for,
but the cost of a wrong decision is really high on these topologies. It is
bloody worthwhile spending more time to select a better CPU, and worthwhile to
have the push lb do frequent corrections. Not sure if you saw the other thread
on one of Vincent's patches - but I am trying to completely disable
overutilized (or regular LB) and rely on wakeup + push lb, and I am seeing
great success (and gains). But I am carrying a number of improvements that
I discussed in various places on the list that make this setup effective.
Hopefully I'll share full findings properly at some point.
>
>
> Then consider:
>
> - Intel Granite Rapids-SP at 8*86 cores for 688 cores / 1376 threads.
>
> - AMD Prometheus at 2*192 cores with 384 cores / 768 threads. These
> are silly number of CPUs.
>
> - Power10, it is something like 16 sockets, 16 cores per socket, 8
> threads per core for a mere 2048 threads.
>
> Now, these are the extreme end of the spectrum systems, 'nobody' will
> actually have them, but in a few generations they'll seem small again.
>
>
> So whatever we build now, will have to deal with silly numbers of CPUs.
True, but I think we ought to bite the bullet at some point. My line of
thought is that we don't have to (and actually shouldn't) make the compromise
at the kernel level. We can define the problem such that it is opt-in/opt-out,
where users who find a benefit can opt in and users who find a disadvantage can
opt out. Now the difficulty is that we don't have a way to describe such
things, and this is what I am trying to solve with the Sched QoS library. I am
writing this now, but I think I should be able to help with this use case so
that users can describe which workloads want to benefit from co-location, and
these tasks will take the hit of harder task placement and frequent migration
under loaded scenarios - the contract being that co-location has a significant
enough performance benefit that they are happy to pay the price. Things that
didn't subscribe will work as-is.
Anyway, my major goal is to find how we can tie all these stories together, as
we need to add the ability to do task placement based on special requirements.
The conflict with LB is one major issue, and I think Vincent's proposal for
push lb is quite neat and spot on here. I am not sure if you saw our LPC talk
about Sched QoS, where we expanded on our overall thoughts.
In my view, this problem belongs to the same class of problems of placement
based on special requirements (latency, energy, cache, etc.), and hopefully we
can address it along the way. But if not, it would be good to know more so we
can think about how to better incorporate it as part of the bigger story.
So far I think that if this can be made to go through the wake up path and rely
on push lb, it is part of the same story. If not, then we need to think harder
about how to connect things together for a coherent approach.
If I can successfully give you a way to describe the requirement that tasks
need to be co-located, so that we don't have to make the assumption in the
kernel that tasks belonging to the same process need to stay in the same LLC,
do you think wake up + push lb works? If not, how do you see it evolving? And
more importantly, how do you view the role of regular LB in these cases? The
way I see it, it should trigger less often for the reasons you mentioned at the
top; and when it does trigger, it means heavy intervention is required, and
whatever special task placement requirements exist will need to be dropped at
that stage, since the push lb has clearly failed to keep up and we are at a
point where we need to do heavy handed balancing work. I think these activities
are more relevant to multi-LLC systems - which have the added problem of
defining when some imbalances are okay, which I believe is the difficulty being
hit here with the wakeup path based approach. For single LLC systems I think
this heavy handed approach can be made unnecessary if we do it correctly.
Sorry for diverging a bit. But I am interested in how we can all move the ship
in the same direction. I think this is all part of making the wake up path
multi-modal and improving its co-ordination with LB.