Re: [Patch v4 00/22] Cache aware scheduling

From: Qais Yousef

Date: Thu Apr 23 2026 - 11:10:26 EST

On 04/21/26 13:57, Tim Chen wrote:
> On Tue, 2026-04-21 at 01:34 +0100, Qais Yousef wrote:
> > On 04/20/26 17:01, Chen, Yu C wrote:
> > > On 4/16/2026 8:27 AM, Qais Yousef wrote:
> > > > On 04/01/26 14:52, Tim Chen wrote:
> > >
> > > [ ... ]
> > >
> > > >
> > > > I posted schedqos announcement yesterday, which I think (hope) would be the
> > > > right way to address these concerns about tagging tasks.
> > > >
> > > > https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
> > > >
> > >
>
> I think that's great. Will be a nice way to tag tasks that should be grouped
> and aggregated together.

To be clear, you can tag processes too, not just tasks. This is wasteful, but
it allows easy tagging and quick results.

>
> > > Thanks, I'll take a look at this.
> > >
> > > > It would be trivial to add experimental branch to add new QoS flavour to say
> > > > NUMA_SENSITIVE etc. I am still trying to think of a generic description to
> > > > address a number of use cases (see Execution Profiles in README.md), not just
> > > > this particular numa sensitive one, but the experimental branch should help
> > > > iterate and drive the kernel development for wake up path + push lb instead of
> > > > using load balance which I really doubt will work well in practice since this
> > > > is slow to react, and you're relying on overcommitting the system by default by
> > > > making every task of every process data dependent and require it to be
> > > > co-located.
> > >
> > > I am not certain which strategy is preferable, as it largely depends
> > > on the use case and workload. We intend to evaluate push-based load
> > > balancing on top of the existing lb-based cache-aware placement logic.
> >
> > I'll defer to Vincent here, but I would have thought lb-based approach can go
> > away.
> >
> > >
> > > > I think in practice admins will care about specific applications to
> > > > be kept within a single LLC, and if they are willing to spend the effort, they
> > > > can tag specific tasks of a specific application.
> > > >
> > >
> > > It seems to me that there are multiple use cases. In one scenario,
> > > the administrator (including daemons) is responsible for tagging
> > > workloads. In another, users prefer the OS to handle automatic
> > > placement without any userspace involvement.
> >
> > How do you define this automatic placement? AFAICS you're just grouping all
> > tasks of a specific process to stay within the same LLC and hitting overcommit
> > issues which you're working around with this load balancer only approach?
>
> The LLC chosen for aggregation (preferred LLC) is the one with most occupancy
> by tasks in a process.
>
> However, aggregation needs to be done with the load in both the target LLC and
> the current LLC in mind. It is better to keep a task in its current LLC if
> plenty of idle CPUs are available than to move it to an LLC where most of the
> threads are but the CPUs are frequently busy. This is the main reason why we
> put the migration logic in the load balancer, where accurate load information
> is available and we could put in a load aware migration policy.
>
> It is fine to migrate tasks in the wake up path. But we need to resolve
> the issue of over-aggregation, when multiple CPUs may push tasks
> to an LLC independently of each other. Things get worse, with tasks
> bouncing frequently, if we over-aggregate and have to migrate tasks
> out of the LLC again. We encountered such issues in our earlier
> implementations that had task migrations in the wake up path.
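
To make sure I read the policy right, here is a rough sketch of the decision
as described above: pick the LLC with the most tasks of the process, but only
move a task there if that LLC has headroom or is no busier than the current
one. The function names, the utilization map and the 0.5 threshold are all
made up for illustration and are not taken from the series.

```python
from collections import Counter


def preferred_llc(task_llcs):
    """Return the LLC id that hosts the most tasks of one process."""
    counts = Counter(task_llcs)
    # Ties go to the lowest LLC id; an arbitrary choice made for this sketch.
    return min(counts, key=lambda llc: (-counts[llc], llc))


def should_migrate(cur_llc, task_llcs, llc_util, max_util=0.5):
    """Hypothetical load aware check before aggregating a task.

    llc_util maps LLC id -> fraction of busy CPUs (0.0 to 1.0).
    max_util is an assumed threshold, not a value from the patches.
    """
    target = preferred_llc(task_llcs)
    if target == cur_llc:
        return False
    # Move only if the preferred LLC has headroom, or is at least
    # no busier than the LLC the task is currently running in.
    return llc_util[target] <= max_util or llc_util[target] <= llc_util[cur_llc]
```

With three tasks in LLC 0 and one in LLC 1, the lone task stays put while
LLC 0 is 90% busy and LLC 1 is mostly idle, and moves once LLC 0 has room.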

Hmm, I am still struggling to see how we can end up with these problems when
only a few processes ask to be in the same LLC. With this series all processes
are grouped, and I can see how that leads to excessive ping-ponging.

I acknowledge the task might not be straightforward due to the lack of
accurate load information. Where I struggle is the idea that this can't be
fixed at all in a more controlled and relaxed environment. We are essentially
simplifying the problem so the kernel has an easier job and can do better.

Anyway. My goal is to help simplify the kernel details and defer the policy to
the userspace completely. I am happy to work with you and Vincent to make sure
we can address general QoS placement details on NUMA systems.

FWIW, if we want to implement wake up based on latency, we hit the exact same
problems. That's why I am stressing this point. We need to be able to extend
and scale.

>
>
> >
> > I think in practice there will be many corner cases where state is not optimal
> > and we'd end up with heuristics to 'balance' things out and sensitivity to
> > independent changes disturbing this fragile balance causing weird regressions
> > and us slowly having less flexibility to move and shuffle code (okay, maybe
> > too much doom and gloom, but we've been through this in the past :)).
> >
> > I am not sure how many of these tests stressed the system with multiple
> > critical processes running concurrently?
> >
> > By making it a userspace problem they have to figure out the right balance and
> > we can focus on providing the right mechanism.
> >
> > >
> > > > Also QoS IMHO should be viewed as a scarce resource. For best effort delivery
> > > > (which is the best we can do in reality, this is not a hard real time system), it
> > > > is easier to provide good best effort when the average noise level is low, ie:
> > > > few tasks are required to be kept within the same LLC. If we overcommit often,
> > > > we will crumble often. So IMHO the key is to delegate to userspace to tag, and
> > >
> > > I suppose there are two scenarios. The first is enabling/disabling
> > > aggregation for a group of tasks, and the second is task tagging. For the
> > > first scenario, this can be applied either process-wide or cgroup-wide by
> > > providing a flag,
> >
> > Cgroup-wide tagging doesn't make sense IMO. Process-wide yes.
> >
>
> I think this depends on the usage scenario. In private discussion with
> Vern from Tencent, he mentioned that such cgroup based tagging is useful
> for them.

We all want ponies :)

I think this needs a why. It doesn't make sense to group processes in general.
It seems this requirement is tied to an elaborate setup, and it forces the
kernel to deal with that setup in a generic manner.

Anyway, with the tagging approach we can easily allow process level LLC sharing
via a simple description like:

	// shared cookie definition
	{
		"WEB_SERVICE_COOKIE": ["nginx", "postgresql"],
		"TRANSCODING_COOKIE": ["decoder", "encoder"]
	}

This simply tells the utility to reuse one cookie for all the processes listed
under a key, with the key acting as the unique identifier.
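
As a rough sketch of what the utility could do with such a description: parse
it, map every process name to its key, and hand out one cookie per key. The
function names and the small integer cookie ids are placeholders I made up for
illustration; the actual schedqos format and cookie generation may well differ.

```python
import json

CONFIG = """
// shared cookie definition
{
"WEB_SERVICE_COOKIE": [ "nginx", "postgresql"],
"TRANSCODING_COOKIE": [ "decoder", "encoder"]
}
"""


def load_cookie_map(text):
    """Map each process name to the key whose cookie it should reuse."""
    # JSON has no comments, so drop whole-line '//' comments before parsing.
    body = "\n".join(
        line for line in text.splitlines()
        if not line.lstrip().startswith("//")
    )
    comm_to_key = {}
    for key, comms in json.loads(body).items():
        for comm in comms:
            if comm in comm_to_key:
                raise ValueError(f"{comm!r} is listed under two keys")
            comm_to_key[comm] = key
    return comm_to_key


def assign_cookies(comm_to_key):
    """Give every key one cookie id; processes under the same key share it."""
    # Placeholder ids only: a real utility might need the kernel to
    # generate unique cookies instead of counting from 1.
    keys = sorted(set(comm_to_key.values()))
    key_ids = {key: i + 1 for i, key in enumerate(keys)}
    return {comm: key_ids[key] for comm, key in comm_to_key.items()}
```

Calling assign_cookies(load_cookie_map(CONFIG)) gives nginx and postgresql the
same id, and decoder and encoder a different shared id.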

By the way, cookie generation might need kernel help to create a unique id.

Still, if someone wants such an elaborate setup, the first thing to suggest is
static partitioning via cpuset. Do you know why this is not sufficient?