Re: [Patch v4 00/22] Cache aware scheduling

From: Qais Yousef

Date: Mon Apr 20 2026 - 20:34:49 EST

On 04/20/26 17:01, Chen, Yu C wrote:
> On 4/16/2026 8:27 AM, Qais Yousef wrote:
> > On 04/01/26 14:52, Tim Chen wrote:
>
> [ ... ]
>
> >
> > I posted schedqos announcement yesterday, which I think (hope) would be the
> > right way to address these concerns about tagging tasks.
> >
> > https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
> >
>
> Thanks, I'll take a look at this.
>
> > It would be trivial to add experimental branch to add new QoS flavour to say
> > NUMA_SENSITIVE etc. I am still trying to think of a generic description to
> > address a number of use cases (see Execution Profiles in README.md), not just
> > this particular numa sensitive one, but the experimental branch should help
> > iterate and drive the kernel development for wake up path + push lb instead of
> > using load balance which I really doubt will work well in practice since this
> > is slow to react, and you're relying on overcommitting the system by default by
> > making every task of every process data dependent and require it to be
> > co-located.
>
> I am not certain which strategy is preferable, as it largely depends
> on the use case and workload. We intend to evaluate push-based load
> balancing on top of the existing lb-based cache-aware placement logic.

I'll defer to Vincent here, but I would have thought the lb-based approach
could go away.

>
> > I think in practice admins will care about specific applications to
> > be kept within a single LLC, and if they are willing to spend the effort, they
> > can tag specific tasks of a specific application.
> >
>
> It seems to me that there are multiple use cases. In one scenario,
> the administrator (including daemons) is responsible for tagging
> workloads. In another, users prefer the OS to handle automatic
> placement without any userspace involvement.

How do you define this automatic placement? AFAICS you're just grouping all
tasks of a specific process to stay within the same LLC and hitting overcommit
issues, which you're working around with this load-balancer-only approach?

I think in practice there will be many corner cases where the state is not
optimal, and we'd end up with heuristics to 'balance' things out, with
sensitivity to independent changes disturbing this fragile balance, causing
weird regressions and slowly leaving us less flexibility to move and shuffle
code (okay, maybe too much doom and gloom, but we've been through this in the
past :)).

I am not sure how many of these tests stressed the system with multiple
critical processes running concurrently.

By making it a userspace problem, userspace has to figure out the right balance
and we can focus on providing the right mechanism.

>
> > Also QoS IMHO should be viewed as a scarce resource. For best effort delivery
> > (which is the best we can do in reality, this is not hard real time system), it
> > is easier to provide good best effort when the average noise level is low, ie:
> > few tasks are required to be kept within the same LLC. If we overcommit often,
> > we will crumble often. So IMHO the key is to delegate to userspace to tag, and
>
> I suppose there are two scenarios. The first is enabling/disabling aggregation
> for a group of tasks, and the second is task tagging. For the first scenario,
> this can be applied either process-wide or cgroup-wide by providing a flag,

Cgroup-wide tagging doesn't make sense IMO. Process-wide yes.

What does it mean to group all processes in the same cgroup from a cache
locality PoV? It just seems like a random setup that depends on how userspace
happens to set up these cgroups, and that assumes one process per group? I
don't think we can generalize if that's the case.

Admins can use cpuset to statically partition based on cgroup if they want to
ensure a group of processes is confined to the same LLC.

> without requiring users to explicitly tag individual tasks. The second scenario
> is an enhancement to support fine-grained control over a specific task. If
> schedqos only supports scenario2, the user has to tag every task to support
> scenario1.

You can do tagging process-wide (as I mentioned in the announcement, I think
it's a poor man's way for quick tagging until people learn to do better), not
just per-task, e.g.:

{
    "PostgreSQL": {
        "qos": ["QOS_USER_INTERACTIVE", "QOS_NUMA_SENSITIVE"]
    }
}

which will tag every task forked by this binary with a cookie and a QoS data
dependency tag, telling the scheduler that all tasks with the same cookie need
to stay within the same LLC.
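
To make the process-wide case concrete, here is a rough userspace-side sketch
of how a config in the shape of the JSON above could expand into per-task tags.
It is illustrative only: tag_tasks(), the (tid, binary) input and the choice of
one cookie per binary name are assumptions made for this example, not the
actual schedqos implementation or any kernel interface.

```python
import json
from itertools import count

# Config in the same shape as the example above.
CONFIG = json.loads("""
{
    "PostgreSQL": {
        "qos": ["QOS_USER_INTERACTIVE", "QOS_NUMA_SENSITIVE"]
    }
}
""")

# Cookie values are arbitrary; only equality between tasks matters.
_next_cookie = count(1)


def tag_tasks(config, tasks):
    """Expand a process-wide config into per-task tags (hypothetical helper).

    tasks is an iterable of (tid, binary_name) pairs.
    Returns {tid: (cookie, qos_tuple)}. Tasks whose binary is not listed
    in the config are left untagged.
    """
    cookies = {}  # binary name -> cookie (assumption: one cookie per binary)
    tags = {}
    for tid, binary in tasks:
        entry = config.get(binary)
        if entry is None:
            continue
        if binary not in cookies:
            cookies[binary] = next(_next_cookie)
        tags[tid] = (cookies[binary], tuple(entry["qos"]))
    return tags
```

Tagging only specific tasks would go through the same path with a shorter task
list, so nothing else needs to change.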

We can bikeshed naming and the tagging details, but the actual implementation
principle should be the same: keep tasks with the same cookie and data dep tag
on the same LLC at wake up, and let push lb handle the occasional stray task.
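
As a toy model of that principle (not kernel code), the wake-up decision and
the set of tasks left for push lb could look roughly like the sketch below.
Everything in it is an assumption for illustration: the fixed CPUS_PER_LLC
topology, the "LLC hosting most tasks with this cookie" preference, and the
function names.

```python
from collections import Counter

# Assumed toy topology: CPUs 0-3 share LLC 0, CPUs 4-7 share LLC 1, and so on.
CPUS_PER_LLC = 4


def llc_of(cpu):
    return cpu // CPUS_PER_LLC


def preferred_llc(cookie, running):
    """running maps tid -> (cookie, cpu) for tasks that are already placed.

    Returns the LLC hosting the most tasks with this cookie (lowest LLC id
    wins a tie), or None if no task with this cookie is placed yet.
    """
    counts = Counter(llc_of(cpu) for c, cpu in running.values() if c == cookie)
    if not counts:
        return None
    return min(counts, key=lambda llc: (-counts[llc], llc))


def select_cpu_at_wakeup(cookie, prev_cpu, idle_cpus, running):
    """Pick an idle CPU in the cookie's preferred LLC; otherwise keep prev_cpu."""
    if cookie is None:
        return prev_cpu  # untagged task: no co-location constraint
    llc = preferred_llc(cookie, running)
    if llc is None:
        return prev_cpu  # first task of the group: nothing to follow yet
    candidates = sorted(cpu for cpu in idle_cpus if llc_of(cpu) == llc)
    # Best effort only: if the preferred LLC has no idle CPU, stay put.
    return candidates[0] if candidates else prev_cpu


def stray_tasks(cookie, running):
    """Tasks with this cookie sitting outside the preferred LLC (push lb candidates)."""
    llc = preferred_llc(cookie, running)
    return sorted(tid for tid, (c, cpu) in running.items()
                  if c == cookie and llc_of(cpu) != llc)
```

With tasks 1 and 2 of a cookie on CPUs 0 and 1 and task 3 on CPU 5, a waking
task with the same cookie is steered to an idle CPU in LLC 0 if one exists, and
task 3 is the one reported as a stray.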

If users want to be smart and tag specific tasks only, the implementation would
be identical; there would just be fewer tasks tagged.