Re: [Patch v4 00/22] Cache aware scheduling

From: Tim Chen

Date: Tue Apr 21 2026 - 16:57:49 EST

On Tue, 2026-04-21 at 01:34 +0100, Qais Yousef wrote:
> On 04/20/26 17:01, Chen, Yu C wrote:
> > On 4/16/2026 8:27 AM, Qais Yousef wrote:
> > > On 04/01/26 14:52, Tim Chen wrote:
> >
> > [ ... ]
> >
> > >
> > > I posted schedqos announcement yesterday, which I think (hope) would be the
> > > right way to address these concerns about tagging tasks.
> > >
> > > https://lore.kernel.org/lkml/20260415000910.2h5misvwc45bdumu@airbuntu/
> > >
> >

I think that's great. Will be a nice way to tag tasks that should be grouped
and aggregated together.

> > Thanks, I'll take a look at this.
> >
> > > It would be trivial to add experimental branch to add new QoS flavour to say
> > > NUMA_SENSITIVE etc. I am still trying to think of a generic description to
> > > address a number of use cases (see Execution Profiles in README.md), not just
> > > this particular numa sensitive one, but the experimental branch should help
> > > iterate and drive the kernel development for wake up path + push lb instead of
> > > using load balance which I really doubt will work well in practice since this
> > > is slow to react, and you're relying on overcommitting the system by default by
> > > making every task of every process data dependent and require it to be
> > > co-located.
> >
> > I am not certain which strategy is preferable, as it largely depends
> > on the use case and workload. We intend to evaluate push-based load
> > balancing on top of the existing lb-based cache-aware placement logic.
>
> I'll defer to Vincent here, but I would have thought lb-based approach can go
> away.
>
> >
> > > I think in practice admins will care about specific applications to
> > > be kept within a single LLC, and if they are willing to spend the effort, they
> > > can tag specific tasks of a specific application.
> > >
> >
> > It seems to me that there are multiple use cases. In one scenario,
> > the administrator (including daemons) is responsible for tagging
> > workloads. In another, users prefer the OS to handle automatic
> > placement without any userspace involvement.
>
> How do you define this automatic placement? AFAICS you're just grouping all
> tasks of a specific process to stay within the same LLC and hitting overcommit
> issues which you're working around with this load balancer only based approach?

The LLC chosen for aggregation (the preferred LLC) is the one with the
highest occupancy by the tasks of a process.
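
As a rough illustration of that selection rule (a sketch only, not code
from the patch set): pick_preferred_llc() and the occupancy[] array are
hypothetical names, and breaking ties toward the lowest index is an
arbitrary assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the LLC that currently hosts
 * the most tasks of one process. occupancy[i] is the number of that
 * process's tasks running in LLC i. Ties go to the lowest index, which
 * is an arbitrary choice made for this sketch.
 */
static size_t pick_preferred_llc(const unsigned int *occupancy, size_t nr_llcs)
{
	size_t best = 0;

	assert(occupancy && nr_llcs > 0);

	for (size_t i = 1; i < nr_llcs; i++) {
		if (occupancy[i] > occupancy[best])
			best = i;
	}
	return best;
}
```

Occupancy alone only names the preferred LLC; whether a task should
actually move there is a separate decision.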

However, aggregation needs to take into account the load in both the
target LLC and the current LLC. It is better to keep a task in its
current LLC if plenty of idle CPUs are available there than to move it
to an LLC where most of the threads are but the CPUs are frequently
busy. This is the main reason why we put the migration logic in the load
balancer, where accurate load information is available and we can apply
a load-aware migration policy.
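
To make the trade-off concrete, here is a minimal sketch of such a
load-aware check. It is not the logic in the patches: struct llc_load,
the 25% "plenty of idle CPUs" cutoff and the three outcomes are all
assumptions picked for illustration.

```c
#include <assert.h>

/* Hypothetical per-LLC snapshot; the field names are illustrative only. */
struct llc_load {
	unsigned int nr_cpus;      /* CPUs sharing this LLC */
	unsigned int nr_idle_cpus; /* CPUs currently idle */
};

enum llc_decision {
	LLC_MIGRATE, /* preferred LLC has room: aggregate there */
	LLC_STAY,    /* preferred LLC is busy, current one has room */
	LLC_DEFER,   /* both busy: assumed to be left to regular balancing */
};

/* Assumed cutoff: "plenty" means at least a quarter of the CPUs are idle. */
#define PLENTY_IDLE_PCT 25

static int llc_has_plenty_idle(const struct llc_load *l)
{
	assert(l && l->nr_cpus > 0);
	return l->nr_idle_cpus * 100 >= l->nr_cpus * PLENTY_IDLE_PCT;
}

static enum llc_decision llc_migration_decision(const struct llc_load *cur,
						const struct llc_load *pref)
{
	if (llc_has_plenty_idle(pref))
		return LLC_MIGRATE;
	if (llc_has_plenty_idle(cur))
		return LLC_STAY;
	return LLC_DEFER;
}
```

The LLC_STAY branch is the case described above: the task's siblings
are elsewhere, but the idle CPUs are here.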

It is fine to migrate tasks in the wake up path. But we need to resolve
the issue of over-aggregation, where multiple CPUs may push tasks
to an LLC independently of each other. If we over-aggregate and have to
migrate tasks out of the LLC again, things get worse, with tasks
bouncing frequently. We encountered such issues in our earlier
implementations that had task migrations in the wake up path.
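
A toy model of the over-aggregation problem, purely for illustration:
each pushing CPU decides from the same stale snapshot of free CPUs in
the target LLC, and nothing accounts for pushes made by the others.
uncoordinated_overshoot() is a hypothetical name and this models the
problem only, not any fix.

```c
#include <assert.h>

/*
 * Hypothetical model: nr_pushers CPUs each read the same snapshot of
 * free_cpus in the target LLC and push one task if they see any room.
 * The snapshot is never updated between pushers, which is the point.
 * Returns how many tasks land beyond the free capacity.
 */
static unsigned int uncoordinated_overshoot(unsigned int free_cpus,
					    unsigned int nr_pushers)
{
	unsigned int pushed = 0;

	for (unsigned int cpu = 0; cpu < nr_pushers; cpu++) {
		/* Stale view: free_cpus is not decremented after a push. */
		if (free_cpus > 0)
			pushed++;
	}
	return pushed > free_cpus ? pushed - free_cpus : 0;
}
```

Every task counted in the overshoot would have to be migrated out of
the LLC again, which is the bouncing described above.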

>
> I think in practice there will be many corner cases where state is not optimal
> and we'd end up with heuristics to 'balance' things out and sensitivity to
> independent changes disturbing this fragile balance causing weird regressions
> and us slowly having less flexibility to move and shuffle code (okay, maybe too
> much doom and gloom, but we've been through this in the past :)).
>
> I am not sure how many of these tests stressed the system with multiple
> critical processes running concurrently?
>
> By making it a userspace problem they have to figure out the right balance and
> we can focus on providing the right mechanism.
>
> >
> > > Also QoS IMHO should be viewed as a scarce resource. For best effort delivery
> > > (which is the best we can do in reality, this is not hard real time system), it
> > > is easier to provide good best effort when the average noise level is low, ie:
> > > few tasks are required to be kept within the same LLC. If we overcommit often,
> > > we will crumble often. So IMHO the key is to delegate to userspace to tag, and
> >
> > I suppose there are two scenarios. The first is enabling/disabling
> > aggregation for a group of tasks, and the second is task tagging. For
> > the first scenario, this can be applied either process-wide or
> > cgroup-wide by providing a flag,
>
> Cgroup-wide tagging doesn't make sense IMO. Process-wide yes.
>

I think this depends on the usage scenario. In a private discussion,
Vern from Tencent mentioned that such cgroup-based tagging is useful
for them.

Tim

> What does it mean to group all processes in the same cgroup from cache locality
> PoV? It just seems random setup based on something specific in userspace on how
> these cgroups are setup that assumes one process per group? I don't think we
> can generalize if that's the case.
>
> Admins can use cpuset to statically partition based on cgroup if they want to
> ensure a group of processes are confined to the same LLC?
>
> > without requiring users to explicitly tag individual tasks. The second
> > scenario is an enhancement to support fine-grained control over a
> > specific task. If schedqos only supports scenario2, the user has to
> > tag every task to support scenario1.
>
> You can do tagging process-wide (as I mentioned in the announcement, I think
> it's a poor man's way for quick tagging until people learn to do better), not
> just per-task eg:
>
> {
>     "PostgreSQL": {
>         "qos": ["QOS_USER_INTERACTIVE", "QOS_NUMA_SENSITIVE"]
>     }
> }
>
> which will tag every task forked by this binary with a cookie and a QoS data
> dependency tag, telling the scheduler that all tasks with the same cookie need
> to stay within the same LLC.
>
> We can bikeshed naming and the tagging details, but the actual implementation
> principle should be the same: keep tasks with the same cookie and data dep tag
> on the same LLC at wake up; and let push lb handle occasional strayed task.
>
> If users want to be smart and tag specific tasks only, the implementation would
> be identical, it's just there are fewer tasks tagged.