Re: Re: [Internet] Re: Re: [PATCH 06/19] sched/fair: Assign preferred LLC ID to processes

From: Tim Chen

Date: Tue Oct 14 2025 - 16:13:08 EST

On Tue, 2025-10-14 at 15:07 +0800, vernhao(郝信) wrote:
> Hi Tim,
>
> >
> >
[snip]

> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 61c129bde8b6..d6167a029c47 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1312,6 +1312,7 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> >  	struct mm_struct *mm = p->mm;
> >  	struct mm_sched *pcpu_sched;
> >  	unsigned long epoch;
> > +	int mm_sched_llc = -1;
> >
> >  	if (!sched_cache_enabled())
> >  		return;
> > @@ -1342,6 +1343,12 @@ void account_mm_sched(struct rq *rq, struct task_struct *p, s64 delta_exec)
> >  		if (mm->mm_sched_cpu != -1)
> >  			mm->mm_sched_cpu = -1;
> >  	}
> > +
> > +	if (mm->mm_sched_cpu != -1)
> > +		mm_sched_llc = per_cpu(sd_llc_id, mm->mm_sched_cpu);
> >
> > In high-concurrency multi-threaded scenarios, not all threads handle the same events, so their hot data in the LLC is not completely shared.
> > Therefore, if every thread's preferred LLC is set to the LLC pointed to by mm->mm_sched_cpu, this leads to the incorrect
> > assumption that all threads prefer the same LLC, thereby intensifying competition between LLCs.
>
> Yes, that's the reason why we stop aggregating to the preferred LLC once the utilization of the
> LLC becomes too high relative to the other LLCs.
>
> But this approach is only a compensatory measure after the fact. The threads have already been incorrectly migrated to an LLC that is not their preferred one.
> Is there a better way to handle this situation?

The threads would stay where they were instead of migrating to a preferred LLC
that's overloaded.

>
> If you know beforehand which of your threads share data with
> each other, you can probably use cgroup/cpuset
> from user space to separate out the threads.
>
> Yes, this is a solution, and I am trying to implement it.
>
> There's not enough info in the occupancy data for the OS to group
> the threads by data sharing. Perhaps an alternative, if NUMA balancing
> is on, is to group tasks by their task numa group instead of by mm.
>
> This may not be a good solution either, especially for virtual machine scenarios, which have no NUMA.

If you are in a VM, the cache topology may not correspond to
real CPU cache topology and you probably should not enable cache
aware scheduling inside, unless you are doing some explicit
binding of VCPUs.

>
> That would incur the page scanning overhead etc. and make
> cache aware scheduling dependent on NUMA balancing.
>
>
> >
> > So I'm wondering, why not move ‘mm->mm_sched_cpu’ to ‘task_struct’, so that each thread can individually track its preferred LLC? What are the losses in doing so?
>
> You would need a way to group related tasks together and put them
> on the same LLC. Either group them by mm or some other means.
>
> Yes, you are right. How about this: besides 'mm', add cgroup support too?

Doing cgroup may not solve the original issue you brought
up, where a process may have a group of tasks wanting to go
into one cache and another group of tasks going to another cache.
I could be wrong but I don't think you can split up tasks in a process
in cgroup v2 to different cgroups.

Also the cgroup folks are quite resistant to adding new knobs.

Tim

>
> >
> > +
> > +	if (p->preferred_llc != mm_sched_llc)
> > +		p->preferred_llc = mm_sched_llc;
> >  }
> >
> > static void task_tick_cache(struct rq *rq, struct task_struct *p)