Re: [PATCH 1/2] sched/fair: Record the average duration of a task

From: Raghavendra K T
Date: Wed Jul 03 2024 - 04:35:10 EST




On 7/1/2024 8:27 PM, Chen Yu wrote:
> Hi Mike,

> On 2024-07-01 at 08:57:25 +0200, Mike Galbraith wrote:
> > On Sun, 2024-06-30 at 21:09 +0800, Chen Yu wrote:
> > > Hi Mike,

> > > Thanks for your time and for the insights.

> > > According to a test conducted last month on a system with 500+ CPUs,
> > > where 4 CPUs share the same L2 cache, around a 20% improvement was
> > > observed (though not as much as on the platform without a shared L2).
> > > I haven't delved into the details yet, but my understanding is that
> > > L1 cache-to-cache latency within the L2 domain might also matter on
> > > large servers (which I need to investigate further).

> > 1:N or M:N tasks can approach their wakeup frequency range, and there's
> > nothing you can do about the very same cache-to-cache latency you're
> > trying to duck; it just is what it is, and is considered perfectly fine
> > as it is. That's a bit of a red flag, but worse is the lack of knowledge
> > wrt what tasks are actually up to at any given time. We rashly presume
> > that tasks waking one another implies a 1:1 relationship; we routinely
> > call them buddies and generally get away with it.. but during any
> > overlap they can be doing anything, including N-way data sharing, and
> > regardless of what that is and its section size, needless stacking
> > flushes concurrency, injecting service latency in its place, cost
> > unknown.


> I believe this is a generic issue that the current scheduler faces: it
> attempts to predict a task's behavior from its runtime. For instance,
> task_hot() checks the task's recent runtime to predict whether the task
> is cache-hot, regardless of what the task actually does during its time
> slice. This is also the case with WF_SYNC, which gives the scheduler a
> hint to wake the wakee on the current CPU to potentially benefit from
> cache locality.
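
For reference, that heuristic boils down to a pure recency check. A
simplified paraphrase of the core of task_hot() in kernel/sched/fair.c
(not verbatim; the early exits for the migration-cost tunables, SMT,
etc. are omitted):

/*
 * Sketch: a task that last ran within sysctl_sched_migration_cost ns
 * on its source runqueue is presumed cache-hot, with no knowledge of
 * what it actually touched while running.
 */
static int task_hot_sketch(struct task_struct *p, struct rq *src_rq)
{
	s64 delta = rq_clock_task(src_rq) - p->se.exec_start;

	return delta < (s64)sysctl_sched_migration_cost;
}

It knows when the task last ran, not what the task touched.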

> A thought occurred to me that one possible way to determine whether the
> waker and wakee share data could be to leverage NUMA balancing's
> numa_group data structure. Since NUMA balancing periodically scans a
> task's VMA space and groups tasks that access the same physical pages
> into one numa_group, we can infer that if the waker and wakee are in
> the same numa_group, they are likely to share data, and it might be
> appropriate to place the wakee on top of the waker.

> CC Raghavendra here in case he has any insights.


Agree with your thought here.
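
As a minimal illustration of the basic check (a hypothetical helper, not
in the kernel tree; p->numa_group is RCU-protected and is only populated
once NUMA balancing has actually grouped the tasks, so NULL here just
means "no information"):

/* Hypothetical sketch, assuming CONFIG_NUMA_BALANCING. */
static bool tasks_share_numa_group(struct task_struct *waker,
				   struct task_struct *wakee)
{
	struct numa_group *ng_waker, *ng_wakee;
	bool share;

	rcu_read_lock();
	ng_waker = rcu_dereference(waker->numa_group);
	ng_wakee = rcu_dereference(wakee->numa_group);
	/* One shared group means they faulted on common pages. */
	share = ng_waker && ng_waker == ng_wakee;
	rcu_read_unlock();

	return share;
}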

So I imagine two possible things to explore here:

1) Use task1's and task2's numa_group: check whether they belong to the
same numa_group (as in the sketch above), and also check whether an M:N
relationship is possible, e.g. by testing t1/t2->numa_group->nr_tasks > 1,
etc.

2) Given a VMA, we could use vma_numab_state's pids_active[] to tell
whether task1 and task2 (threads) are possibly interested in the same
VMA (rough sketch below). The latter looks practically difficult,
though, because we presumably don't want to sweep across all VMAs.
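
For 2), the per-VMA test itself could mirror what vma_is_accessed() in
kernel/sched/fair.c already does (again a hypothetical helper, not in
the tree; pids_active[] stores hash_32()-hashed PIDs, so hash collisions
can produce false positives, and a VMA the scanner has not reached yet
tells us nothing):

/* Hypothetical sketch, assuming CONFIG_NUMA_BALANCING. */
static bool vma_accessed_by_both(struct vm_area_struct *vma,
				 struct task_struct *t1,
				 struct task_struct *t2)
{
	unsigned long pids;

	if (!vma->numab_state)
		return false;

	/* Recent-accessor filter maintained by the NUMA balancing scan. */
	pids = vma->numab_state->pids_active[0] |
	       vma->numab_state->pids_active[1];

	return test_bit(hash_32(t1->pid, ilog2(BITS_PER_LONG)), &pids) &&
	       test_bit(hash_32(t2->pid, ilog2(BITS_PER_LONG)), &pids);
}

The test itself is cheap; the hard part is deciding which VMA(s) to
test at wakeup time, which is why sweeping across them all looks
impractical.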

> thanks,
> Chenyu