Re: [PATCH 1/2] sched/isolation: Make use of more than one housekeeping cpu

From: Waiman Long
Date: Tue Feb 18 2025 - 10:34:43 EST


On 2/18/25 10:30 AM, Phil Auld wrote:
> On Tue, Feb 18, 2025 at 10:23:50AM -0500 Waiman Long wrote:
>> On 2/18/25 10:00 AM, Phil Auld wrote:
>>> Hi Vishal.
>>>
>>> On Fri, Feb 14, 2025 at 11:08:19AM +0530 Vishal Chourasia wrote:
>>>> Hi Phil, Vineeth
>>>>
>>>> On Thu, Feb 13, 2025 at 09:26:53AM -0500, Phil Auld wrote:
>>>>> On Thu, Feb 13, 2025 at 10:14:04AM +0530 Madadi Vineeth Reddy wrote:
>>>>>> Hi Phil Auld,
>>>>>>
>>>>>> On 11/02/25 19:31, Phil Auld wrote:
>>>>>>> The existing code uses housekeeping_any_cpu() to select a cpu for
>>>>>>> a given housekeeping task. However, this often ends up calling
>>>>>>> cpumask_any_and(), which is defined as cpumask_first_and() and so
>>>>>>> always uses the first cpu among those available.
>>>>>>>
>>>>>>> The same applies when multiple NUMA nodes are involved. In that
>>>>>>> case the first cpu in the local node is chosen, which does provide
>>>>>>> a bit of spreading, but with multiple HK cpus per node the same
>>>>>>> issues arise.
>>>>>>>
>>>>>>> Spread the HK work out by having housekeeping_any_cpu() and
>>>>>>> sched_numa_find_closest() use cpumask_any_and_distribute()
>>>>>>> instead of cpumask_any_and().
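
For illustration, here is a minimal userspace sketch of the two selection
behaviors being swapped (not the kernel implementation; first_and() and
distribute() are made-up names, and cpumasks are modeled as 64-bit words):

/*
 * first_and() mimics cpumask_first_and(): always the lowest set bit
 * of (a & b). distribute() mimics the rotating start point that
 * cpumask_any_and_distribute() uses to spread successive picks.
 */
#include <stdio.h>

static unsigned int first_and(unsigned long a, unsigned long b)
{
        unsigned long m = a & b;

        return m ? (unsigned int)__builtin_ctzl(m) : 64;  /* 64 == none */
}

static unsigned int distribute(unsigned long a, unsigned long b)
{
        static unsigned int next;       /* rotates across calls */
        unsigned long m = a & b;
        unsigned int i, bit;

        if (!m)
                return 64;
        for (i = 0; i < 64; i++) {
                bit = (next + i) % 64;
                if (m & (1UL << bit)) {
                        next = bit + 1;
                        return bit;
                }
        }
        return 64;
}

int main(void)
{
        unsigned long hk = 0xf;         /* HK cpus 0-3, all online */
        int i;

        for (i = 0; i < 6; i++)
                printf("first=%u distribute=%u\n",
                       first_and(hk, ~0UL), distribute(hk, ~0UL));
        return 0;
}

Across the six calls, first_and() picks cpu 0 every time, while
distribute() cycles 0, 1, 2, 3, 0, 1; that cycling is the spreading
the patch is after.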

>>>>>> Got the overall intent of the patch for better load distribution of
>>>>>> housekeeping tasks. However, one potential drawback is that spreading
>>>>>> HK work across multiple CPUs might reduce the time that some cores
>>>>>> can spend in deeper idle states, which can be beneficial for
>>>>>> power-sensitive systems.
>>>>>>
>>>>>> Thoughts?
>>>>> NOHZ_full setups are not generally used in power-sensitive systems, I think.
>>>>> They aren't in our use cases at least.
>>>>>
>>>>> In cases with many cpus a single housekeeping cpu cannot keep up. Having
>>>>> other HK cpus in deep idle states while the one in use is overloaded is
>>>>> not a win.
>>>> To me, an overloaded CPU sounds like one where more than one task is ready
>>>> to run, and a HK CPU is one receiving periodic scheduling clock
>>>> ticks, so a HK CPU is bound to come out of any power-saving state it is in.
>>> If the overload is caused by HK work and interrupts there is nothing in the
>>> system to help. Tasks, sure, can get load balanced.
>>>
>>> And as you say, the HK cpus will generally have ticks happening anyway.
>>>
>>>>> If your single HK cpu can keep up then only configure that one HK cpu.
>>>>> The others will go idle and stay there. And since they are nohz_full
>>>>> they might get to stay idle even longer.
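
(For concreteness, with a plain nohz_full= setup the HK cpus are simply
the ones left out of the range. On a 16-cpu machine, booting with

        nohz_full=1-15

leaves cpu 0 as the lone HK cpu, while

        nohz_full=2-15

keeps cpus 0-1 for housekeeping.)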
>>>> While it is good to distribute the load across each HK CPU in the HK
>>>> cpumask (queuing jobs on different CPUs each time), this can cause
>>>> jitter in virtualized environments, unnecessarily evicting other
>>>> tenants, when it's better to overload a VP than to wake up other VPs of a
>>>> tenant.

>>> Sorry, I'm not sure I understand your setup. Are you running virtual
>>> tenants on the HK cpus? nohz_full in the guests? Maybe you only need
>>> one HK cpu, and then it won't matter.
>>>
>>> My concern is that currently there is no point in having more than
>>> one HK cpu (per node, in the NUMA case). The code as currently implemented
>>> is just not doing what it needs to.
>>>
>>> We have numerous cases where a single HK cpu just cannot keep up and
>>> the remote_tick warning fires. It can also lead to the other things
>>> (orchestration sw, HA keepalives, etc.) on the HK cpus getting starved,
>>> which leads to other issues. In these cases we recommend increasing
>>> the number of HK cpus. But... that only helps the userspace tasks
>>> somewhat. It does not help the actual housekeeping part.
>> That is the part that should go into the commit log as well, as it is the
>> rationale behind your patch.

> Sure, I can add that piece and resend.

While at it, you can also add some text to address the other concerns that
reviewers have raised so far.

Cheers,
Longman



> Cheers,
> Phil
>
>> Cheers,
>> Longman