Re: [RFC PATCH v5 0/9] steal tasks to improve CPU utilization

From: chenjinghuang

Date: Fri Mar 27 2026 - 22:50:33 EST

On 3/20/2026 4:53 PM, Peter Zijlstra wrote:
> On Fri, Mar 20, 2026 at 05:59:11AM +0000, Chen Jinghuang wrote:
>> When a CPU has no more CFS tasks to run, and idle_balance() fails to
>> find a task, then attempt to steal a task from an overloaded CPU in the
>> same LLC. Maintain and use a bitmap of overloaded CPUs to efficiently
>> identify candidates. To minimize search time, steal the first migratable
>> task that is found when the bitmap is traversed. For fairness, search
>> for migratable tasks on an overloaded CPU in order of next to run.
>
> This makes no sense. It is the task of newidle to get more work -- if
> that is failing for you, then we should fix that, not build a second way
> to get tasks.
>
>
>
Hi Peter,

That is a very valid point. However, profiling data indicates that
newidle_balance() still incurs excessive scanning overhead. Sharing the
expensive tick balance path is becoming an issue on high-core-count
systems.

As highlighted in recent ILB threads, profiling sqlite on a 224-CPU
Sapphire Rapids shows:

https://lore.kernel.org/all/cover.1690273854.git.yu.c.chen@xxxxxxxxx/

6.69% 0.09% sqlite3 [kernel.kallsyms] [k] newidle_balance
5.39% 4.71% sqlite3 [kernel.kallsyms] [k] update_sd_lb_stats

Walking the sd hierarchy in update_sd_lb_stats alone consumes over 5% of
CPU cycles. If we spend all that time scanning just to pull a tiny,
short-lived task, the search overhead dwarfs the actual runtime. It's a net
loss.

To mitigate this cost, I'd like to check if you are open to these two
directions:

1.Goal: As Tim questioned in the thread above:

"Do we always have to find the busiest group and pull from it? Would a
relatively busy group be enough?"

Instead of chasing absolute fairness, shouldn't newidle_balance()
prioritize fast task acquisition? A task-stealing mechanism can be
effective in this regard.

2.Refactoring: Instead of a standalone fix, we could track LLC overload in
sched_domain_shared and add a lightweight fast path atop newidle_balance(),
gated by rq->avg_idle.

Do you think refactoring along these lines to integrate into the existing
framework makes sense?

Thanks, Chen Jinghuang