Re: [PATCH] sched/fair: Don't balance task to its current running CPU

From: Valentin Schneider
Date: Thu May 25 2023 - 06:09:02 EST


On 24/05/23 15:20, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
>
> We've run into a case where the balancer tries to balance a
> migration-disabled task and triggers the warning in set_task_cpu(), as
> shown below:
>
> ------------[ cut here ]------------
> WARNING: CPU: 7 PID: 0 at kernel/sched/core.c:3115 set_task_cpu+0x188/0x240
> Modules linked in: hclgevf xt_CHECKSUM ipt_REJECT nf_reject_ipv4 <...snip>
> CPU: 7 PID: 0 Comm: swapper/7 Kdump: loaded Tainted: G O 6.1.0-rc4+ #1
> Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V5.B221.01 12/09/2021
> pstate: 604000c9 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> pc : set_task_cpu+0x188/0x240
> lr : load_balance+0x5d0/0xc60
> sp : ffff80000803bc70
> x29: ffff80000803bc70 x28: ffff004089e190e8 x27: ffff004089e19040
> x26: ffff007effcabc38 x25: 0000000000000000 x24: 0000000000000001
> x23: ffff80000803be84 x22: 000000000000000c x21: ffffb093e79e2a78
> x20: 000000000000000c x19: ffff004089e19040 x18: 0000000000000000
> x17: 0000000000001fad x16: 0000000000000030 x15: 0000000000000000
> x14: 0000000000000003 x13: 0000000000000000 x12: 0000000000000000
> x11: 0000000000000001 x10: 0000000000000400 x9 : ffffb093e4cee530
> x8 : 00000000fffffffe x7 : 0000000000ce168a x6 : 000000000000013e
> x5 : 00000000ffffffe1 x4 : 0000000000000001 x3 : 0000000000000b2a
> x2 : 0000000000000b2a x1 : ffffb093e6d6c510 x0 : 0000000000000001
> Call trace:
> set_task_cpu+0x188/0x240
> load_balance+0x5d0/0xc60
> rebalance_domains+0x26c/0x380
> _nohz_idle_balance.isra.0+0x1e0/0x370
> run_rebalance_domains+0x6c/0x80
> __do_softirq+0x128/0x3d8
> ____do_softirq+0x18/0x24
> call_on_irq_stack+0x2c/0x38
> do_softirq_own_stack+0x24/0x3c
> __irq_exit_rcu+0xcc/0xf4
> irq_exit_rcu+0x18/0x24
> el1_interrupt+0x4c/0xe4
> el1h_64_irq_handler+0x18/0x2c
> el1h_64_irq+0x74/0x78
> arch_cpu_idle+0x18/0x4c
> default_idle_call+0x58/0x194
> do_idle+0x244/0x2b0
> cpu_startup_entry+0x30/0x3c
> secondary_start_kernel+0x14c/0x190
> __secondary_switched+0xb0/0xb4
> ---[ end trace 0000000000000000 ]---
>
> Further investigation shows that the warning is superfluous: the
> migration-disabled task is just going to be migrated to the CPU it is
> currently on. This happens because, during load balance, if the task is
> not allowed to run on dst_cpu, we re-select a new_dst_cpu as a candidate.
> If no task can be balanced to dst_cpu, we then try to balance the task to
> new_dst_cpu instead. A migration-disabled task that is not on CPU is only
> allowed to run on its current CPU, so load balance selects that CPU as
> new_dst_cpu and later triggers the warning above.
>
> This patch solves this by not selecting the task's current CPU as
> new_dst_cpu during load balance.
>
> Signed-off-by: Yicong Yang <yangyicong@xxxxxxxxxxxxx>
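
For anyone following along, the re-selection described above can be
modelled in a few lines of user-space C. This is only a sketch under
stated assumptions, not the kernel code: struct sketch_task,
pick_new_dst_cpu(), migration_disabled_task() and the skip_current flag
are hypothetical names, and a 64-bit mask stands in for a cpumask.

```c
#include <assert.h>
#include <stdint.h>

#define NO_CPU (-1)

/* Hypothetical stand-in for a task: only what the sketch needs. */
struct sketch_task {
	uint64_t allowed;	/* bit n set => task may run on CPU n */
	int cur_cpu;		/* CPU whose runqueue holds the task */
};

/*
 * Pick a fallback destination among the candidate CPUs (a bitmask).
 * skip_current == 0 models the behaviour described in the report;
 * skip_current == 1 models the idea of the fix: never return the CPU
 * the task is already on.
 */
int pick_new_dst_cpu(const struct sketch_task *p, uint64_t candidates,
		     int skip_current)
{
	for (int cpu = 0; cpu < 64; cpu++) {
		uint64_t bit = UINT64_C(1) << cpu;

		if (!(candidates & bit))
			continue;	/* not a candidate CPU */
		if (!(p->allowed & bit))
			continue;	/* task may not run there */
		if (skip_current && cpu == p->cur_cpu)
			continue;	/* "migrating" here is a no-op */
		return cpu;
	}
	return NO_CPU;
}

/* Assumption: a migration-disabled task is allowed on its current CPU only. */
struct sketch_task migration_disabled_task(int cpu)
{
	assert(cpu >= 0 && cpu < 64);
	struct sketch_task p = {
		.allowed = UINT64_C(1) << cpu,
		.cur_cpu = cpu,
	};
	return p;
}
```

With a task built by migration_disabled_task(3) and candidates covering
CPUs 3 and 4, pick_new_dst_cpu() returns 3 when skip_current is 0 (the
self-migration that trips the warning) and NO_CPU when it is 1.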

Thanks!

Reviewed-by: Valentin Schneider <vschneid@xxxxxxxxxx>