Re: [PATCH RFC 1/3] sched/pe: Exclude balance callback queuing during proxy()'s migrate

From: Joel Fernandes
Date: Thu Dec 15 2022 - 18:12:57 EST


On Mon, Dec 12, 2022 at 9:39 AM Dietmar Eggemann
<dietmar.eggemann@xxxxxxx> wrote:
>
> Hi Joel,
>
> On 09/12/2022 17:52, Joel Fernandes wrote:
> > Hi Dietmar!
> >
> > On Fri, Dec 9, 2022 at 3:07 PM Dietmar Eggemann
> > <dietmar.eggemann@xxxxxxx> wrote:
> >>
> >> On 23/11/2022 02:21, Joel Fernandes (Google) wrote:
>
> [...]
>
> >> You consider rt_mutex_setprio() and __sched_setscheduler() versus
> >> proxy() but what about all the other places like load_balance(),
> >> update_blocked_averages(), __set_cpus_allowed_ptr() and many
> >> more in which we take the rq lock (via task_rq_lock() or
> >> rq_lock{_xxx}())?
> >
> > You are right. Those paths potentially need updates as well. Any
>
> IMHO, you would still have to lock the entire `queue->execute` (1)->(2)
> thing, like keeping the rq lock currently.
>
> __schedule()
>
> pick_next_task()
> pick_next_task_{rt/dl}()
> set_next_task_{rt/dl}()
> {rt/deadline}_queue_push_tasks()
> queue_balance_callback() --> (1)
>
> proxy() --- !!!
>
> finish_lock_switch()
> __balance_callbacks() <-- (2)
>
> __balance_callbacks(rq) <-- (2)

Instead of locking it throughout, I think we can keep my initial patch
and just execute the balance callbacks in proxy() before dropping the
rq lock. I *think* that will make it work properly, but I could be
missing something.

Anyway I see the issue you bring up, I took care of balance CBs queued
from *other CPUs* while the rq lock is dropped, but the current CPU
executing proxy() could itself have queued balance callbacks. Dropping
the rq lock then causes other CPUs to see the proxy() CPU's balance
CBs in the callback list.

Anyway I will try the above and get back to you. Thanks so much again
for the insights. Will test as you suggested below.

Thanks,

- Joel


> Otherwise. something like this could happen:
>
> With `x-x` : smp_processor_id()-cpu_of(rq)
>
> lock_torture_wr-1745 [003] 338.270963: queue_balance_callback: 3-3->
> lock_torture_wr-1745 [003] 338.270965: queue_balance_callback: 3-3<-
> cat-1726 [001] 338.270969: __schedule: proxy() 1-1->
> lock_torture_wr-1745 [003] 338.270971: __schedule: proxy() 3-3->
> cat-1726 [001] 338.270972: __schedule: proxy() 1-1<-
> lock_torture_wr-1745 [003] 338.270979: __schedule: proxy() 3-3<-
> <idle>-0 [002] 338.270981: __schedule: proxy() 2-2->
> <idle>-0 [002] 338.270984: __schedule: proxy() 2-2<-
> lock_torture_wr-1745 [003] 338.270985: __schedule: proxy() 3-3->
> migration/4-34 [004] 338.270989: active_load_balance_cpu_stop: rq_pin_lock() 4-3 <-- ! cb on CPU3 still enqueued
>
> > chance you could post stack traces or logs of those issues, just in
> > case they have new nuggets of information? If you don't have them,
> > don't bother, I'll reproduce it.
>
> insmod /lib/modules/torture.ko random_shuffle=1 lock_torture_writer_fifo=1
> insmod /lib/modules/locktorture.ko torture_type=mutex_lock nlocks=3
>
> [ 447.046916] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [ 447.046926] WARNING: CPU: 1 PID: 1648 at kernel/sched/sched.h:1583 task_rq_lock+0x148/0x170
> [ 447.062914] Modules linked in: locktorture(O) torture(O)
> [ 447.068254] CPU: 1 PID: 1648 Comm: torture_shuffle Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
> [ 447.079168] Hardware name: ARM Juno development board (r0) (DT)
> [ 447.085106] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 447.092095] pc : task_rq_lock+0x148/0x170
> [ 447.096121] lr : task_rq_lock+0x148/0x170
> [ 447.100146] sp : ffff80000b85bd30
> ...
> [ 447.175138] Call trace:
> [ 447.177589] task_rq_lock+0x148/0x170
> [ 447.181267] __set_cpus_allowed_ptr+0x34/0xc0
> [ 447.185643] set_cpus_allowed_ptr+0x30/0x60
> [ 447.189843] torture_shuffle+0x158/0x224 [torture]
> [ 447.194666] kthread+0x10c/0x110
> [ 447.197906] ret_from_fork+0x10/0x20
> ...
> [ 447.233560] ---[ end trace 0000000000000000 ]---
>
>
> [ 446.542532] ------------[ cut here ]------------
> [ 446.553224] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [ 446.553243] WARNING: CPU: 3 PID: 0 at kernel/sched/sched.h:1583 update_blocked_averages+0x784/0x78c
> [ 446.576089] Modules linked in: locktorture(O+) torture(O)
> [ 446.581551] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #203
> [ 446.591723] Hardware name: ARM Juno development board (r0) (DT)
> [ 446.597691] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 446.604712] pc : update_blocked_averages+0x784/0x78c
> [ 446.609723] lr : update_blocked_averages+0x784/0x78c
> [ 446.614733] sp : ffff80000b403b70
> ...
> [ 446.690142] Call trace:
> [ 446.692609] update_blocked_averages+0x784/0x78c
> [ 446.697270] newidle_balance+0x184/0x5f0
> [ 446.701232] pick_next_task_fair+0x2c/0x500
> [ 446.705456] __schedule+0x1d4/0x1084
> [ 446.709070] schedule_idle+0x28/0x4c
> [ 446.712682] do_idle+0x1d4/0x2d0
> [ 446.715946] cpu_startup_entry+0x28/0x30
> [ 446.719909] secondary_start_kernel+0x138/0x15c
> [ 446.724486] __secondary_switched+0xb0/0xb4
> ...
> [ 446.765848] ---[ end trace 0000000000000000 ]---
>
>
> [ 95.091675] ------------[ cut here ]------------
> [ 95.096301] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [ 95.096322] WARNING: CPU: 5 PID: 39 at kernel/sched/sched.h:1583 load_balance+0x644/0xdc0
> [ 95.103135] mutex_lock-torture: Creating lock_torture_writer task
> [ 95.103238] mutex_lock-torture: lock_torture_writer task started
> [ 95.110692] Modules linked in: locktorture(O+) torture(O)
> [ 95.136699] CPU: 5 PID: 39 Comm: migration/5 Tainted: G O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
> [ 95.147137] Hardware name: ARM Juno development board (r0) (DT)
> [ 95.153105] Stopper: 0x0 <- 0x0
> [ 95.156282] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 95.163306] pc : load_balance+0x644/0xdc0
> [ 95.167356] lr : load_balance+0x644/0xdc0
> [ 95.171405] sp : ffff80000b4cbaf0
> ...
> [ 95.246833] Call trace:
> [ 95.249300] load_balance+0x644/0xdc0
> [ 95.253000] newidle_balance+0x290/0x6f0
> [ 95.256963] pick_next_task_fair+0x2c/0x510
> [ 95.261188] __schedule+0x1d4/0x1084
> [ 95.264801] schedule+0x64/0x11c
> [ 95.268063] smpboot_thread_fn+0xa4/0x270
> [ 95.272115] kthread+0x10c/0x110
> [ 95.275375] ret_from_fork+0x10/0x20
> ...
> [ 95.316477] ---[ end trace 0000000000000000 ]---
>
>
> [ 134.893379] ------------[ cut here ]------------
> [ 134.898066] rq->balance_callback && rq->balance_callback != &balance_push_callback
> [ 134.898088] WARNING: CPU: 4 PID: 0 at kernel/sched/sched.h:1583 sched_rt_period_timer+0x1dc/0x3f0
> [ 134.914683] Modules linked in: locktorture(O) torture(O)
> [ 134.920059] CPU: 4 PID: 0 Comm: swapper/4 Tainted: G W O 6.1.0-rc2-00036-g397ce37b37a8-dirty #204
> [ 134.930235] Hardware name: ARM Juno development board (r0) (DT)
> [ 134.936205] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 134.943229] pc : sched_rt_period_timer+0x1dc/0x3f0
> [ 134.948069] lr : sched_rt_period_timer+0x1dc/0x3f0
> [ 134.952908] sp : ffff80000b2cbde0
> ...
> [ 135.028342] Call trace:
> [ 135.030810] sched_rt_period_timer+0x1dc/0x3f0
> [ 135.035300] __hrtimer_run_queues+0x184/0x504
> [ 135.039700] hrtimer_interrupt+0xe8/0x244
> [ 135.043749] arch_timer_handler_phys+0x2c/0x44
> [ 135.048239] handle_percpu_devid_irq+0x8c/0x150
> [ 135.052815] generic_handle_domain_irq+0x2c/0x44
> [ 135.057480] gic_handle_irq+0x44/0xc4
> [ 135.061180] call_on_irq_stack+0x2c/0x5c
> [ 135.065143] do_interrupt_handler+0x80/0x84
> ...
> [ 135.141122] ---[ end trace 0000000000000000 ]---
>
> >> With your changes to locktorture in {2-3}/3 you still run CFS
> >> lock_torture_writers but you should see the issue popping up
> >> in __set_cpus_allowed_ptr() (from torture_shuffle()) for example.
> >>
> >> Tried with:
> >>
> >> insmod /lib/modules/torture.ko
> >> insmod /lib/modules/locktorture.ko torture_type=mutex_lock rt_boost=1 rt_boost_factor=1 nlocks=3
> >> ^^^^^^^^^^^^^^^^^
> >>
> >> When changing all lock_torture_writer's to FIFO it becomes even
> >> more visible.
> >
> > Ok, thank you, I will make it more aggressively set to RT. The
> > previous locktorture was setting RT once every minute or so, at least
> > now that is down to 10 seconds ;-). But as you highlight with the
> > locktorture diff below, it needs to go further.
>
> Yeah, the test env is more aggressive this way and you spot potential
> issues quicker.
>
> > Anyway, this is going to be a nice holiday/Christmas project for me,
> > so thank you in advance for the holiday gift :-) ^_^
>
> Enjoy ;-)