Re: [RFC][PATCH 08/10] sched/fair: Implement delayed dequeue

From: Hongyan Xia
Date: Tue Jun 04 2024 - 10:09:50 EST


Hi Peter,

On 04/06/2024 11:11, Peter Zijlstra wrote:
On Mon, Jun 03, 2024 at 08:30:43PM +0100, Luis Machado wrote:

Exchanging some information with Hongyan today, he was a bit suspicious of the uclamp
behavior with the eevdf complete series applied.

Checking the uclamp code, I see we have some refcounting tied to enqueuing/dequeuing
of tasks, and the uclamp values are organized in buckets.

Just for fun I added a few trace_printk's in uclamp_eff_value, uclamp_rq_inc_id and
uclamp_rq_dec_id.

Booting up the system with delayed_dequeue disabled and running the benchmark, I
see the uclamp bucket management pretty stable. Tasks get added to the uclamp
buckets but then get removed. At the end of the benchmark, the uclamp buckets
are (almost always) clean of tasks.

Enabling delayed dequeue, I can see the uclamp buckets slowly filling up with
tasks. At the end of the benchmark, I see uclamp buckets with 30, 40 or 50
tasks still. If I do another run, I can see 80, 100 tasks still.

I suspect refcounting might be going wrong somewhere due to delayed dequeue
tasks, but that's more of a guess right now. Hopefully that is useful
information. I'll resume investigation tomorrow.

Thank you both!!

Does the below help?

Note how dequeue_task() does uclamp_rq_dec() unconditionally, which is
then not balanced in the case below.

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3664,6 +3664,7 @@ static int ttwu_runnable(struct task_str
/* mustn't run a delayed task */
SCHED_WARN_ON(task_on_cpu(rq, p));
enqueue_task(rq, p, ENQUEUE_DELAYED);
+ uclamp_rq_inc(rq, p);
}
if (!task_on_cpu(rq, p)) {
/*

Further communication with Luis makes me investigate more on uclamp bucket leaks.

I added a separate variable uclamp_on_rq in task_struct and WARN when the state didn't match. The thing is, without your above fix, the first WARN happens on uclamp_rq_inc(), not dec(), which means a task is already added to rq uclamp and is somehow added again by mistake.

Because this is a double inc(), your fix won't fix this problem. I added the trace here when the double inc() happened, in case you can see something.

Juno-r2-Debian login: [ 176.010780] ------------[ cut here ]------------ [1083/15764]
[ 176.015414] WARNING: CPU: 5 PID: 0 at kernel/sched/core.c:1594 uclamp_rq_inc+0x188/0x1a0
[ 176.023539] Modules linked in:

[ 176.026602] CPU: 5 PID: 0 Comm: swapper/5 Not tainted 6.10.0-rc1-ga946a2b48119-dirty #1034
[ 176.034882] Hardware name: ARM Juno development board (r2) (DT)

[ 176.040815] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 176.047796] pc : uclamp_rq_inc+0x188/0x1a0

[ 176.051903] lr : enqueue_task+0x48/0x70

[ 176.055752] sp : ffffffc081fb3de0

[ 176.059073] x29: ffffffc081fb3de0 x28: 0000000000000000 x27: ffffff8800342300
[ 176.066233] x26: ffffff88084311b8 x25: ffffffc081ed2000 x24: ffffffc081ca7008
[ 176.073388] x23: 0000000000000000 x22: 00000000000000c0 x21: ffffff897efbbf40
[ 176.080543] x20: ffffff8808431180 x19: 0000000000000049 x18: 0000000000000408
[ 176.087700] x17: ffffffc8fd301000 x16: ffffffc081fb0000 x15: 000000000000afc8
[ 176.094857] x14: 0000000000000001 x13: 0000000000000004 x12: 000000020c8392c2
[ 176.102017] x11: 0000000000000800 x10: 0000000000000800 x9 : 0000000000000000
[ 176.109177] x8 : 0000000000000001 x7 : 0000000000000021 x6 : ffffffffffffff4c
[ 176.116337] x5 : 0000000000007e90 x4 : 00000000000fc33c x3 : 00000000000139b4
[ 176.123487] x2 : 0000000000000049 x1 : ffffff8808431180 x0 : ffffff897efbbf40
[ 176.130639] Call trace:

[ 176.133089] uclamp_rq_inc+0x188/0x1a0

[ 176.136849] activate_task+0x48/0x60

[ 176.140433] ttwu_do_activate+0x80/0x22c

[ 176.144364] sched_ttwu_pending+0x9c/0x120

[ 176.148470] __flush_smp_call_function_queue+0x3d0/0x568

[ 176.153792] generic_smp_call_function_single_interrupt+0x10/0x18

[ 176.159902] do_handle_IPI+0xa0/0x230

[ 176.163577] ipi_handler+0x18/0x24

[ 176.166987] handle_percpu_devid_irq+0x80/0x204

[ 176.171532] generic_handle_domain_irq+0x28/0x40

[ 176.176161] gic_handle_irq+0x58/0xd0

[ 176.179833] call_on_irq_stack+0x24/0x34

[ 176.183764] el1_interrupt+0x70/0xbc

[ 176.187348] el1h_64_irq_handler+0x14/0x1c

[ 176.191449] el1h_64_irq+0x64/0x68

[ 176.194857] arch_local_irq_enable+0x4/0x8

[ 176.198960] cpuidle_enter+0x34/0x48

[ 176.202547] do_idle+0x1c4/0x234

[ 176.205786] cpu_startup_entry+0x30/0x34

[ 176.209723] secondary_start_kernel+0x124/0x144

[ 176.214267] __secondary_switched+0x6c/0x70

[ 176.218458] ---[ end trace 0000000000000000 ]---

Or, do you mean to add a uclamp_rq_dec() instead? But even if I skip the uclamp_rq_inc() at that line (which is equivalent to doing uclamp_rq_dec() after enqueue_task()), I still get a WARN like this:

Juno-r2-Debian login: [ 77.183511] ------------[ cut here ]------------
[ 77.188157] WARNING: CPU: 2 PID: 114 at kernel/sched/core.c:1594 uclamp_rq_inc+0x188/0x1a0
[ 77.196455] Modules linked in:

[ 77.199519] CPU: 2 PID: 114 Comm: kworker/u25:3 Not tainted 6.10.0-rc1-ga946a2b48119-dirty #1035
[ 77.208325] Hardware name: ARM Juno development board (r2) (DT)

[ 77.214255] Workqueue: rpciod rpc_async_schedule

[ 77.218893] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 77.225873] pc : uclamp_rq_inc+0x188/0x1a0

[ 77.229979] lr : enqueue_task+0x44/0x7c

[ 77.233826] sp : ffffffc082af3c80

[ 77.237143] x29: ffffffc082af3c80 x28: ffffffc081cd6000 x27: 0000000000000402
[ 77.244305] x26: ffffffc8fd285000 x25: ffffffc081cbaf40 x24: ffffff8800f92300
[ 77.251466] x23: ffffff897ef3ff40 x22: 00000000000000c0 x21: ffffff897ef3ff40
[ 77.258625] x20: ffffff88028f3480 x19: 0000000000000009 x18: 0000000000000000
[ 77.265784] x17: 0000000000000000 x16: 0000000000000001 x15: 00000000000003ae
[ 77.272943] x14: 000000000003eca5 x13: 000000000003eca5 x12: 00000000000000eb
[ 77.280102] x11: 0000000000000000 x10: 0000000000001800 x9 : 0000000000000000
[ 77.287261] x8 : 0000000000000001 x7 : 0000000000000000 x6 : 000000000000003f
[ 77.294420] x5 : 0000000000000311 x4 : 0000000000000113 x3 : 000000000000017f
[ 77.301578] x2 : 0000000000000009 x1 : ffffff88028f3480 x0 : ffffff897ef3ff40
[ 77.308738] Call trace:

[ 77.311185] uclamp_rq_inc+0x188/0x1a0

[ 77.314944] activate_task+0x48/0x60

[ 77.318530] ttwu_do_activate+0x80/0x22c

[ 77.322464] try_to_wake_up+0x1e8/0x50c

[ 77.326311] wake_up_process+0x14/0x1c
[ 77.330070] kick_pool+0xb8/0x144
[ 77.333395] process_one_work+0x104/0x3a4
[ 77.337418] worker_thread+0x2b8/0x424
[ 77.341176] kthread+0xf8/0x194
[ 77.344327] ret_from_fork+0x10/0x20
[ 77.347913] ---[ end trace 0000000000000000 ]---

I pasted the trace here in case useful.

Hongyan