Re: [PATCH v2 11/15] sched/deadline: Mark DL server as unthrottled before enqueue

From: Daniel Bristot de Oliveira
Date: Fri Apr 05 2024 - 04:54:18 EST


On 3/13/24 02:24, Joel Fernandes (Google) wrote:
> The DL server may not have had its timer started if start_dl_timer()
> returns 0 (say the zero-laxity time has already passed). In such cases,
> mark the DL task which is about to be enqueued as not throttled and
> cancel any previous timers, then do the enqueue.
>
> This fixes the following crash:
>
> [ 9.263331] kernel BUG at kernel/sched/deadline.c:1765!
> [ 9.282382] Call Trace:
> [ 9.282767] <TASK>
> [ 9.283086] ? __die_body+0x62/0xb0
> [ 9.283602] ? die+0x9b/0xc0
> [ 9.284036] ? do_trap+0xa3/0x170
> [ 9.284528] ? enqueue_dl_entity+0x45e/0x460
> [ 9.285158] ? enqueue_dl_entity+0x45e/0x460
> [ 9.285791] ? handle_invalid_op+0x65/0x80
> [ 9.286392] ? enqueue_dl_entity+0x45e/0x460
> [ 9.287021] ? exc_invalid_op+0x2f/0x40
> [ 9.287585] ? asm_exc_invalid_op+0x16/0x20
> [ 9.288200] ? find_later_rq+0x120/0x120
> [ 9.288775] ? fair_server_init+0x40/0x40
> [ 9.289364] ? enqueue_dl_entity+0x45e/0x460
> [ 9.289989] ? find_later_rq+0x120/0x120
> [ 9.290564] dl_task_timer+0x1d7/0x2f0
> [ 9.291120] ? find_later_rq+0x120/0x120
> [ 9.291695] __run_hrtimer+0x73/0x1b0
> [ 9.292238] hrtimer_interrupt+0x216/0x2c0
> [ 9.292841] __sysvec_apic_timer_interrupt+0x53/0x140
> [ 9.293581] sysvec_apic_timer_interrupt+0x2d/0x80
> [ 9.294285] asm_sysvec_apic_timer_interrupt+0x16/0x20
>
> The crash can easily be reproduced by adding a 100ms delay as follows:
>
> +int delay_inject_count;
> +
>  static void
>  enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
>  {
> @@ -1827,6 +1830,12 @@ enqueue_dl_entity(struct sched_dl_entity *dl_se, int flags)
>  		setup_new_dl_entity(dl_se);
>  	}
>
> +	// 100ms delay every 20 enqueues.
> +	if (delay_inject_count++ > 20) {
> +		mdelay(100);
> +		delay_inject_count = 0;
> +	}
> +
>  	/*
>  	 * If we are still throttled, eg. we got replenished but are a
>  	 * zero-laxity task and still got to wait, don't enqueue.


Makes sense. I am adding this to v6 of the defer patch, since it is a fix for it.
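
For anyone following along, the ordering described in the changelog can be
sketched as a small userspace model. This is only an illustration, not the
kernel code: every toy_* name and field below is hypothetical, and the timer
is reduced to a flag. It assumes, as the changelog says, that the timer start
returns 0 when the target time has already passed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a deadline entity. */
struct toy_dl_entity {
	bool throttled;           /* models the throttled state */
	bool timer_armed;         /* models a pending replenishment timer */
	bool queued;              /* set once the entity is on the runqueue */
	long long zero_laxity_ns; /* absolute time the timer should fire at */
};

/* Models the timer start: 1 if armed, 0 if the target time has passed. */
static int toy_start_timer(struct toy_dl_entity *se, long long now_ns)
{
	if (se->zero_laxity_ns <= now_ns)
		return 0;
	se->timer_armed = true;
	return 1;
}

/* Models cancelling any previously armed timer. */
static void toy_cancel_timer(struct toy_dl_entity *se)
{
	se->timer_armed = false;
}

/* Models the final enqueue step, which must never see a throttled entity. */
static void toy_do_enqueue(struct toy_dl_entity *se)
{
	assert(!se->throttled);
	se->queued = true;
}

/* Returns true if enqueued now, false if the enqueue is left to the timer. */
static bool toy_enqueue(struct toy_dl_entity *se, long long now_ns)
{
	/* Still throttled and the timer could be armed: let the timer enqueue. */
	if (se->throttled && toy_start_timer(se, now_ns))
		return false;

	/*
	 * The timer was not started (the target time has passed), so clear
	 * the throttled state and cancel stale timers before enqueueing.
	 */
	if (se->throttled) {
		toy_cancel_timer(se);
		se->throttled = false;
	}

	toy_do_enqueue(se);
	return true;
}
```

In this model, calling toy_enqueue() on a throttled entity before its
zero-laxity time defers to the timer, while calling it after that time
clears the throttled flag first, so the assertion in toy_do_enqueue()
cannot trip.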

-- Daniel