Re: [PATCH sched_ext/for-6.12] sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable

From: Tejun Heo
Date: Wed Sep 04 2024 - 16:24:09 EST


On Fri, Aug 30, 2024 at 10:02:34PM -1000, Tejun Heo wrote:
> During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task()
> on every task. To do this, it does get_task_struct() on each iterated task,
> drops the lock and then calls ops.init_task().
>
> However, a TASK_DEAD task may already have lost all its usage count and be
> waiting for an RCU grace period to be freed. If get_task_struct() is called
> on such a task, use-after-free can happen. To avoid such situations,
> scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe
> as they are never going to be scheduled again.
>
> Unfortunately, a racing sched_setscheduler(2) can grab the task before the
> task is unhashed and then continue to e.g. move the task from RT to SCX
> after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't
> gone through scx_ops_init_task(), scx_ops_enable_task() called from
> switching_to_scx() triggers the following warning:
>
> sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
> WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
> ...
> RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
> ...
> switching_to_scx+0x13/0xa0
> __sched_setscheduler+0x84e/0xa50
> do_sched_setscheduler+0x104/0x1c0
> __x64_sys_sched_setscheduler+0x18/0x30
> do_syscall_64+0x7b/0x140
> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>
> As in the ops_disable path, it just doesn't seem like a good idea to leave
> any task in an inconsistent state, even when the task is dead. The root
> cause is that ops_enable could not reliably tell whether a task is truly
> dead (no one else is looking at it and it's about to be freed) and was
> testing TASK_DEAD instead. Fix it by testing the task's usage count
> directly.
>
> - ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all
> tasks, @include_dead is removed from scx_task_iter_next_locked() along
> with dead task filtering.
>
> - tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct()
> fails.
>
> Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
> Cc: David Vernet <void@xxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
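
For readers following along, the "skip iff tryget fails" rule can be
modeled in userspace roughly as below. This is only a sketch under stated
assumptions, not the kernel code: struct fake_task, fake_tryget_task() and
fake_put_task() are hypothetical names, and C11 atomics stand in for the
kernel's refcount primitives. The point it illustrates is that a reference
is taken only while the usage count is still non-zero, so an object that
has already dropped to zero is never resurrected.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; only the usage count is modeled. */
struct fake_task {
	atomic_int usage;
};

/*
 * Take a reference only if at least one reference is still held.
 * Returns the task on success, NULL if the count had already reached zero.
 */
static struct fake_task *fake_tryget_task(struct fake_task *t)
{
	int old = atomic_load(&t->usage);

	while (old != 0) {
		/* On failure, 'old' is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&t->usage, &old, old + 1))
			return t;
	}
	return NULL;
}

/* Drop a reference; returns true when the last reference went away. */
static bool fake_put_task(struct fake_task *t)
{
	int old = atomic_fetch_sub(&t->usage, 1);

	assert(old > 0);
	return old == 1;
}
```

An iterator built on this would call fake_tryget_task() under the lock,
skip the task when it returns NULL, and otherwise drop the lock, run the
sleepable callback, and finish with fake_put_task().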

Applied to sched_ext/for-6.12.

Thanks.

--
tejun