Re: [RFC PATCH 6/7] sched/fair: fix tasks_rcu with task based throttle

From: K Prateek Nayak
Date: Mon Mar 31 2025 - 23:17:27 EST


Hello Aaron,

On 3/31/2025 11:49 AM, Aaron Lu wrote:
>>> Tasks throttled on the exit-to-user path are scheduled by cond_resched()
>>> in task_work_run(), but that is a preempt schedule and doesn't mark a
>>> task RCU quiescent state.
>>
>> So browsing through kernel/rcu/tasks.h, I found
>> cond_resched_tasks_rcu_qs(), which basically clears the task's holdout
>> state before calling schedule(). The question is, is it safe to be used
>> in the task_work_run() context? I have no idea, but you can do a
>> resched_curr() followed by a cond_resched_tasks_rcu_qs() in
>> throttle_cfs_rq_work() and that should give you the same effect as doing
>> a schedule().
>>
>> Thoughts?
>>
>>> Fix this by directly calling schedule() in throttle_cfs_rq_work().
>>
>> Perhaps that can be gotten around by just using set_ti_thread_flag()?
>> resched_curr() will also call set_preempt_need_resched(), which allows
>> cond_resched() to reschedule the task.
>>
>> Since exit_to_user_mode_loop() will run once again on seeing that
>> TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?

> I tried this and noticed an unexpected consequence: get_signal() will
> also run task works and, if the signal is a kill, then this can happen:
> exit_to_user_mode_loop() -> get_signal() -> throttle_task_work() ->
> do_exit() -> exit_signals() -> percpu_rwsem_wait() -> schedule() ->
> try_to_block_task() -> dequeue_task_fair().
>
> I would like to avoid this path, at least for now. I want to make sure
> that for throttled tasks, only events like task group change, affinity
> change, etc. can cause a dequeue from the scheduler core; that's why I
> added SCHED_WARN_ON(flags & DEQUEUE_SLEEP) in dequeue_task_fair(). I
> think this can help me capture any unexpected events like this one.
>
> Besides, I think explicitly calling schedule() has the advantage of not
> relying on other code, i.e. it doesn't matter if someone removes the
> cond_resched() in task_work_run() some time later or someone changes
> the logic in exit_to_user_mode_loop().
>
> So I hope you don't mind that I keep schedule() in
> throttle_cfs_rq_work(), but if you see anything wrong with doing this,
> feel free to let me know, thanks.

I don't have any strong feelings. It's just that the open-coded schedule()
stuck out like a sore thumb, and since you mention future changes, the
"local_irq_enable_exit_to_user(ti_work)" bit could perhaps one day be
extended to not enable IRQs in exit_to_user_mode_loop(), in which case a
direct call to schedule() can cause a scheduling-while-atomic warning.

IMO, using cond_resched_tasks_rcu_qs() is cleaner, and it makes it obvious
that the throttle work wants to call schedule() asap while also clearing
the RCU holdout, but I'll wait for others to comment on whether it is safe
to do so or whether we are missing something.
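
To make that concrete, below is a rough and completely untested sketch of
the shape I have in mind for throttle_cfs_rq_work(); the dequeue/throttle
details are elided and the exact structure of the function is my
assumption from reading this series, so take it as illustration only:

/* Runs from task_work_run() in the context of the throttled task. */
static void throttle_cfs_rq_work(struct callback_head *work)
{
        struct task_struct *p = current;
        struct rq_flags rf;
        struct rq *rq;

        rq = task_rq_lock(p, &rf);

        /* ... dequeue p and mark it throttled, as this patch does ... */

        /*
         * resched_curr() sets TIF_NEED_RESCHED and calls
         * set_preempt_need_resched() for the running task, so the
         * cond_resched() below actually reschedules.
         */
        resched_curr(rq);

        task_rq_unlock(rq, p, &rf);

        /*
         * Record a Tasks-RCU quiescent state for current, then
         * cond_resched(); with need_resched set above, the task
         * schedules out right here instead of relying on the
         * cond_resched() in task_work_run().
         */
        cond_resched_tasks_rcu_qs();
}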

--
Thanks and Regards,
Prateek