Re: [RFC PATCH 6/7] sched/fair: fix tasks_rcu with task based throttle
From: Aaron Lu
Date: Tue Apr 01 2025 - 04:49:13 EST
On Tue, Apr 01, 2025 at 08:47:02AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 3/31/2025 11:49 AM, Aaron Lu wrote:
> > > > Tasks throttled on the exit-to-user path are scheduled by cond_resched()
> > > > in task_work_run(), but that is a preempt schedule and does not mark a
> > > > Tasks-RCU quiescent state.
>
> So browsing through kernel/rcu/tasks.h, I found
> cond_resched_tasks_rcu_qs(), which basically clears the task's holdout
> state before calling schedule(). The question is, is it safe to use in
> the task_work_run() context? I have no idea, but you could do a
> resched_curr() followed by a cond_resched_tasks_rcu_qs() in
> throttle_cfs_rq_work(), and that should give you the same effect as
> doing a schedule().
>
> Thoughts?
>
Looks good to me.
> > > >
> > > > Fix this by directly calling schedule() in throttle_cfs_rq_work().
> > > Perhaps that can be gotten around by just using set_ti_thread_flag();
> > > resched_curr() will also call set_preempt_need_resched(), which allows
> > > cond_resched() to reschedule the task.
> > >
> > > Since exit_to_user_mode_loop() will run once again seeing that
> > > TIF_NEED_RESCHED is set, schedule() should follow soon. Thoughts?
> > >
> > I tried this and noticed an unexpected consequence: get_signal() will
> > also run task works, and if the signal is a kill, the following can
> > happen: exit_to_user_mode_loop() -> get_signal() ->
> > throttle_task_work() -> do_exit() -> exit_signals() ->
> > percpu_rwsem_wait() -> schedule() -> try_to_block_task() ->
> > dequeue_task_fair().
> >
> > I would like to avoid this path, at least for now. I want to make sure
> > that for throttled tasks, only events like a task group change, an
> > affinity change, etc. can cause them to be dequeued from the core;
> > that's why I added SCHED_WARN_ON(flags & DEQUEUE_SLEEP) in
> > dequeue_task_fair(). I think this can help me capture any unexpected
> > events like this one.
> >
> > Besides, I think explicitly calling schedule() has the advantage of not
> > relying on other code, i.e. it doesn't matter if someone later removes
> > the cond_resched() in task_work_run() or changes the logic in
> > exit_to_user_mode_loop().
> >
> > So I hope you don't mind if I keep schedule() in
> > throttle_cfs_rq_work(), but if you see anything wrong with doing this,
> > feel free to let me know, thanks.
>
> I don't have any strong feelings. Just that the open-coded schedule()
> stuck out like a sore thumb, and since you mention future changes, the
> "local_irq_enable_exit_to_user(ti_work)" could perhaps one day be
> extended to not disable IRQs in exit_to_user_mode_loop(), in which case
> a direct call to schedule() could cause "scheduling while atomic"
> warnings.
>
> IMO using cond_resched_tasks_rcu_qs() is cleaner, and it makes it
> obvious that the throttle work wants to call schedule() ASAP while also
> clearing the RCU holdout, but I'll wait for others to comment on
> whether it is safe to do so or whether we are missing something.
I'm running some tests and will follow your suggestion if no problems
are found. Thanks for the suggestion.