Re: [External] Re: Fwd: WARNING: CPU: 13 PID: 3837105 at kernel/sched/sched.h:1561 __cfsb_csd_unthrottle+0x149/0x160

From: Hao Jia
Date: Thu Sep 07 2023 - 23:30:33 EST




On 2023/9/8 Tim Chen wrote:
On Thu, 2023-09-07 at 16:59 +0800, Hao Jia wrote:

On 2023/9/5 Peter Zijlstra wrote:
On Thu, Aug 31, 2023 at 04:48:29PM +0800, Hao Jia wrote:

If I understand correctly, rq->clock_update_flags may be set to
RQCF_ACT_SKIP after __schedule() takes the rq lock, and __schedule() may
then briefly release the rq lock, for example in newidle_balance(). If
another CPU takes this rq lock during that window, its call to
rq_clock_start_loop_update() may trigger this warning.
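
To make the sequence above concrete, here is a minimal user-space sketch of
it. It is a toy model, not kernel code: the RQCF_* values are meant to mirror
kernel/sched/sched.h, but struct toy_rq and every toy_* helper are
hypothetical stand-ins, and locking is only described in comments.

```c
#include <stdio.h>

/* Flag values intended to mirror kernel/sched/sched.h. */
#define RQCF_REQ_SKIP 0x01
#define RQCF_ACT_SKIP 0x02
#define RQCF_UPDATED  0x04

/* Hypothetical stand-in for struct rq; only the fields the model needs. */
struct toy_rq {
	unsigned int clock_update_flags;
	unsigned long clock;
	int warnings;
};

/* Model of update_rq_clock(): a no-op while ACT_SKIP is set. */
static void toy_update_rq_clock(struct toy_rq *rq)
{
	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return;
	rq->clock_update_flags |= RQCF_UPDATED;
	rq->clock++;
}

/* Model of rq_clock_start_loop_update(): warns if ACT_SKIP is already set. */
static void toy_rq_clock_start_loop_update(struct toy_rq *rq)
{
	if (rq->clock_update_flags & RQCF_ACT_SKIP) {
		rq->warnings++;
		fprintf(stderr, "WARNING: ACT_SKIP already set\n");
	}
	rq->clock_update_flags |= RQCF_ACT_SKIP;
}

/* Replays the interleaving described above; returns the warning count. */
static int toy_simulate_leak(void)
{
	struct toy_rq rq = { .clock_update_flags = RQCF_REQ_SKIP };

	/* CPU A, in __schedule() with the rq lock held: REQ_SKIP -> ACT_SKIP. */
	rq.clock_update_flags <<= 1;
	toy_update_rq_clock(&rq);	/* skipped because ACT_SKIP is set */

	/* CPU A drops the rq lock briefly (e.g. in newidle_balance()). */

	/* CPU B takes the same rq lock and starts its loop update. */
	toy_rq_clock_start_loop_update(&rq);

	return rq.warnings;
}
```

Calling toy_simulate_leak() from a main() reports one warning, because CPU B
observes the ACT_SKIP that CPU A left set across the unlock window.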

This warning check might be wrong. We need to add assert_clock_updated() to
check that the rq clock has been updated before calling
rq_clock_start_loop_update().

Maybe something like this?

Urgh, aside from it being white space mangled, I think this is entirely
going in the wrong direction.

Leaking ACT_SKIP is dodgy as heck.. it's entirely too late to think
clearly though, I'll have to try again tomorrow.

I am trying to understand why this is an ACT_SKIP leak.
Before the call to __cfsb_csd_unthrottle(), is it possible that someone
else locked the runqueue, set ACT_SKIP, and released the rq lock?
And then that someone never updated the rq clock?


Yes, we want to set rq->clock_update_flags to RQCF_ACT_SKIP to avoid
updating the rq clock multiple times in __cfsb_csd_unthrottle().

But now we have found that ACT_SKIP can leak, so we cannot unconditionally
set rq->clock_update_flags to RQCF_ACT_SKIP in rq_clock_start_loop_update().
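
As an illustration of what "not unconditionally" could look like, here is a
rough user-space sketch. It is not the patch discussed in this thread: struct
toy_rq and both toy_* helpers are hypothetical, and only the RQCF_ACT_SKIP
value is meant to match kernel/sched/sched.h. The idea is that the caller
sets ACT_SKIP only if it was clear, and clears it afterwards only if it was
the one that set it.

```c
#include <stdbool.h>

/* Value intended to mirror kernel/sched/sched.h. */
#define RQCF_ACT_SKIP 0x02

/* Hypothetical stand-in for struct rq. */
struct toy_rq {
	unsigned int clock_update_flags;
};

/*
 * Set ACT_SKIP only when it is not already set.
 * Returns true if this caller set the flag and must clear it later.
 */
static bool toy_start_loop_update_cond(struct toy_rq *rq)
{
	if (rq->clock_update_flags & RQCF_ACT_SKIP)
		return false;	/* someone else's ACT_SKIP: leave it alone */
	rq->clock_update_flags |= RQCF_ACT_SKIP;
	return true;
}

/* Clear ACT_SKIP only if this caller set it in the matching start call. */
static void toy_stop_loop_update_cond(struct toy_rq *rq, bool owned)
{
	if (owned)
		rq->clock_update_flags &= ~RQCF_ACT_SKIP;
}
```

Note that this only avoids touching a flag that another context set. It does
not by itself guarantee that the rq clock was updated by whoever set
ACT_SKIP first.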



Hi Peter,

Do you think this fix is correct? Or should we go back to the
beginning and move update_rq_clock() out of unthrottle_cfs_rq()?

If anyone who locks the runqueue and sets ACT_SKIP also updates rq_clock,
I think your change is okay. Otherwise an rq_clock update could be missed.

Thanks.

Tim