Re: [PATCH V4 2/2] rcu: Update jiffies in rcu_cpu_stall_reset()

From: Huacai Chen
Date: Thu Aug 17 2023 - 00:05:31 EST


Hi, Alan,

On Thu, Aug 17, 2023 at 12:52 AM Alan Huang <mmpgouride@xxxxxxxxx> wrote:
>
> >>
> >>>>>>>>>>>
> >>>>>>>>>>> Currently rcu_cpu_stall_reset() set rcu_state.jiffies_stall to one check
> >>>>>>>>>>> period later, i.e. jiffies + rcu_jiffies_till_stall_check(). But jiffies
> >>>>>>>>>>> is only updated in the timer interrupt, so when kgdb_cpu_enter() begins
> >>>>>>>>>>> to run there may already be nearly one rcu check period after jiffies.
> >>>>>>>>>>> Since all interrupts are disabled during kgdb_cpu_enter(), jiffies will
> >>>>>>>>>>> not be updated. When kgdb_cpu_enter() returns, rcu_state.jiffies_stall
> >>>>>>>>>>> maybe already gets timeout.
> >>>>>>>>>>>
> >>>>>>>>>>> We can set rcu_state.jiffies_stall to two rcu check periods later, e.g.
> >>>>>>>>>>> jiffies + (rcu_jiffies_till_stall_check() * 2) in rcu_cpu_stall_reset()
> >>>>>>>>>>> to avoid this problem. But this isn't a complete solution because kgdb
> >>>>>>>>>>> may take a very long time in irq disabled context.
> >>>>>>>>>>>
> >>>>>>>>>>> Instead, update jiffies at the beginning of rcu_cpu_stall_reset() can
> >>>>>>>>>>> solve all kinds of problems.
> >>>>>>>>>>
> >>>>>>>>>> Would it make sense for there to be a kgdb_cpu_exit()? In that case,
> >>>>>>>>>> the stalls could simply be suppressed at the beginning of the debug
> >>>>>>>>>> session and re-enabled upon exit, as is currently done for sysrq output
> >>>>>>>>>> via rcu_sysrq_start() and rcu_sysrq_end().
> >>>>>>>>> Thank you for your advice, but that doesn't help. Because
> >>>>>>>>> rcu_sysrq_start() and rcu_sysrq_end() try to suppress the warnings
> >>>>>>>>> during sysrq, but kgdb already has no warnings during kgdb_cpu_enter()
> >>>>>>>>> since it is executed in irq disabled context. Instead, this patch
> >>>>>>>>> wants to suppress the warnings *after* kgdb_cpu_enter() due to a very
> >>>>>>>>> old jiffies value.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Hello, Huacai
> >>>>>>>>
> >>>>>>>> Is it possible to set the rcu_cpu_stall_suppress is true in
> >>>>>>>> dbg_touch_watchdogs()
> >>>>>>>> and reset the rcu_cpu_stall_suppress at the beginning and end of the
> >>>>>>>> RCU grace period?
> >>>>>>> This is possible but not the best: 1, kgdb is not the only caller of
> >>>>>>> rcu_cpu_stall_reset(); 2, it is difficult to find the "end" to reset
> >>>>>>> rcu_cpu_stall_suppress.
> >>>>>>>
> >>>>>>
> >>>>>> You can replace rcu_state.jiffies_stall update by setting rcu_cpu_stall_suppress
> >>>>>> in rcu_cpu_stall_reset(), and reset rcu_cpu_stall_suppress in rcu_gp_init() and
> >>>>>> rcu_gp_cleanup().
> >>>>> What's the advantage compared with updating jiffies? Updating jiffies
> >>>>> seems more straight forward.
> >>>>>
> >>>>
> >>>> In do_update_jiffies_64(), need to acquire jiffies_lock raw spinlock,
> >>>> like you said, kgdb is not the only caller of rcu_cpu_stall_reset(),
> >>>> the rcu_cpu_stall_reset() maybe invoke in NMI (arch/x86/platform/uv/uv_nmi.c)
> >>> Reset rcu_cpu_stall_suppress in rcu_gp_init()/rcu_gp_cleanup() is
> >>> still not so good to me, because it does a useless operation in most
> >>> cases. Moreover, the rcu core is refactored again and again, something
> >>> may be changed in future.
> >>>
> >>> If do_update_jiffies_64() cannot be used in NMI context, can we
> >>
> >> What about updating jiffies in dbg_touch_watchdogs or adding a wrapper which updates
> >> both jiffies and jiffies_stall?
> > This can solve the kgdb problem, but I found that most callers of
> > rcu_cpu_stall_reset() are in irq disabled context so they may meet
>
> The duration of other contexts where interrupts are disabled may not be as long as in the case of kgdb?
>
> > similar problems. Modifying rcu_cpu_stall_reset() can solve all of
> > them.
> >
> > But due to the NMI issue, from my point of view, setting jiffies_stall
> > to jiffies + 300*HZ is the best solution now. :)
>
> If I understand correctly, the NMI issue is the deadlock issue? If so, plus the short duration of other irq disabled
> contexts, it’s ok just update jiffies in dbg_touch_watchdogs.
>
> Please correct me if anything wrong. :)
The timeout value can be configured as short as 3 seconds, in this
case other callers may also have problems.

Huacai
>
> >
> > Huacai
> >>
> >>> consider my old method [1]?
> >>> https://lore.kernel.org/rcu/CAAhV-H7j9Y=VvRLm8thLw-EX1PGqBA9YfT4G1AN7ucYS=iP+DQ@xxxxxxxxxxxxxx/T/#t
> >>>
> >>> Of course we should set rcu_state.jiffies_stall large enough, so we
> >>> can do like this:
> >>>
> >>> void rcu_cpu_stall_reset(void)
> >>> {
> >>> WRITE_ONCE(rcu_state.jiffies_stall,
> >>> - jiffies + rcu_jiffies_till_stall_check());
> >>> + jiffies + 300 * HZ);
> >>> }
> >>>
> >>> 300s is the largest timeout value, and I think 300s is enough here in practice.
> >>>
> >>> Huacai
> >>>
> >>>>
> >>>> Thanks
> >>>> Zqiang
> >>>>
> >>>>
> >>>>> Huacai
> >>>>>
> >>>>>>
> >>>>>> Thanks
> >>>>>> Zqiang
> >>>>>>
> >>>>>>>
> >>>>>>>> or set rcupdate.rcu_cpu_stall_suppress_at_boot=1 in bootargs can
> >>>>>>>> suppress RCU stall
> >>>>>>>> in booting.
> >>>>>>> This is also possible, but it suppresses all kinds of stall warnings,
> >>>>>>> which is not what we want.
> >>>>>>>
> >>>>>>> Huacai
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks
> >>>>>>>> Zqiang
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Huacai
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanx, Paul
> >>>>>>>>>>
> >>>>>>>>>>> Cc: stable@xxxxxxxxxxxxxxx
> >>>>>>>>>>> Fixes: a80be428fbc1f1f3bc9ed924 ("rcu: Do not disable GP stall detection in rcu_cpu_stall_reset()")
> >>>>>>>>>>> Reported-off-by: Binbin Zhou <zhoubinbin@xxxxxxxxxxx>
> >>>>>>>>>>> Signed-off-by: Huacai Chen <chenhuacai@xxxxxxxxxxx>
> >>>>>>>>>>> ---
> >>>>>>>>>>> kernel/rcu/tree_stall.h | 1 +
> >>>>>>>>>>> 1 file changed, 1 insertion(+)
> >>>>>>>>>>>
> >>>>>>>>>>> diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
> >>>>>>>>>>> index b10b8349bb2a..1c7b540985bf 100644
> >>>>>>>>>>> --- a/kernel/rcu/tree_stall.h
> >>>>>>>>>>> +++ b/kernel/rcu/tree_stall.h
> >>>>>>>>>>> @@ -153,6 +153,7 @@ static void panic_on_rcu_stall(void)
> >>>>>>>>>>> */
> >>>>>>>>>>> void rcu_cpu_stall_reset(void)
> >>>>>>>>>>> {
> >>>>>>>>>>> + do_update_jiffies_64(ktime_get());
> >>>>>>>>>>> WRITE_ONCE(rcu_state.jiffies_stall,
> >>>>>>>>>>> jiffies + rcu_jiffies_till_stall_check());
> >>>>>>>>>>> }
> >>>>>>>>>>> --
> >>>>>>>>>>> 2.39.3
>
>