Re: [PATCH 2/2] timers: Fix removed self-IPI on global timer's enqueue in nohz_full

From: Frederic Weisbecker
Date: Wed Mar 20 2024 - 12:15:55 EST


On Wed, Mar 20, 2024 at 04:14:24AM -0700, Paul E. McKenney wrote:
> On Tue, Mar 19, 2024 at 02:18:00AM -0700, Paul E. McKenney wrote:
> > On Tue, Mar 19, 2024 at 12:07:29AM +0100, Frederic Weisbecker wrote:
> > > While running in nohz_full mode, a task may enqueue a timer while the
> > > tick is stopped. However the only places where the timer wheel,
> > > alongside the timer migration machinery's decision, may reprogram the
> > > next event according to that new timer's expiry are the idle loop or
> > > any IRQ tail.
> > >
> > > However neither the idle task nor an interrupt may run on the CPU if it
> > > resumes busy work in userspace for a long while in full dynticks mode.
> > >
> > > To solve this, the timer enqueue path raises a self-IPI that will
> > > re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
> > > avoids potential locking inversion.
> > >
> > > This is supposed to happen both for local and global timers but commit:
> > >
> > > b2cf7507e186 ("timers: Always queue timers on the local CPU")
> > >
> > > broke the global timers case by removing the ->is_idle field handling
> > > for the global base. As a result, a global timer's enqueue may go
> > > unnoticed in nohz_full.
> > >
> > > Fix this by restoring the idle tracking of the global timer's base,
> > > allowing self-IPIs again at enqueue time.
> >
> > Testing with the previous patch (1/2 in this series) reduced the number of
> > problems by about an order of magnitude, down to two sched_tick_remote()
> > instances and one enqueue_hrtimer() instance, very good!
> >
> > I have kicked off a test including this patch. Here is hoping! ;-)
>
> And 22*100 hours of TREE07 got me one run with a sched_tick_remote()
> complaint and another run with a starved RCU grace-period kthread.
> So this is definitely getting more reliable, but still a little ways
> to go.

Right, there is clearly something else. Investigation continues...
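
For anyone following along, here is a rough user-space toy model of the
enqueue logic described in the quoted changelog. It is only a sketch and not
the kernel code: every name in it (toy_cpu, toy_timer_base, toy_tick_stop(),
toy_enqueue()) is made up for illustration, and the track_global parameter is
an assumed stand-in for whether the global base's ->is_idle state is
maintained when the tick stops.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical types, not kernel structures. */
struct toy_timer_base {
	bool is_idle;         /* base was marked idle when the tick stopped */
	uint64_t next_expiry; /* earliest expiry known to the tick code */
};

struct toy_cpu {
	bool nohz_full;
	struct toy_timer_base local;
	struct toy_timer_base global;
	unsigned int self_ipis; /* count of simulated self-IPIs */
};

/*
 * Simulate stopping the tick. track_global == false models the regression
 * described above: the global base's idle state is not recorded.
 */
void toy_tick_stop(struct toy_cpu *cpu, bool track_global)
{
	assert(cpu != NULL);
	cpu->local.is_idle = true;
	cpu->global.is_idle = track_global;
}

/*
 * Simulate enqueueing a timer on one of the CPU's bases. A self-IPI is
 * "raised" (counted) only if the new timer becomes the earliest one and
 * the base is marked idle on a nohz_full CPU. Returns true if it kicked.
 */
bool toy_enqueue(struct toy_cpu *cpu, struct toy_timer_base *base,
		 uint64_t expires)
{
	assert(cpu != NULL && base != NULL);

	if (expires >= base->next_expiry)
		return false; /* next event is unchanged, nothing to do */

	base->next_expiry = expires;

	if (!base->is_idle || !cpu->nohz_full)
		return false; /* enqueue goes unnoticed until the next IRQ/idle */

	cpu->self_ipis++; /* the IRQ tail would re-evaluate the timer wheel */
	return true;
}
```

In this model, calling toy_tick_stop(&cpu, false) and then enqueueing an
earlier timer on cpu.global raises no self-IPI, while the same enqueue after
toy_tick_stop(&cpu, true) does.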