Re: [PATCH 0/5] srcu fixes
From: Frederic Weisbecker
Date: Wed Oct 04 2023 - 05:25:42 EST
On Tue, Oct 03, 2023 at 05:35:31PM -0700, Paul E. McKenney wrote:
> On Wed, Oct 04, 2023 at 01:28:58AM +0200, Frederic Weisbecker wrote:
> > Hi,
> >
> > This contains a fix for "SRCU: kworker hung in synchronize_srcu":
> >
> > http://lore.kernel.org/CANZk6aR+CqZaqmMWrC2eRRPY12qAZnDZLwLnHZbNi=xXMB401g@xxxxxxxxxxxxxx
> >
> > And a few cleanups.
> >
> > Passed 50 hours of SRCU-P and SRCU-N.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > srcu/fixes
> >
> > HEAD: 7ea5adc5673b42ef06e811dca75e43d558cc87e0
> >
> > Thanks,
> > Frederic
>
> Very good, and a big "Thank You!!!" to all of you!
>
> I queued this series for testing purposes, and have started a bunch of
> SRCU-P and SRCU-N tests on one set of systems, and a single SRCU-P and
> SRCU-N on another system, but with both scenarios resized to 40 CPUs each.
>
> While that is in flight, a few questions:
>
> o Please check the Co-developed-by rules. Last I knew, it was
> necessary to have a Signed-off-by after each Co-developed-by.
Indeed! I'll try to collect the three of them within a few days. If some
are missing, I'll put a Reported-by instead.
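For reference, a sketch of the tag ordering this rule implies, as described
in Documentation/process/submitting-patches.rst: each Co-developed-by is
immediately followed by that co-author's Signed-off-by, and the submitter's
Signed-off-by comes last. The names and addresses below are placeholders,
not the actual co-developers of this series:

```
Co-developed-by: First Co-Author <first@example.org>
Signed-off-by: First Co-Author <first@example.org>
Co-developed-by: Second Co-Author <second@example.org>
Signed-off-by: Second Co-Author <second@example.org>
Signed-off-by: Patch Submitter <submitter@example.org>
```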
>
> o Is it possible to get a Tested-by from the original reporter?
> Or is this not reproducible?
It seems that the issue triggers only rarely. But I hope we can get one.
>
> o Is it possible to convince rcutorture to find this sort of
> bug? Seems like it should be, but easy to say...
So at least the part where advance/accelerate fails is observed from time
to time. But then two more rare events must also occur:
1) The CPU failing to ACC/ADV must also fail to start the grace period because
another CPU was faster.
2) Callback invocation must not run until that grace period has ended (even
though a previous grace period had already completed with callbacks ready).
Or it may run after all, but at least its acceleration part has to
happen after the end of the new grace period.
Perhaps all these conditions can be met more often if we overcommit the number
of vCPUs. For example, run 10 SRCU-P instances on 3 real CPUs. This could
introduce random pauses within the torture writers...
Just an idea...
>
> o Frederic, would you like to include this in your upcoming
> pull request? Or does it need more time?
At least the first patch, yes. It should be easy to backport and
it should be enough to solve the race. I'll just wait a bit to collect
more tags.
Thanks!
>
> Thanx, Paul
>
> > ---
> >
> > Frederic Weisbecker (5):
> > srcu: Fix callbacks acceleration mishandling
> > srcu: Only accelerate on enqueue time
> > srcu: Remove superfluous callbacks advancing from srcu_start_gp()
> > srcu: No need to advance/accelerate if no callback enqueued
> > srcu: Explain why callbacks invocations can't run concurrently
> >
> >
> > kernel/rcu/srcutree.c | 55 ++++++++++++++++++++++++++++++++++++---------------
> > 1 file changed, 39 insertions(+), 16 deletions(-)