Re: [linus:master] [rcu/nocb] 7625926086: WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass

From: Paul E. McKenney
Date: Tue Jul 11 2023 - 12:42:06 EST


On Tue, Jul 11, 2023 at 06:06:11PM +0200, Frederic Weisbecker wrote:
> On Tue, Jul 11, 2023 at 04:48:06PM +0800, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed "WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass" on:
> >
> > commit: 7625926086765123251f765d91fc3a70617d334d ("rcu/nocb: Fix shrinker race against callback enqueuer")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > [test failed on linus/master Linux 6.5-rc1]
> > [test failed on linux-next/master fe57d0d86f03a8b2afe2869a95477d0ed1824c96]
> >
> > in testcase: trinity
> > version: trinity-i386-abe9de86-1_20230429
> > with following parameters:
> >
> > runtime: 300s
> > group: group-00
> >
> > test-description: Trinity is a linux system call fuzz tester.
> > test-url: http://codemonkey.org.uk/projects/trinity/
> >
> >
> > compiler: gcc-12
> > test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
> >
> > (please refer to attached dmesg/kmsg for entire log/backtrace):
> >
> >
> > the WARNING only happens in a low rate (10 out of 204 runs as below).
> > but since we never observed the issue on parent, and we also noticed
> > there is a WARN_ON_ONCE in code change, so just report what we observed
> > in our tests FYI.
> >
> > 5c83cedbaaad6dfe 7625926086765123251f765d91f
> > ---------------- ---------------------------
> > fail:runs %reproduction fail:runs
> > | | |
> > :202 5% 10:204 dmesg.RIP:rcu_nocb_bypass_lock
> > :202 5% 10:204 dmesg.RIP:rcu_nocb_try_bypass
> > :202 5% 10:204 dmesg.WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_bypass_lock
> > :202 5% 10:204 dmesg.WARNING:at_kernel/rcu/tree_nocb.h:#rcu_nocb_try_bypass
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > | Closes: https://lore.kernel.org/oe-lkp/202307111535.4823c255-oliver.sang@xxxxxxxxx
> >
> >
> >
> > [ 315.569574][ T16] WARNING: CPU: 1 PID: 16 at kernel/rcu/tree_nocb.h:124
> > rcu_nocb_try_bypass (kernel/rcu/tree_nocb.h:124 kernel/rcu/tree_nocb.h:513)
>
> It's:
>
> static void rcu_nocb_wait_contended(struct rcu_data *rdp)
> {
> WARN_ON_ONCE(smp_processor_id() != rdp->cpu);
> while (WARN_ON_ONCE(atomic_read(&rdp->nocb_lock_contended))) <<< HERE
> cpu_relax();
> }
>
> This is called on enqueue time by rcu_nocb_try_bypass().
>
> The places that can increment ->nocb_lock_contended are:
>
> * rcu_nocb_try_bypass() itself, but it's not re-entrant
>
> * rcu_nocb_rdp_deoffload(), but it executes on the local CPU with
> IRQs disabled, so rcu_nocb_try_bypass() can't race
>
> * lazy_rcu_shrink_scan(): that's indeed the issue here
>
> * rcu_barrier_entrain(), executed either locally with IRQs disabled
> or remotely when the target is dead, so can't race with rcu_nocb_try_bypass()
>
> * rcutree_migrate_callback(): locally with IRQs disabled, can't race
>
> So indeed lazy_rcu_shrink_scan() is the only possible culprit. But then
> I wonder, what is the purpose of this ->nocb_lock_contended if it's actually
> not an allowed behaviour?
>
> Paul, any recollection of that?

Heh!

The purpose was to see if this lock was ever contended. I guess we now
have an answer, which is "Yes, but rarely."

It looks like the victim commit increased the size of the ->nocb_lock
critical section, just enough to make this happen sometimes.

Removing the WARN_ON_ONCE() seems appropriate, especially given that
this only happens when shrinking.

Should we add something that monitors that lock's contention? It is
often the case that lock contention rises over time as new features and
optimizations are added.

Thanx, Paul