Re: [PATCH v2 1/2] maple_tree: Disable mas_wr_append() when other readers are possible

From: Paul E. McKenney
Date: Wed Sep 06 2023 - 14:03:00 EST


On Wed, Sep 06, 2023 at 01:29:54PM -0400, Liam R. Howlett wrote:
> * Paul E. McKenney <paulmck@xxxxxxxxxx> [230906 13:24]:
> > On Wed, Sep 06, 2023 at 11:23:25AM -0400, Liam R. Howlett wrote:
> > > (Adding Paul & Shanker to Cc list.. please see below for why)
> > >
> > > Apologies on the late response, I was away and have been struggling to
> > > get a working PPC32 test environment.
> > >
> > > * Geert Uytterhoeven <geert@xxxxxxxxxxxxxx> [230829 12:42]:
> > > > Hi Liam,
> > > >
> > > > On Fri, 18 Aug 2023, Liam R. Howlett wrote:
> > > > > The current implementation of append may cause duplicate data and/or
> > > > > incorrect ranges to be returned to a reader during an update. Although
> > > > > this has not been reported or seen, disable the append write operation
> > > > > while the tree is in rcu mode out of an abundance of caution.
> > >
> > > ...
> > > > >
> > > > > Fixes: 54a611b60590 ("Maple Tree: add new data structure")
> > > > > Cc: stable@xxxxxxxxxxxxxxx
> > > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> > > >
> > > > Thanks for your patch, which is now commit cfeb6ae8bcb96ccf
> > > > ("maple_tree: disable mas_wr_append() when other readers are
> > > > possible") in v6.5, and is being backported to stable.
> > > >
> > > > On Renesas RZ/A1 and RZ/A2 (single-core Cortex-A9), this causes the
> > > > following warning:
> > > >
> > > > clocksource: timer@e803b000: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 28958491609 ns
> > > > sched_clock: 32 bits at 66MHz, resolution 15ns, wraps every 32537631224ns
> > > > /soc/timer@e803b000: used for clocksource
> > > > /soc/timer@e803c000: used for clock events
> > > > +------------[ cut here ]------------
> > > > +WARNING: CPU: 0 PID: 0 at init/main.c:992 start_kernel+0x2f0/0x480
> > > > +Interrupts were enabled early
> > >
> > > Note that the maple tree is involved in tracking the interrupts, see
> > > kernel/irq/irqdesc.c irq_insert_desc(), etc.
> > >
> > > > +CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0-rza2mevb-10197-g99b80d6b92b5 #237
> > >
> > > I cannot find commit id 99b80d6b92b5.
> > >
> > > > +Hardware name: Generic R7S9210 (Flattened Device Tree)
> > > > + unwind_backtrace from show_stack+0x10/0x14
> > > > + show_stack from dump_stack_lvl+0x24/0x3c
> > > > + dump_stack_lvl from __warn+0x74/0xb8
> > > > + __warn from warn_slowpath_fmt+0x78/0xb0
> > > > + warn_slowpath_fmt from start_kernel+0x2f0/0x480
> > > > + start_kernel from 0x0
> > > > +---[ end trace 0000000000000000 ]---
> > > > Console: colour dummy device 80x30
> > > > printk: console [tty0] enabled
> > > > Calibrating delay loop (skipped) preset value.. 1056.00 BogoMIPS (lpj=5280000)
> > > >
> > > > Reverting this commit fixes the issue.
> > >
> > > I have set up testing with qemu for powerpc 32b, and reverting this
> > > patch does not fix it for me. Did you revert the patch or bisect to the
> > > issue?
> > >
> > > It also happens on 0e0e9bd5f7b9 (I ran git checkout cfeb6ae8bcb96ccf^ to
> > > get the commit immediately before cfeb6ae8bcb96ccf). My qemu command is
> > > as follows:
> > >
> > > qemu-system-ppc -L pc-bios -boot c -prom-env "boot-device=hd:,\yaboot"
> > > -prom-env "boot-args=conf=hd:,\yaboot.conf" -M mac99,via=pmu -m 2048
> > > -hda powerpc32.img -nographic -kernel <file>
> > >
> > >
> > > >
> > > > RCU-related configs:
> > > >
> > > > $ grep RCU .config
> > > > # RCU Subsystem
> > > > CONFIG_TINY_RCU=y
> > > > # CONFIG_RCU_EXPERT is not set
> > > > CONFIG_TINY_SRCU=y
> > > > # end of RCU Subsystem
> > > > # RCU Debugging
> > > > # CONFIG_RCU_SCALE_TEST is not set
> > > > # CONFIG_RCU_TORTURE_TEST is not set
> > > > # CONFIG_RCU_REF_SCALE_TEST is not set
> > > > # CONFIG_RCU_TRACE is not set
> > > > # CONFIG_RCU_EQS_DEBUG is not set
> > > > # end of RCU Debugging
> > >
> > > I used the configuration from debian 8 and ran 'make oldconfig' to build
> > > my kernel. I have attached the configuration.
> > >
> > > >
> > > > CONFIG_MAPLE_RCU_DISABLED is not defined (and should BTW be renamed,
> > > > as CONFIG_* is reserved for kernel configuration options).
> > >
> > > Thanks, I'll add it to my list of work.
> > >
> > > >
> > > > I do not see this issue on any other platform
> > > > (arm/arm64/risc-v/mips/sh/m68k), several of them use the same
> > > > RCU configuration.
> > > >
> > > > Do you have a clue?
> > >
> > > It appears to be something to do with struct maple_tree sparse_irqs. If
> > > you drop the rcu flag from that maple tree, then my configuration boots
> > > without the warning.
> > >
> > > I *think* this is because we will reuse a lot more nodes. And I *think*
> > > the rcu flag is not needed, since there is a comment about reading the
> > > tree being protected by the mutex sparse_irq_lock within the
> > > kernel/irq/irqdesc.c file. Shanker, can you comment on that?
> > >
> > > I wonder if there is a limit to the number of RCU free events before
> > > something is triggered to flush them out which could trigger IRQ
> > > enabling? Paul, could this be the case?
> >
> > Are you asking if call_rcu() will re-enable interrupts in the following
> > use case?
> >
> > local_irq_disable();
> > call_rcu(&p->rh, my_cb_func);
> > local_irq_enable();
> >
> > If so, the answer is "no" (and yes, there might be a bug, but then again
> > I just double-checked). However, if interrupts are enabled across a
> > call_rcu() invocation, it will do some additional work to encourage
> > the grace period. Even if interrupts are disabled across a call_rcu()
> > invocation, it will still do either a raise_softirq() or a wakeup,
> > depending on configuration, to encourage the grace period. And in
> > the case of call_rcu() from an interrupt handler, that raise_softirq()
> > could result in the RCU_SOFTIRQ handler running upon return from that
> > interrupt, and if there are a great many callbacks ready to invoke,
> > this RCU_SOFTIRQ handler might take quite some time. Plus that
> > handler could itself be interrupted.
> >
> > If long-running RCU_SOFTIRQ handlers are a problem, you can boot with
> > rcutree.use_softirq=0, which puts that handler into a kthread, which
> > in turn give the scheduler more control. However, this will also turn
> > a lightweight raise_softirq() into a rather heavier weight wakeup.
> > Choose wisely! ;-)
> >
> > Or am I missing your point?
>
> This is very early in the boot sequence when interrupts have not been
> enabled. What we are seeing is a WARN_ON() that is triggered by
> interrupts being enabled before they should be enabled.
>
> I was wondering if, for example, I called call_rcu() a lot *before*
> interrupts were enabled, that something could trigger that would either
> enable interrupts or indicate the task needs rescheduling?

You aren't doing call_rcu() enough to hit OOM, are you? The actual RCU
callback invocations won't happen until some time after the scheduler
starts up.

> Specifically the rescheduling part is suspect. I tracked down the call
> to a mutex_lock() which calls cond_resched(), so could rcu be
> 'encouraging' the rcu window by a reschedule request?

During boot before interrupts are enabled, RCU has not yet spawned any of
its kthreads. Therefore, all of its attempts to do wakeups would notice
a NULL task_struct pointer and refrain from actually doing the wakeup.
If it did do the wakeup, you would see a NULL-pointer exception. See
for example, invoke_rcu_core_kthread(), though that won't happen unless
you booted with rcutree.use_softirq=0.

Besides, since when did doing a wakeup enable interrupts? That would
make it hard to do wakeups from hardware interrupt handlers, not?

But why not put some WARN_ON_ONCE(!irqs_disabled()) calls in the areas
of greatest suspicion, starting from the stack trace generated by that
mutex_lock()? A stray interrupt-enable could be pretty much anywhere.

But where are those call_rcu() invocations? Before rcu_init()?
Presumably before init is spawned and the early_init() calls.

And what is the RCU-related Kconfig and boot-parameter setup?

Thanx, Paul