Re: [tip: core/rcu] rcu: Enable tick for nohz_full CPUs slow to provide expedited QS

From: Paul E. McKenney
Date: Sat Jan 25 2020 - 11:11:00 EST


On Sat, Jan 25, 2020 at 02:14:26PM +0100, Borislav Petkov wrote:
> Hi Paul,
>
> On Sat, Jan 25, 2020 at 10:42:56AM -0000, tip-bot2 for Paul E. McKenney wrote:
> > The following commit has been merged into the core/rcu branch of tip:
> >
> > Commit-ID: df1e849ae4559544ff00ff5052eefe2479750539
> > Gitweb: https://git.kernel.org/tip/df1e849ae4559544ff00ff5052eefe2479750539
> > Author: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > AuthorDate: Wed, 27 Nov 2019 16:36:45 -08:00
> > Committer: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > CommitterDate: Mon, 09 Dec 2019 12:32:59 -08:00
> >
> > rcu: Enable tick for nohz_full CPUs slow to provide expedited QS
> >
> > An expedited grace period can be stalled by a nohz_full CPU looping
> > in kernel context. This possibility is currently handled by some
> > carefully crafted checks in rcu_read_unlock_special() that enlist help
> > from ksoftirqd when permitted by the scheduler. However, it is exactly
> > these checks that require the scheduler avoid holding any of its rq or
> > pi locks across rcu_read_unlock() without also having held them across
> > the entire RCU read-side critical section.
> >
> > It would therefore be very nice if expedited grace periods could
> > handle nohz_full CPUs looping in kernel context without such checks.
> > This commit therefore adds code to the expedited grace period's wait
> > and cleanup code that forces the scheduler-clock interrupt on for CPUs
> > that fail to quickly supply a quiescent state. "Quickly" is currently
> > a hard-coded single-jiffy delay.
> >
> > Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
> > ---
> > include/linux/tick.h | 5 +++-
> > kernel/rcu/tree.h | 1 +-
> > kernel/rcu/tree_exp.h | 52 ++++++++++++++++++++++++++++++++++++------
> > 3 files changed, 50 insertions(+), 8 deletions(-)
>
> ...
>
> > @@ -460,22 +489,31 @@ static void synchronize_rcu_expedited_wait(void)
> > unsigned long jiffies_start;
> > unsigned long mask;
> > int ndetected;
> > + struct rcu_data *rdp;
> > struct rcu_node *rnp;
> > struct rcu_node *rnp_root = rcu_get_root();
> > - int ret;
> >
> > trace_rcu_exp_grace_period(rcu_state.name, rcu_exp_gp_seq_endval(), TPS("startwait"));
> > jiffies_stall = rcu_jiffies_till_stall_check();
> > jiffies_start = jiffies;
> > + if (IS_ENABLED(CONFIG_NO_HZ_FULL)) {
> > + if (synchronize_rcu_expedited_wait_once(1))
> > + return;
> > + rcu_for_each_leaf_node(rnp) {
> > + for_each_leaf_node_cpu_mask(rnp, cpu, rnp->expmask) {
> > + rdp = per_cpu_ptr(&rcu_data, cpu);
> > + if (rdp->rcu_forced_tick_exp)
> > + continue;
> > + rdp->rcu_forced_tick_exp = true;
> > + tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
> > + }
> > + }
> > + WARN_ON_ONCE(1);
>
> I'm hitting this on a big AMD box, CONFIG_NO_HZ_FULL=y.

How big? (Seriously, given that the fix may depend on the number of CPUs.)

So the problem appears to be that some of the boot-time processing
is looping in the kernel, which is preventing the grace period from
completing. One could argue that such code should be fixed, but on the
other hand, boot time is a bit special. Later in -rcu's dev branch,
there are commits that forgive this boot-time misbehavior, but this is
a bit late in process to dump all of those commits into -tip.

The RT guys might need the warning, and it was them that I was thinking
of when adding it. But let's see what works for mainline first. And
since your box was booting fine without the warning before, I bet that
it boots just fine with that warning removed.

So could you please try out the (untested) patch below?

If that works, I will re-introduce the warning with proper protection
for the merge window following this coming one.

Thanx, Paul

> Kernel is Linus master + tip/master from today.
> ...
> [ 23.094781] BTRFS info (device sdb2): disk space caching is enabled
> [ 23.094784] BTRFS info (device sdb2): has skinny extents
> [ 23.139134] BTRFS info (device sdb2): enabling ssd optimizations
> [ 23.395434] ------------[ cut here ]------------
> [ 23.402616] WARNING: CPU: 6 PID: 1921 at kernel/rcu/tree_exp.h:511 rcu_exp_wait_wake+0x11f/0x740
> [ 23.402616] Modules linked in: btrfs(E) libcrc32c(E) xor(E) hid_generic(E) usbhid(E) raid6_pq(E) sd_mod(E) crc32c_intel(E) ast(E) i2c_algo_bit(E) drm_vram_helper(E) drm_ttm_helper(E) ttm(E) ahci(E) drm_kms_helper(E) syscopyarea(E) libahci(E) sysfillrect(E) xhci_pci(E) sysimgblt(E) fb_sys_fops(E) xhci_hcd(E) libata(E) drm(E) usbcore(E) wmi(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) scsi_dh_alua(E) scsi_mod(E) efivarfs(E)
> [ 23.402630] CPU: 6 PID: 1921 Comm: kworker/6:2 Tainted: G E 5.5.0-rc7+ #1
> [ 23.402632] Workqueue: rcu_gp wait_rcu_exp_gp
> [ 23.402634] RIP: 0010:rcu_exp_wait_wake+0x11f/0x740
> [ 23.402636] Code: 3b 45 6c 7e be 48 63 05 8b 30 50 01 49 81 c5 40 02 00 00 48 8d 04 c0 48 c1 e0 06 48 05 80 b2 67 b5 49 39 c5 0f 82 72 ff ff ff <0f> 0b 49 c7 c7 00 fd 02 00 bb 01 00 00 00 4c 89 e7 e8 ab e7 ff ff
> [ 23.402637] RSP: 0018:ffffbde38a8f3e50 EFLAGS: 00010246
> [ 23.402637] RAX: ffffffffb567d8c0 RBX: ffffffffb567d6c0 RCX: 0000000000000040
> [ 23.402638] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffb567d6c0
> [ 23.402638] RBP: 0000000000000000 R08: 0000000000000002 R09: 0000000000000001
> [ 23.402639] R10: ffffbde3802cfdc0 R11: 0000000000000271 R12: 0000000000003a98
> [ 23.402640] R13: ffffffffb567d8c0 R14: 00000000000000f0 R15: 000000000002fd00
> [ 23.402640] FS: 0000000000000000(0000) GS:ffffa016cd380000(0000) knlGS:0000000000000000
> [ 23.402641] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 23.402641] CR2: 00007fc9957c5180 CR3: 0000800bb3e0a000 CR4: 0000000000340ee0
> [ 23.402642] Call Trace:
> [ 23.402647] ? sync_rcu_exp_select_cpus+0x219/0x3e0
> [ 23.402650] process_one_work+0x20b/0x400
> [ 23.402652] worker_thread+0x2d/0x3f0
> [ 23.402653] ? process_one_work+0x400/0x400
> [ 23.402655] kthread+0x10d/0x130
> [ 23.402656] ? kthread_park+0x90/0x90
> [ 23.402660] ret_from_fork+0x27/0x50
> [ 23.402663] ---[ end trace 85fb288edc35e984 ]---
> [ 24.322452] BTRFS info (device sdb2): disk space caching is enabled
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index 6935a9e..dcbd757 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -508,7 +508,6 @@ static void synchronize_rcu_expedited_wait(void)
tick_dep_set_cpu(cpu, TICK_DEP_BIT_RCU_EXP);
}
}
- WARN_ON_ONCE(1);
}

for (;;) {