Re: [PATCH] rcu: Check and report missed fqs timer wakeup on RCU stall

From: Neeraj Upadhyay
Date: Mon Nov 16 2020 - 11:05:45 EST


Hi Paul,

On 11/12/2020 1:01 AM, Paul E. McKenney wrote:
On Wed, Nov 11, 2020 at 07:37:37PM +0530, Neeraj Upadhyay wrote:
For a new grace period request, RCU GP kthread transitions
through following states:

a. [RCU_GP_WAIT_GPS] -> [RCU_GP_DONE_GPS]

Initial state, where GP kthread waits for a new GP start
request is RCU_GP_WAIT_GPS. On new GP request (started by
callers, for any new callbacks, by setting RCU_GP_FLAG_INIT
gp_state flag and waking up the GP kthread) GP kthread
enters RCU_GP_DONE_GPS.

b. [RCU_GP_DONE_GPS] -> [RCU_GP_ONOFF]

Grace period initialization starts in rcu_gp_init(), which
records the start of new GP in rcu_state.gp_seq and enters
into RCU_GP_ONOFF state.

c. [RCU_GP_ONOFF] -> [RCU_GP_INIT]

In RCU_GP_ONOFF state, for each leaf rnp node, GP kthread
applies the buffered online/offline information of its cpus,
from ->qsmaskinitnext to ->qsmaskinit, which is basically
the mask of cpus, which need to report quiescent state, for
the new GP to complete.

Also, in this state, an outgoing rnp's (for which all cpus
are now offline and there are no tasks blocked inside
RCU read section) or an incoming rnp's (for which first cpu
comes online) participation in the new and subsequent GP is
advertised, by propagating its qsmaskinit information up the
tree.

d. [RCU_GP_INIT] -> [RCU_GP_WAIT_FQS]

In RCU_GP_INIT state, current GPs qs mask information and new
GP seq is propagated down the tree, into all rnp nodes,
in breadth first order.

On GP initialization completion, GP kthread enters a fqs loop,
which does following things, to get quiescent state reported
for cpus, which aren't quiesce in ->qsmask:

i. For CPUs, which have entered idle since the first iteration,
report quiescent state up the tree.

ii. Based on how long the current grace period has been running
for, either, set the appropriate flags, so that sched clock
interrupt on that cpu, does qs reporting or do a resched
on that cpu.

On each iteration, it transitions through following states. The fqs
loop is terminated on GP completion, when all CPUs report their
quiescent state and all RCU readers, blocking current grace period
have completed the RCU read section.

e. [RCU_GP_WAIT_FQS] -> [RCU_GP_DOING_FQS]

GP kthread uses a timed wait (jiffies_till_first_fqs for first
iteration and jiffies_till_next_fqs for subsequent iterations),
before doing all the work, to force quiescent state on
all cpus. It wakes up from this state, either on timeout, or
when (RCU_GP_FLAG_FQS | RCU_GP_FLAG_OVLD) flags are set, either
on callback overload conditions or when last cpu reports its
quiescent state and all RCU readers blocking current GP, have
left the RCU read section.

f. [RCU_GP_CLEANUP] -> [RCU_GP_CLEANED]

Mark end of grace period, in ->gp_seq, on all rnp nodes, in breadth
first order and finally in rcu_state.

For cases where timers are not served (due to issues in timer
configuration, in timer framework or due to softirqs not getting
handled, when there is a storm of interrupts) on the CPU,
where GP kthread queued a timer (for timed wait, which is done
in RCU_GP_WAIT_FQS) its possible that RCU kthread never wakes up.
Report the same from stall warnings, if GP thread is in RCU_GP_WAIT_FQS
state, and the timeout has elapsed and the kthread is not woken.

Signed-off-by: Neeraj Upadhyay <neeraju@xxxxxxxxxxxxxx>
---
kernel/rcu/tree.c | 25 +++++++++++++++----------
kernel/rcu/tree_stall.h | 33 +++++++++++++++++++++++++++++++++
2 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 413831b..804e543 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1767,7 +1767,7 @@ static bool rcu_gp_init(void)
* go offline later. Please also refer to "Hotplug CPU" section
* of RCU's Requirements documentation.
*/
- rcu_state.gp_state = RCU_GP_ONOFF;
+ WRITE_ONCE(rcu_state.gp_state, RCU_GP_ONOFF);
rcu_for_each_leaf_node(rnp) {
smp_mb(); // Pair with barriers used when updating ->ofl_seq to odd values.
firstseq = READ_ONCE(rnp->ofl_seq);
@@ -1833,7 +1833,7 @@ static bool rcu_gp_init(void)
* The grace period cannot complete until the initialization
* process finishes, because this kthread handles both.
*/
- rcu_state.gp_state = RCU_GP_INIT;
+ WRITE_ONCE(rcu_state.gp_state, RCU_GP_INIT);
rcu_for_each_node_breadth_first(rnp) {
rcu_gp_slow(gp_init_delay);
raw_spin_lock_irqsave_rcu_node(rnp, flags);
@@ -1932,17 +1932,22 @@ static void rcu_gp_fqs_loop(void)
ret = 0;
for (;;) {
if (!ret) {
- rcu_state.jiffies_force_qs = jiffies + j;
+ WRITE_ONCE(rcu_state.jiffies_force_qs, jiffies + j);
+ /*
+ * jiffies_force_qs before RCU_GP_WAIT_FQS state
+ * update; required for stall checks.
+ */
+ smp_wmb();
WRITE_ONCE(rcu_state.jiffies_kick_kthreads,
jiffies + (j ? 3 * j : 2));
}
trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq,
TPS("fqswait"));
- rcu_state.gp_state = RCU_GP_WAIT_FQS;
+ WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_FQS);
ret = swait_event_idle_timeout_exclusive(
rcu_state.gp_wq, rcu_gp_fqs_check_wake(&gf), j);
rcu_gp_torture_wait();
- rcu_state.gp_state = RCU_GP_DOING_FQS;
+ WRITE_ONCE(rcu_state.gp_state, RCU_GP_DOING_FQS);
/* Locking provides needed memory barriers. */
/* If grace period done, leave loop. */
if (!READ_ONCE(rnp->qsmask) &&
@@ -2056,7 +2061,7 @@ static void rcu_gp_cleanup(void)
trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq, TPS("end"));
rcu_seq_end(&rcu_state.gp_seq);
ASSERT_EXCLUSIVE_WRITER(rcu_state.gp_seq);
- rcu_state.gp_state = RCU_GP_IDLE;
+ WRITE_ONCE(rcu_state.gp_state, RCU_GP_IDLE);
/* Check for GP requests since above loop. */
rdp = this_cpu_ptr(&rcu_data);
if (!needgp && ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed)) {
@@ -2095,12 +2100,12 @@ static int __noreturn rcu_gp_kthread(void *unused)
for (;;) {
trace_rcu_grace_period(rcu_state.name, rcu_state.gp_seq,
TPS("reqwait"));
- rcu_state.gp_state = RCU_GP_WAIT_GPS;
+ WRITE_ONCE(rcu_state.gp_state, RCU_GP_WAIT_GPS);
swait_event_idle_exclusive(rcu_state.gp_wq,
READ_ONCE(rcu_state.gp_flags) &
RCU_GP_FLAG_INIT);
rcu_gp_torture_wait();
- rcu_state.gp_state = RCU_GP_DONE_GPS;
+ WRITE_ONCE(rcu_state.gp_state, RCU_GP_DONE_GPS);
/* Locking provides needed memory barrier. */
if (rcu_gp_init())
break;
@@ -2115,9 +2120,9 @@ static int __noreturn rcu_gp_kthread(void *unused)
rcu_gp_fqs_loop();
/* Handle grace-period end. */
- rcu_state.gp_state = RCU_GP_CLEANUP;
+ WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANUP);
rcu_gp_cleanup();
- rcu_state.gp_state = RCU_GP_CLEANED;
+ WRITE_ONCE(rcu_state.gp_state, RCU_GP_CLEANED);
}
}
diff --git a/kernel/rcu/tree_stall.h b/kernel/rcu/tree_stall.h
index 70d48c5..4ff7990 100644
--- a/kernel/rcu/tree_stall.h
+++ b/kernel/rcu/tree_stall.h
@@ -468,6 +468,37 @@ static void rcu_check_gp_kthread_starvation(void)
}
}
+/* Complain about missing wakeups from expired fqs wait timer */
+static void rcu_check_gp_kthread_expired_fqs_timer(void)
+{
+ struct task_struct *gpk = rcu_state.gp_kthread;
+ short gp_state;
+ unsigned long jiffies_fqs;
+ int cpu;
+
+ gp_state = READ_ONCE(rcu_state.gp_state);
+ /*
+ * Order reads of .gp_state and .jiffies_force_qs.
+ * Matching smp_wmb() is present in rcu_gp_fqs_loop().
+ */
+ smp_rmb();
+ jiffies_fqs = READ_ONCE(rcu_state.jiffies_force_qs);

Or you could use smp_load_acquire() instead of the smp_rmb()/READ_ONCE()
combination. (Other parts of the stall-warning code might also be able
to do this, given that they predate smp_load_acquire().)


Done for this part; will check other parts for cleanup.

+
+ if (gp_state == RCU_GP_WAIT_FQS &&
+ time_before(jiffies + RCU_STALL_MIGHT_MIN, jiffies_fqs) &&
+ gpk && !READ_ONCE(gpk->on_rq)) {
+ cpu = task_cpu(gpk);
+ pr_err("%s kthread timer wakeup didn't happen for %ld jiffies! g%ld f%#x %s(%d) ->state=%#lx\n",
+ rcu_state.name, (jiffies - jiffies_fqs),
+ (long)rcu_seq_current(&rcu_state.gp_seq),
+ data_race(rcu_state.gp_flags),
+ gp_state_getname(RCU_GP_WAIT_FQS), RCU_GP_WAIT_FQS,
+ gpk->state);
+ pr_err("\tPossible timer handling issue on cpu=%d timer-softirq=%u\n",
+ cpu, kstat_softirqs_cpu(TIMER_SOFTIRQ, cpu));
+ }

Otherwise looks plausible.

+}
+
static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
{
int cpu;
@@ -530,6 +561,7 @@ static void print_other_cpu_stall(unsigned long gp_seq, unsigned long gps)
jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
rcu_check_gp_kthread_starvation();
+ rcu_check_gp_kthread_expired_fqs_timer();
panic_on_rcu_stall();
@@ -565,6 +597,7 @@ static void print_cpu_stall(unsigned long gps)
(long)rcu_seq_current(&rcu_state.gp_seq), totqlen);
rcu_check_gp_kthread_starvation();
+ rcu_check_gp_kthread_expired_fqs_timer();

I don't have an opinion about the ordering of this output within the
RCU CPU stall warning message, but I do encourage you to try out a
few different options.


I think all stall warnings which describe GP kthread state - like starvation, timer wakeup, can be kept close to each other?

Here is one possible way to test this code:

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs TREE09 --bootargs "rcutorture.stall_cpu=25 rcutorture.stall_cpu_irqsoff=1" --trust-make

As an alternative to the above "--bootargs" parameter, the
rcutorture.stall_gp_kthread module parameter should also be interesting.
If I remember correctly, it should result in a stall without causing
your timer code above to print anything.

See recent paulmck.livejournal.com posts for additional information
on rcutorture and running it.

Thanx, Paul


Thanks! Wasn't able to get the warning with below:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs TREE09 --bootargs "rcutorture.stall_cpu=25 rcutorture.stall_cpu_irqsoff=1" --trust-make

Possibly, for single cpu case, while interrupts are disabled, stall
checks didn't happen?

Was able to reproduce with

tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 2 --configs TREE01 --bootargs "rcutorture.stall_gp_kthread=50" --trust-make

Below is the log output; I moved the warning above starvation logs,
so that its visible before nmi backtrace printed from starvation logs:

[ 32.780341] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 32.782062] (detected by 4, t=21003 jiffies, g=981, q=36134)
[ 32.782065] rcu: All QSes seen, last rcu_preempt kthread activity 20958 (4294699485-4294678527), jiffies_till_next_fqs=3, root ->qsmask 0x0
[ 32.786652] rcu: rcu_preempt kthread timer wakeup didn't happen for 20957 jiffies! g981 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x402
[ 32.788868] rcu: Possible timer handling issue on cpu=4 timer-softirq=11142
[ 32.789946] rcu: rcu_preempt kthread starved for 20960 jiffies! g981 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=4 fqs: 4294678530/4294699487
[ 32.791866] rcu: Unless rcu_preempt kthread gets sufficient CPU time, OOM is now expected behavior.
[ 32.793258] rcu: RCU grace-period kthread stack dump:
[ 32.794046] task:rcu_preempt state:I stack:14624 pid: 10 ppid: 2 flags:0x00004000
[ 32.795326] Call Trace:
[ 32.795718] __schedule+0x269/0x6b0
[ 32.796262] schedule+0x41/0xe0
[ 32.796704] schedule_timeout+0x19c/0x2e0



Thanks
Neeraj

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation