Re: [PATCH 01/11] rcu: avoid leaking exp_deferred_qs into next GP

From: Lai Jiangshan
Date: Thu Oct 31 2019 - 14:19:18 EST




On 2019/10/31 9:43, Paul E. McKenney wrote:
On Thu, Oct 31, 2019 at 10:07:56AM +0000, Lai Jiangshan wrote:
If exp_deferred_qs is incorrectly set and leaked to the next
exp GP, it may cause the next GP to be incorrectly prematurely
completed.

Could you please provide the sequence of events leading to a such a
failure?

I just felt nervous about "leaking" exp_deferred_qs.
I didn't carefully consider the sequence of events.

Now it turns out that I had misunderstood exp_deferred_qs,
so "leaking" is the wrong concept: preempt_disable()
is treated as rcu_read_lock(), and exp_deferred_qs
needs to be set.

Thanks
Lai

============ no need to read further:

rcu_read_lock()
// other cpu starts exp GP_A
preempt_schedule() // queues itself
rcu_read_unlock() // reports qs; other cpu is sending an IPI to me
preempt_disable()
rcu_exp_handler() // interrupt for GP_A leaves exp_deferred_qs set
// exp GP_A finishes
---------------above is one possible way to leave exp_deferred_qs set
preempt_enable()
interrupt before preempt_schedule()
rcu_read_lock()
rcu_read_unlock()
NESTED interrupt while rcu_read_lock_nesting is negative
rcu_read_lock()
// other cpu starts exp GP_B
NESTED interrupt for rcu_flavor_sched_clock_irq()
reports exp qs since rcu_read_lock_nesting < 0 and \
exp_deferred_qs is true
// exp GP_B completes
rcu_read_unlock()

This plausible sequence relies on a NESTED interrupt too,
and would be avoided by patch 2 even if NESTED interrupts
were allowed.


Also, did you provoke such a failure in testing? If so, an upgrade
to rcutorture would be good, so please tell me what you did to make
the failure happen.

I do like the reduction in state space, but I am a bit concerned about
the potential increase in contention on rnp->lock. Thoughts?

Thanx, Paul

Signed-off-by: Lai Jiangshan <laijs@xxxxxxxxxxxxxxxxx>
---
kernel/rcu/tree_exp.h | 23 ++++++++++++++---------
1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
index a0e1e51c51c2..6dec21909b30 100644
--- a/kernel/rcu/tree_exp.h
+++ b/kernel/rcu/tree_exp.h
@@ -603,6 +603,18 @@ static void rcu_exp_handler(void *unused)
struct rcu_node *rnp = rdp->mynode;
struct task_struct *t = current;
+ /*
+ * Note that there is a large group of race conditions that
+ * can have caused this quiescent state to already have been
+ * reported, so we really do need to check ->expmask first.
+ */
+ raw_spin_lock_irqsave_rcu_node(rnp, flags);
+ if (!(rnp->expmask & rdp->grpmask)) {
+ raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ return;
+ }
+ raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+
/*
* First, the common case of not being in an RCU read-side
* critical section. If also enabled or idle, immediately
@@ -628,17 +640,10 @@ static void rcu_exp_handler(void *unused)
* a future context switch. Either way, if the expedited
* grace period is still waiting on this CPU, set ->deferred_qs
* so that the eventual quiescent state will be reported.
- * Note that there is a large group of race conditions that
- * can have caused this quiescent state to already have been
- * reported, so we really do need to check ->expmask.
*/
if (t->rcu_read_lock_nesting > 0) {
- raw_spin_lock_irqsave_rcu_node(rnp, flags);
- if (rnp->expmask & rdp->grpmask) {
- rdp->exp_deferred_qs = true;
- t->rcu_read_unlock_special.b.exp_hint = true;
- }
- raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
+ rdp->exp_deferred_qs = true;
+ WRITE_ONCE(t->rcu_read_unlock_special.b.exp_hint, true);
return;
}
--
2.20.1