Re: [PATCH RFC kernel/rcu] Eliminate BUG_ON() for sync.c

From: Paul E. McKenney
Date: Tue Oct 30 2018 - 13:55:49 EST


On Mon, Oct 22, 2018 at 06:14:40PM +0200, Oleg Nesterov wrote:
> On 10/22, Paul E. McKenney wrote:
> >
> > > > @@ -125,12 +125,12 @@ void rcu_sync_enter(struct rcu_sync *rsp)
> > > > rsp->gp_state = GP_PENDING;
> > > > spin_unlock_irq(&rsp->rss_lock);
> > > >
> > > > - BUG_ON(need_wait && need_sync);
> > > > -
> > > > if (need_sync) {
> > > > gp_ops[rsp->gp_type].sync();
> > > > rsp->gp_state = GP_PASSED;
> > > > wake_up_all(&rsp->gp_wait);
> > > > + if (WARN_ON_ONCE(need_wait))
> > > > + wait_event(rsp->gp_wait, rsp->gp_state == GP_PASSED);
> > >
> > > This wait_event(gp_state == GP_PASSED) is pointless, note that this branch
> > > does gp_state = GP_PASSED 2 lines above.
> >
> > OK, I have removed this one.
> >
> > > And if we add WARN_ON_ONCE(need_wait), then we should probably also add
> > > WARN_ON_ONCE(need_sync) into the next "if (need_wait)" branch just for
> > > symmetry.
> >
> > But in that case, the earlier "if" prevents "need_sync" from ever getting
> > there, unless I lost the thread here.
>
> Yes, you are right, we would also need to remove "else",
>
> > Should I remove the others?
>
> Up to you, I am fine either way.
>
> IOW, feel free to remove these BUG_ON's altogether, or turn them all into
> WARN_ON_ONCE's, whatever you like more.
>
> > > ----------------------------------------------------------------------------
> > > Damn.
> > >
> > > This suddenly reminds me that I rewrote this code completely, and you even
> > > reviewed the new implementation and (iirc) acked it!
> > >
> > > However, I failed to force myself to rewrite the comments, and that is why
> > > I didn't send the "official" patch :/
> > >
> > > May be some time...
> >
> > Could you please point me at the last email thread? Yes, I should be
> > able to find it, but I would probably get the wrong one. :-/
>
> probably this one,
>
> [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter()
> https://lkml.org/lkml/2016/7/16/150
>
> but I am not sure, will recheck tomorrow.
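
The point quoted above about the "else" can be checked with a small
userspace sketch. This is only a model of the control flow, assuming the
two branches form an if / else-if chain as the quoted discussion implies;
pick_branch() and wait_branch_sees_need_sync() are hypothetical helpers,
not functions from kernel/rcu/sync.c.

```c
#include <assert.h>
#include <stdbool.h>

enum branch { BRANCH_SYNC, BRANCH_WAIT, BRANCH_NONE };

/* Hypothetical helper: which branch an if / else-if chain selects. */
static enum branch pick_branch(bool need_sync, bool need_wait)
{
	if (need_sync)
		return BRANCH_SYNC;
	else if (need_wait)
		return BRANCH_WAIT;
	return BRANCH_NONE;
}

/*
 * Returns true if a WARN_ON_ONCE(need_sync) placed inside the
 * "need_wait" branch could ever fire, checked over all four
 * combinations of the two flags.
 */
static bool wait_branch_sees_need_sync(void)
{
	for (int s = 0; s <= 1; s++)
		for (int w = 0; w <= 1; w++)
			if (pick_branch(s, w) == BRANCH_WAIT && s)
				return true;
	return false;
}
```

With the "else" in place the check can never trigger, so a symmetric
warning in the second branch would only make sense if the "else" were
removed.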

Just following up... Here is what I currently have.

Thanx, Paul

------------------------------------------------------------------------

commit 1c1d315dfb7049d0233b89948a3fbcb61ea15d26
Author: Dennis Krein <Dennis.Krein@xxxxxxxxxx>
Date: Fri Oct 26 07:38:24 2018 -0700

srcu: Lock srcu_data structure in srcu_gp_start()

The srcu_gp_start() function is called with the srcu_struct structure's
->lock held, but not with the srcu_data structure's ->lock. This is
problematic because this function accesses and updates the srcu_data
structure's ->srcu_cblist, which is protected by that lock. Failing to
hold this lock can result in corruption of the SRCU callback lists,
which in turn can result in arbitrarily bad results.

This commit therefore makes srcu_gp_start() acquire the srcu_data
structure's ->lock across the calls to rcu_segcblist_advance() and
rcu_segcblist_accelerate(), thus preventing this corruption.

Reported-by: Bart Van Assche <bvanassche@xxxxxxx>
Reported-by: Christoph Hellwig <hch@xxxxxxxxxxxxx>
Signed-off-by: Dennis Krein <Dennis.Krein@xxxxxxxxxx>
Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxx>

diff --git a/kernel/rcu/srcutree.c b/kernel/rcu/srcutree.c
index 60f3236beaf7..697a2d7e8e8a 100644
--- a/kernel/rcu/srcutree.c
+++ b/kernel/rcu/srcutree.c
@@ -451,10 +451,12 @@ static void srcu_gp_start(struct srcu_struct *sp)

 	lockdep_assert_held(&ACCESS_PRIVATE(sp, lock));
 	WARN_ON_ONCE(ULONG_CMP_GE(sp->srcu_gp_seq, sp->srcu_gp_seq_needed));
+	spin_lock_rcu_node(sdp);  /* Interrupts already disabled. */
 	rcu_segcblist_advance(&sdp->srcu_cblist,
 			      rcu_seq_current(&sp->srcu_gp_seq));
 	(void)rcu_segcblist_accelerate(&sdp->srcu_cblist,
 				       rcu_seq_snap(&sp->srcu_gp_seq));
+	spin_unlock_rcu_node(sdp);  /* Interrupts remain disabled. */
 	smp_mb(); /* Order prior store to ->srcu_gp_seq_needed vs. GP start. */
 	rcu_seq_start(&sp->srcu_gp_seq);
 	state = rcu_seq_state(READ_ONCE(sp->srcu_gp_seq));
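
The locking rule described in the commit log (outer structure lock already
held, inner per-CPU lock taken only around the callback-list update) can be
sketched in userspace roughly as follows. This is an illustration under
stated assumptions, not kernel code: pthread mutexes stand in for the
kernel spinlocks, a flat array stands in for the segmented callback list,
and every model_* name is hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
struct model_sdp {
	pthread_mutex_t lock;     /* models the srcu_data ->lock */
	unsigned long cb_seq[4];  /* grace period each callback waits for */
	size_t ncbs;              /* callbacks still waiting */
	size_t ndone;             /* callbacks ready to invoke */
};

struct model_sp {
	pthread_mutex_t lock;     /* models the srcu_struct ->lock */
	unsigned long gp_seq;     /* models ->srcu_gp_seq */
	struct model_sdp sdp;
};

/* Caller must hold sdp->lock: retire callbacks whose GP has completed. */
static void model_advance(struct model_sdp *sdp, unsigned long seq)
{
	size_t i, kept = 0;

	for (i = 0; i < sdp->ncbs; i++) {
		if (sdp->cb_seq[i] <= seq)
			sdp->ndone++;
		else
			sdp->cb_seq[kept++] = sdp->cb_seq[i];
	}
	sdp->ncbs = kept;
}

/* Caller must hold sp->lock, as the lockdep assertion in the patch requires. */
static void model_gp_start(struct model_sp *sp)
{
	pthread_mutex_lock(&sp->sdp.lock);   /* the acquisition the patch adds */
	model_advance(&sp->sdp, sp->gp_seq);
	pthread_mutex_unlock(&sp->sdp.lock);
	sp->gp_seq++;                        /* crude stand-in for rcu_seq_start() */
}

/* Runs one model grace-period start; returns the number of ready callbacks. */
static size_t model_demo(void)
{
	struct model_sp sp = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.gp_seq = 2,
		.sdp = {
			.lock = PTHREAD_MUTEX_INITIALIZER,
			.cb_seq = { 1, 2, 3 },
			.ncbs = 3,
		},
	};

	pthread_mutex_lock(&sp.lock);
	model_gp_start(&sp);
	pthread_mutex_unlock(&sp.lock);
	return sp.sdp.ndone;
}
```

Calling model_demo() from a main() exercises the nesting order: the outer
lock is taken first and the inner lock is held only while the list is
modified, so any other path that updates the list under the inner lock
cannot interleave with the advance step.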