Re: [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter()

From: Paul E. McKenney
Date: Wed Jul 20 2016 - 16:58:39 EST


On Wed, Jul 20, 2016 at 05:13:58PM +0200, Oleg Nesterov wrote:
> On 07/19, Paul E. McKenney wrote:
> >
> > On Sat, Jul 16, 2016 at 07:10:07PM +0200, Oleg Nesterov wrote:
> >
> > > And, there is another possible transition, GP_ENTER -> GP_IDLE, because
> > > not it is possible to call __rcu_sync_enter() and rcu_sync_exit() in any
> > > state (except obviously they should be balanced), and they do not block.
> >
> > Hmmm... I am assuming that the only threads allowed to invoke
> > rcu_sync_exit() are those that have executed an unpaired rcu_sync_enter().
> > With that assumption, rcu_sync_enter() is illegal if GP_IDLE or GP_ENTER,
> > because the rcu_sync_enter() calls won't return until the state reaches
> > GP_PASSED.
>
> Not sure I understand... Obviously rcu_sync_exit() can only be used if
> it pairs with the preceeding __rcu_sync_enter() or rcu_sync_enter(), if
> nothing else rsp->gp_count-- must not underflow.

Agreed, just wondering what the no-wait use case is.

> rcu_sync_enter() or __rcu_sync_enter() is legal in any state, the latter
> won't block.

Actually, I had no idea that __rcu_sync_enter() was intended for anything
other than internal use.

Other than that, agreed, with the exception that it is illegal after
rcu_sync_dtor() has been called.

> > If you feel strongly about allowing rcu_sync_exit() in GP_ENTER state,
> > could you please tell me your use case? Or am I confused?
>
> I'll write another email, let me reply to the code review first. And
> thanks for your review!

Fair enough, hope it is helpful.

> > > The only blocking call is __rcu_sync_wait() which actually waits for GP.
> > > Obviously should only be called if gp_count != 0, iow after __rcu_sync_enter.
> >
> > Well, rcu_sync_dtor() as well. But to your point, it is blocking in
> > rcu_barrier(), not in wait_event().
>
> Yes, yes, sure. I mean't the only blocking call in enter/exit paths.

Pedantic of me, isn't it? ;-)

> > > #define rss_lock gp_wait.lock
> >
> > Interesting... Should there instead be a wait_queue_lock() and
> > wait_queue_unlock()? If someone changes how wait-queue locking work,
> > this one might be a bit tricky.
>
> Well, the current version does the same... I do not mind to add
> wait_queue_lock(), but personally I do not think this would be better.
> Either way the code should not use gp_wait.lock directly, to make it
> clear that we abuse the spinlock we already have in rcu_sync->gp_wait.
>
> To me
>
> spin_lock_irq(&rsp->rss_lock);
> spin_unlock_irq(&rsp->rss_lock);
>
> looks a bit better than
>
> wait_queue_lock();
> wait_queue_unlock();
>
> because we need to protect the state, not the wait-queue. And if
> someone changes how wait-queue locking work, we can just actually
> add "spinlock_t rss_lock" into rcu_sync and remove that "#define".

I don't feel all that strongly about it, but it did look strange.

In any case, please don't reach inside RCU in this way without asking
me first! ;-)

> > > static void rcu_sync_call(struct rcu_sync *rsp)
> > > {
> > > // TODO: THIS IS SUBOPTIMAL. We want to call it directly
> > > // if rcu_blocking_is_gp() == T, but it has might_sleep().
> >
> > Careful! This optimization works only for RCU-sched and RCU-bh.
> > With normal RCU, you can be tripped up by tasks preempted within RCU
> > read-side critical sections should CONFIG_PREEMPT=y.
>
> Yes, thanks, I understand ;) another reason why I do not want to add
> this optimization into the initial version.

So I should take this as a request to export rcu_blocking_is_gp()?

I would probably need to change the name. Or comment it more heavily.
Or hand out coccinelle scripts looking for this pattern:

if (!rcu_blocking_is_gp())
synchronize_rcu();

Or maybe all of the above.

> > > static void rcu_sync_func(struct rcu_head *rcu)
> > > {
> > > struct rcu_sync *rsp = container_of(rcu, struct rcu_sync, cb_head);
> > > unsigned long flags;
> > >
> > > BUG_ON(rsp->gp_state == GP_IDLE);
> >
> > If you really want rcu_sync_exit() to be invoked before the corresponding
> > rcu_sync_enter() returns, and if you want that to set the state to
> > GP_IDLE, then the above BUG_ON() can fire. This is why I was assuming
> > that rcu_sync_enter() must return before the corresponding rcu_sync_exit()
> > can start.
>
> Again, can't understand. Let me repeat, of course rcu_sync_exit() can only
> be called after __rcu_sync_enter() (or rcu_sync_enter() which does
> __rcu_sync_enter + wait), they obviously should be balanced.
>
> But, it should be fine to call rcu_sync_exit() right after __rcu_sync_enter()
> returns, without waiting for GP_PASSED, iow without __rcu_sync_wait().
>
> But probably I misunderstood you... Either way, rcu_sync_func() can never
> hit gp_state == GP_IDLE. Because this state can only be set by rcu_sync_func(),
> nobody else can set this state. Ignoring the initial state of course.
>
> And every caller of rcu_sync_call() (which does call_rcu(rcu_sync_func))
> updates the state first:
>
> rsp->gp_state = GP_ENTER; // __rcu_sync_enter()
> rcu_sync_call(rsp);
>
> rsp->gp_state = GP_EXIT; // rcu_sync_exit() or REPLAY
> rcu_sync_call(rsp);
>
> so I simply can't understand your concern.

You are right, the state is always set to something other than GP_IDLE
before the callback is registered. Never mind!

> > > BUG_ON(rsp->gp_state == GP_PASSED);
> >
> > READ_ONCE() would feel better in both of the above BUG_ON()s, given that
> > we are not yet holding the lock.
>
> The lockless checks are fine, both states can only be set by the callback.
> As for READ_ONCE(), please see below.
>
> > > spin_lock_irqsave(&rsp->rss_lock, flags);
> > > if (rsp->gp_count) {
> > > /*
> > > * We're at least a GP after the first __rcu_sync_enter().
> > > */
> >
> > Don't we need a wakeup here?
>
> Heh ;) let me show you what I have in kernel/rcu/sync.c~
>
> if (rsp->gp_count) {
> /*
> * COMMENT.
> */
> rsp->gp_state = GP_PASSED;
> wake_up_locked(&rsp->gp_wait);
> } else if (rsp->gp_state == GP_REPLAY) {
>
> apparently I removed this wake_up_locked() by accident when I edited
> the "COMMENT" placeholder ;)

Sounds like something that I might do! ;-)

> Thanks!

No problem -- you have done the same for me many times!

> > > bool __rcu_sync_enter(struct rcu_sync *rsp)
> > > {
> > > int gp_count, gp_state;
> > >
> > > spin_lock_irq(&rsp->rss_lock);
> > > gp_count = rsp->gp_count++;
> > > gp_state = rsp->gp_state;
> > > if (gp_state == GP_IDLE) {
> > > rsp->gp_state = GP_ENTER;
> > > rcu_sync_call(rsp);
> > > }
> > > spin_unlock_irq(&rsp->rss_lock);
> > >
> > > BUG_ON(gp_count != 0 && gp_state == GP_IDLE);
> > > BUG_ON(gp_count == 0 && gp_state == GP_PASSED);
> >
> > Isn't gp_count==0 illegal here regardless of gp_state value?
>
> No. gp_count is the old value of rsp->gp_count, before it was
> incremented by us.

Got it -- blindness on my part, I guess.

> > > return gp_state < GP_PASSED;
> >
> > And the above statement is another reason why I believe that it
> > should be illegal to invoke rcu_sync_exit() until after the matching
> > rcu_sync_enter() has returned. If we were preempted just before the
> > above "return" statement, we might be in GP_IDLE state upon return,
>
> How? Note the GP_IDLE -> GP_ENTER transition above, it can't be GP_IDLE.
> And since we incremented rsp->gp_count, GP_IDLE is not possibly until
> rcu_sync_exit() which pairs with this "enter" decrements the counter
> and fires the RCU callback which will set GP_IDLE after GP.

Right -- I was again confusing gp_state with rsp->gp_state.

> > > void __rcu_sync_wait(struct rcu_sync *rsp)
> > > {
> > > BUG_ON(rsp->gp_state == GP_IDLE);
> > > BUG_ON(rsp->gp_count == 0);
> > >
> > > wait_event(rsp->gp_wait, rsp->gp_state >= GP_PASSED);
> >
> > I would feel better if all the references to ->gp_state and ->gp_count
> > outside of the lock be READ_ONCE(). Compilers can be tricky beasts.
>
> Oh. I won't argue too much, but I do not agree.
>
> Because I started to hate these _ONCE() helpers a long ago. Why? Because
> (imo) we have too many XXX_ONCE() added "just in case, because compiler
> can be buggy".
>
> Now, when I look some particular READ_ONCE() I can almost never understand
> the reason: do we really need it for correctness? or this is another "it
> does not hurt" case?
>
> So why do we need READ_ONCE() here? How this wait_event() differs from
> any other wait_event() which does a plain LOAD? Should we "fix" them all?

I think of them as "this accesses a shared variable". I have been
learning more about what the compiler is permitted to do to normal
variables, and I am paranoid. The default compiler assumption that no
other thread is looking at or modifying the variable is a very powerful
and profound assumption, and it points to a rather stunning array of
"creative" optimizations.

It has been suggested that gcc implement a -std=kernel, which might be
a decent alternative. However, we would need to specify what we want
gcc to do (and not do) in that case.

This is not an easy thing, nor will it go away. Though I suppose that
doesn't mean we cannot ignore it. For awhile, anyway. ;-)

> > rcu_sync_enter() checks to see if the rcu_sync structure is in writer
> > mode, and, if not, initiates a transition to writer mode and waits for
> > that transition to complete. Either way, a writer reference is acquired.
> >
> > > void rcu_sync_enter(struct rcu_sync *rsp)
> > > {
> > > if (__rcu_sync_enter(rsp))
> > > __rcu_sync_wait(rsp);
> > > }
>
> Yes. And of course it could call __rcu_sync_wait() unconditionally:
>
> __rcu_sync_enter(rsp);
> __rcu_sync_wait(rsp);
>
> This "if" is more the documentation than optimization.

Understood.

> > Here is what I believe rcu_sync_exit() does:
> >
> > rcu_sync_exit() removes an updater reference, and if there are no more,
> > starts the transition to reader mode. Note that this must handle
> > transitions back to writer mode that occur before the transition to
> > reader mode has fully completed.
> >
> > > void rcu_sync_exit(struct rcu_sync *rsp)
> > > {
> > > BUG_ON(rsp->gp_state == GP_IDLE);
> >
> > BUG_ON(READ_ONCE(rsp->gp_state) == GP_ENTER);
> >
> > If you do allow rcu_sync_exit() while GP_ENTER, then all the readers

s/readers/writers/, sorry!

> > can do rcu_sync_exit() in that state, leaving the state at GP_PASSED
> > instead of the desired GP_IDLE.
>
> Can't understand this too...
>
> Firstly, readers do not do rcu_sync_enter/exit. They only use
> rcu_sync_is_idle() which checks gp_state == GP_IDLE.
>
> As for enter() path, we only need to ensure that after __rcu_sync_wait()
> returns, all CPU's must see gp_state != GP_IDLE. Yes, yes, this is not
> really true, we also need the memory barriers implied by RCU, lets ignore
> this to simplify the discussion.
>
> Now lets suppose that rcu_sync_exit() is called while GP_ENTER. This is
> only possible if __rcu_sync_wait() was not called by us. IOW, this thread
> does
>
> __rcu_sync_enter();
> rcu_sync_exit();
>
> by any reason.
>
> Now. If gp_count is not zero after we decrement it, rcu_sync_exit() does
> nothing.
>
> If it is zero, we could set GP_IDLE right now. We do not care about the
> readers in this case, exactly because __rcu_sync_wait() was not called
> and thus we can't even know if other CPU's had any chance to observe the
> gp_state != GP_IDLE state set by __rcu_sync_enter().
>
> But. We can't actually set GP_IDLE, because GP_ENTER means that the RCU
> callback is already pending and this can break the next enter(). IOW, this
> would break the "GP_IDLE is owned by __rcu_sync_enter" contract.
>
> And we do not need to do this! We rely on the already pending rcu_sync_func()
> which will set GP_IDLE _if_ the counter is still zero.
>
> Now suppose that another writer (__rcu_sync_enter) comes and increments
> gp_count right before rcu_sync_func() is called by RCU. In this case
> rcu_sync_func() will notice rsp->gp_count != 0 and correctly set GP_PASSED.
>
> Note that in this case that new writer won't wait for a full GP, quite
> possibly __rcu_sync_wait() won't even block. And this is correct, we know
> that we have at least one full GP after the previous GP_IDLE -> GP_ENTER
> transition.

Let me try laying out the sequence of events that I was concerned about,
which will hopefully either prove my point or identify my error:

o The rcu_sync structure is in state GP_IDLE with count 0 and
no CB posted. (State A in my table.)

o Task A invokes __rcu_sync_enter(), which puts the state at
GP_ENTER with count 1 and a CB posted.

o Task B also invokes __rcu_sync_enter(), which increments count
to 2, but leaves the state otherwise unchanged.

o Both Task A and Task B invoke rcu_sync_exit(), which decrements
count back down to zero, but otherwise leaves the state unchanged.
So we are at GP_ENTER with count 0.

Ah, but the CB is still pending, and when it is invoked, as you say above,
it will see that count==0 and therefore set the state to GP_IDLE.

OK, you are quite right.

> > And here is what I believe rcu_sync_dtor() does:
> >
> > rcu_sync_dtor() cleans up at end, waiting for a full transition back
> > into reader mode. Anything invoking rcu_sync_dtor() must have ensured
> > that there will be no more calls to rcu_sync_enter() and rcu_sync_exit().
>
> Sure, we are going to destroy/free this object.
>
> > > void rcu_sync_dtor(struct rcu_sync *rsp)
> > > {
> > > int gp_state;
> > >
> > > BUG_ON(rsp->gp_count);
> > > BUG_ON(rsp->gp_state == GP_PASSED);
> > >
> > > spin_lock_irq(&rsp->rss_lock);
> > > if (rsp->gp_state == GP_REPLAY)
> > > rsp->gp_state = GP_EXIT;
> >
> > OK, this ensures that the .wait() below will wait for the callback, but
> > it might result in some RCU read-side critical sections still being in
> > flight after rcu_sync_dtor() completes.
>
> Hmm. Obviously, the caller should prevent this somehow or it is simply
> buggy. Or I misunderstood.

Hard to say without knowing what the permitted use cases are...

Me, I would make rcu_sync_dtor() wait the extra grace period in this case.
It should be a low-probability race, and it reduces the _dtor-time
state space.

What it looks like you are saying is that the caller must not only ensure
that there will never again be a __rcu_sync_enter(), rcu_sync_enter(),
or rcu_sync_exit() (or, I suppose, rcu_sync_dtor()) for this rcu_sync
structure, but must also ensure that any relevant RCU read-side critical
sections have completed.

I guess that either way is OK, but whatever the rules are, they do need
to be clearly communicated. Of course, this does indicate a bug in
rcu_sync_dtor()'s current comment header. ;-)

> > In other words, any transition
> > started by an rcu_sync_exit() will normally wait until all pre-existing
> > RCU readers complete -- unless that rcu_sync_exit() set the state to
> > GP_REPLAY and was quickly followed by rcu_sync_dtor().
>
> Again, we are going to (say) free this memory. The caller must ensure
> that rcu_sync_is_idle(rsp) or anything else which can touch this object
> is not possible.
>
> The only reason for rcu_sync_dtor() is that we can't free/destroy this
> memory until we ensure that the pending rcu_sync_func() finishes. And
> of course we need to ensure it won't re-arm itself, that is why we
> do GP_REPLAY -> GP_EXIT transition.
>
> And in any case, this patch doesn't change rcu_sync_dtor() logic, it
> is the same even if ->cb_state goes away.

Agreed, this is something that I should have complained about eleven
months ago. So I am slow! ;-)

> > We need to add the first wakeup before we can add another one. ;-)
>
> Yes ;) see above.
>
> > But I don't see why we need a second wakeup, given that rcu_barrier()
> > wakes itself up.
> >
> > > // use wait_event + spin_lock_wait, remove gp_ops->wait().
> > > if (gp_state != GP_IDLE) {
> > > gp_ops[rsp->gp_type].wait();
> > > BUG_ON(rsp->gp_state != GP_IDLE);
> > > }
> > > }
>
> Because we can remove .wait() op and simplify the code a bit more. This
> also can be more efficient, altough this doesn't really matter. The main
> reason is that this would be more clear: we only need to sync with one
> particular callback, we do not need to wait for all outstanding RCU
> callbacks to complete. Anyway, this is TODO and needs another patch, we
> will discuss this later.

Yes, you could have a flag that the callback manipulates and a
wait_event()/wake_up() or similar, which would be quite a bit lighter
weight than rcu_barrier(). So it does sound worthwhile.

The only concern would be scalability, but we should run into that
problem before solving it, and the callback will be the least of
our worries.

Thanx, Paul

> > And please see below for the state table and assumptions.
>
> I'll try to read it carefully later.
>
> Thanks Paul!
>
> Oleg.
>