Re: [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter()
From: Oleg Nesterov
Date: Wed Jul 20 2016 - 11:13:55 EST
On 07/19, Paul E. McKenney wrote:
>
> On Sat, Jul 16, 2016 at 07:10:07PM +0200, Oleg Nesterov wrote:
>
> > And, there is another possible transition, GP_ENTER -> GP_IDLE, because
> > not it is possible to call __rcu_sync_enter() and rcu_sync_exit() in any
> > state (except obviously they should be balanced), and they do not block.
>
> Hmmm... I am assuming that the only threads allowed to invoke
> rcu_sync_exit() are those that have executed an unpaired rcu_sync_enter().
> With that assumption, rcu_sync_enter() is illegal if GP_IDLE or GP_ENTER,
> because the rcu_sync_enter() calls won't return until the state reaches
> GP_PASSED.
Not sure I understand... Obviously rcu_sync_exit() can only be used if
it pairs with the preceeding __rcu_sync_enter() or rcu_sync_enter(), if
nothing else rsp->gp_count-- must not underflow.
rcu_sync_enter() or __rcu_sync_enter() is legal in any state, the latter
won't block.
> If you feel strongly about allowing rcu_sync_exit() in GP_ENTER state,
> could you please tell me your use case? Or am I confused?
I'll write another email, let me reply to the code review first. And
thanks for your review!
> > The only blocking call is __rcu_sync_wait() which actually waits for GP.
> > Obviously should only be called if gp_count != 0, iow after __rcu_sync_enter.
>
> Well, rcu_sync_dtor() as well. But to your point, it is blocking in
> rcu_barrier(), not in wait_event().
Yes, yes, sure. I mean't the only blocking call in enter/exit paths.
> > #define rss_lock gp_wait.lock
>
> Interesting... Should there instead be a wait_queue_lock() and
> wait_queue_unlock()? If someone changes how wait-queue locking work,
> this one might be a bit tricky.
Well, the current version does the same... I do not mind to add
wait_queue_lock(), but personally I do not think this would be better.
Either way the code should not use gp_wait.lock directly, to make it
clear that we abuse the spinlock we already have in rcu_sync->gp_wait.
To me
spin_lock_irq(&rsp->rss_lock);
spin_unlock_irq(&rsp->rss_lock);
looks a bit better than
wait_queue_lock();
wait_queue_unlock();
because we need to protect the state, not the wait-queue. And if
someone changes how wait-queue locking work, we can just actually
add "spinlock_t rss_lock" into rcu_sync and remove that "#define".
> > static void rcu_sync_call(struct rcu_sync *rsp)
> > {
> > // TODO: THIS IS SUBOPTIMAL. We want to call it directly
> > // if rcu_blocking_is_gp() == T, but it has might_sleep().
>
> Careful! This optimization works only for RCU-sched and RCU-bh.
> With normal RCU, you can be tripped up by tasks preempted within RCU
> read-side critical sections should CONFIG_PREEMPT=y.
Yes, thanks, I understand ;) another reason why I do not want to add
this optimization into the initial version.
> > static void rcu_sync_func(struct rcu_head *rcu)
> > {
> > struct rcu_sync *rsp = container_of(rcu, struct rcu_sync, cb_head);
> > unsigned long flags;
> >
> > BUG_ON(rsp->gp_state == GP_IDLE);
>
> If you really want rcu_sync_exit() to be invoked before the corresponding
> rcu_sync_enter() returns, and if you want that to set the state to
> GP_IDLE, then the above BUG_ON() can fire. This is why I was assuming
> that rcu_sync_enter() must return before the corresponding rcu_sync_exit()
> can start.
Again, can't understand. Let me repeat, of course rcu_sync_exit() can only
be called after __rcu_sync_enter() (or rcu_sync_enter() which does
__rcu_sync_enter + wait), they obviously should be balanced.
But, it should be fine to call rcu_sync_exit() right after __rcu_sync_enter()
returns, without waiting for GP_PASSED, iow without __rcu_sync_wait().
But probably I misunderstood you... Either way, rcu_sync_func() can never
hit gp_state == GP_IDLE. Because this state can only be set by rcu_sync_func(),
nobody else can set this state. Ignoring the initial state of course.
And every caller of rcu_sync_call() (which does call_rcu(rcu_sync_func))
updates the state first:
rsp->gp_state = GP_ENTER; // __rcu_sync_enter()
rcu_sync_call(rsp);
rsp->gp_state = GP_EXIT; // rcu_sync_exit() or REPLAY
rcu_sync_call(rsp);
so I simply can't understand your concern.
> > BUG_ON(rsp->gp_state == GP_PASSED);
>
> READ_ONCE() would feel better in both of the above BUG_ON()s, given that
> we are not yet holding the lock.
The lockless checks are fine, both states can only be set by the callback.
As for READ_ONCE(), please see below.
> > spin_lock_irqsave(&rsp->rss_lock, flags);
> > if (rsp->gp_count) {
> > /*
> > * We're at least a GP after the first __rcu_sync_enter().
> > */
>
> Don't we need a wakeup here?
Heh ;) let me show you what I have in kernel/rcu/sync.c~
if (rsp->gp_count) {
/*
* COMMENT.
*/
rsp->gp_state = GP_PASSED;
wake_up_locked(&rsp->gp_wait);
} else if (rsp->gp_state == GP_REPLAY) {
apparently I removed this wake_up_locked() by accident when I edited
the "COMMENT" placeholder ;)
Thanks!
> > bool __rcu_sync_enter(struct rcu_sync *rsp)
> > {
> > int gp_count, gp_state;
> >
> > spin_lock_irq(&rsp->rss_lock);
> > gp_count = rsp->gp_count++;
> > gp_state = rsp->gp_state;
> > if (gp_state == GP_IDLE) {
> > rsp->gp_state = GP_ENTER;
> > rcu_sync_call(rsp);
> > }
> > spin_unlock_irq(&rsp->rss_lock);
> >
> > BUG_ON(gp_count != 0 && gp_state == GP_IDLE);
> > BUG_ON(gp_count == 0 && gp_state == GP_PASSED);
>
> Isn't gp_count==0 illegal here regardless of gp_state value?
No. gp_count is the old value of rsp->gp_count, before it was
incremented by us.
> > return gp_state < GP_PASSED;
>
> And the above statement is another reason why I believe that it
> should be illegal to invoke rcu_sync_exit() until after the matching
> rcu_sync_enter() has returned. If we were preempted just before the
> above "return" statement, we might be in GP_IDLE state upon return,
How? Note the GP_IDLE -> GP_ENTER transition above, it can't be GP_IDLE.
And since we incremented rsp->gp_count, GP_IDLE is not possibly until
rcu_sync_exit() which pairs with this "enter" decrements the counter
and fires the RCU callback which will set GP_IDLE after GP.
> > void __rcu_sync_wait(struct rcu_sync *rsp)
> > {
> > BUG_ON(rsp->gp_state == GP_IDLE);
> > BUG_ON(rsp->gp_count == 0);
> >
> > wait_event(rsp->gp_wait, rsp->gp_state >= GP_PASSED);
>
> I would feel better if all the references to ->gp_state and ->gp_count
> outside of the lock be READ_ONCE(). Compilers can be tricky beasts.
Oh. I won't argue too much, but I do not agree.
Because I started to hate these _ONCE() helpers a long ago. Why? Because
(imo) we have too many XXX_ONCE() added "just in case, because compiler
can be buggy".
Now, when I look some particular READ_ONCE() I can almost never understand
the reason: do we really need it for correctness? or this is another "it
does not hurt" case?
So why do we need READ_ONCE() here? How this wait_event() differs from
any other wait_event() which does a plain LOAD? Should we "fix" them all?
> rcu_sync_enter() checks to see if the rcu_sync structure is in writer
> mode, and, if not, initiates a transition to writer mode and waits for
> that transition to complete. Either way, a writer reference is acquired.
>
> > void rcu_sync_enter(struct rcu_sync *rsp)
> > {
> > if (__rcu_sync_enter(rsp))
> > __rcu_sync_wait(rsp);
> > }
Yes. And of course it could call __rcu_sync_wait() unconditionally:
__rcu_sync_enter(rsp);
__rcu_sync_wait(rsp);
This "if" is more the documentation than optimization.
> Here is what I believe rcu_sync_exit() does:
>
> rcu_sync_exit() removes an updater reference, and if there are no more,
> starts the transition to reader mode. Note that this must handle
> transitions back to writer mode that occur before the transition to
> reader mode has fully completed.
>
> > void rcu_sync_exit(struct rcu_sync *rsp)
> > {
> > BUG_ON(rsp->gp_state == GP_IDLE);
>
> BUG_ON(READ_ONCE(rsp->gp_state) == GP_ENTER);
>
> If you do allow rcu_sync_exit() while GP_ENTER, then all the readers
> can do rcu_sync_exit() in that state, leaving the state at GP_PASSED
> instead of the desired GP_IDLE.
Can't understand this too...
Firstly, readers do not do rcu_sync_enter/exit. They only use
rcu_sync_is_idle() which checks gp_state == GP_IDLE.
As for enter() path, we only need to ensure that after __rcu_sync_wait()
returns, all CPU's must see gp_state != GP_IDLE. Yes, yes, this is not
really true, we also need the memory barriers implied by RCU, lets ignore
this to simplify the discussion.
Now lets suppose that rcu_sync_exit() is called while GP_ENTER. This is
only possible if __rcu_sync_wait() was not called by us. IOW, this thread
does
__rcu_sync_enter();
rcu_sync_exit();
by any reason.
Now. If gp_count is not zero after we decrement it, rcu_sync_exit() does
nothing.
If it is zero, we could set GP_IDLE right now. We do not care about the
readers in this case, exactly because __rcu_sync_wait() was not called
and thus we can't even know if other CPU's had any chance to observe the
gp_state != GP_IDLE state set by __rcu_sync_enter().
But. We can't actually set GP_IDLE, because GP_ENTER means that the RCU
callback is already pending and this can break the next enter(). IOW, this
would break the "GP_IDLE is owned by __rcu_sync_enter" contract.
And we do not need to do this! We rely on the already pending rcu_sync_func()
which will set GP_IDLE _if_ the counter is still zero.
Now suppose that another writer (__rcu_sync_enter) comes and increments
gp_count right before rcu_sync_func() is called by RCU. In this case
rcu_sync_func() will notice rsp->gp_count != 0 and correctly set GP_PASSED.
Note that in this case that new writer won't wait for a full GP, quite
possibly __rcu_sync_wait() won't even block. And this is correct, we know
that we have at least one full GP after the previous GP_IDLE -> GP_ENTER
transition.
> And here is what I believe rcu_sync_dtor() does:
>
> rcu_sync_dtor() cleans up at end, waiting for a full transition back
> into reader mode. Anything invoking rcu_sync_dtor() must have ensured
> that there will be no more calls to rcu_sync_enter() and rcu_sync_exit().
Sure, we are going to destroy/free this object.
> > void rcu_sync_dtor(struct rcu_sync *rsp)
> > {
> > int gp_state;
> >
> > BUG_ON(rsp->gp_count);
> > BUG_ON(rsp->gp_state == GP_PASSED);
> >
> > spin_lock_irq(&rsp->rss_lock);
> > if (rsp->gp_state == GP_REPLAY)
> > rsp->gp_state = GP_EXIT;
>
> OK, this ensures that the .wait() below will wait for the callback, but
> it might result in some RCU read-side critical sections still being in
> flight after rcu_sync_dtor() completes.
Hmm. Obviously, the caller should prevent this somehow or it is simply
buggy. Or I misunderstood.
> In other words, any transition
> started by an rcu_sync_exit() will normally wait until all pre-existing
> RCU readers complete -- unless that rcu_sync_exit() set the state to
> GP_REPLAY and was quickly followed by rcu_sync_dtor().
Again, we are going to (say) free this memory. The caller must ensure
that rcu_sync_is_idle(rsp) or anything else which can touch this object
is not possible.
The only reason for rcu_sync_dtor() is that we can't free/destroy this
memory until we ensure that the pending rcu_sync_func() finishes. And
of course we need to ensure it won't re-arm itself, that is why we
do GP_REPLAY -> GP_EXIT transition.
And in any case, this patch doesn't change rcu_sync_dtor() logic, it
is the same even if ->cb_state goes away.
> We need to add the first wakeup before we can add another one. ;-)
Yes ;) see above.
> But I don't see why we need a second wakeup, given that rcu_barrier()
> wakes itself up.
>
> > // use wait_event + spin_lock_wait, remove gp_ops->wait().
> > if (gp_state != GP_IDLE) {
> > gp_ops[rsp->gp_type].wait();
> > BUG_ON(rsp->gp_state != GP_IDLE);
> > }
> > }
Because we can remove .wait() op and simplify the code a bit more. This
also can be more efficient, altough this doesn't really matter. The main
reason is that this would be more clear: we only need to sync with one
particular callback, we do not need to wait for all outstanding RCU
callbacks to complete. Anyway, this is TODO and needs another patch, we
will discuss this later.
> And please see below for the state table and assumptions.
I'll try to read it carefully later.
Thanks Paul!
Oleg.