Re: [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter()

From: Paul E. McKenney
Date: Tue Jul 19 2016 - 16:50:07 EST

Next message: RafaÅ MiÅecki: "[PATCH 1/2] mtd: add support for partition parsers"
Previous message: Rustad, Mark D: "Re: [Intel-wired-lan] [PATCH net] e1000e: fix PTP on e1000_pch_lpt variants"
In reply to: Oleg Nesterov: "Re: [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter()"
Next in thread: Oleg Nesterov: "Re: [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Jul 16, 2016 at 07:10:07PM +0200, Oleg Nesterov wrote:
> On 07/15, Paul E. McKenney wrote:
> >
> > On Fri, Jul 15, 2016 at 06:49:39PM +0200, Oleg Nesterov wrote:
> > >
> > > IOW, please ignore 2/2 which adds PERCPU_RWSEM_READER, the new version
> > > just adds rcu_sync_sabotage() which should be renamed (and use GP_PASSED).
> >
> > OK, then move the checks out into the callers that would have used
> > __rcu_sync_enter(). ;-)
>
> Cough, now I can't understand which check do you mean ;) OK, let me
> show the code, then we will hopefully understand each other.

The check in an earlier patch having to do with some _NONE value. I don't
see any such checks here. Which might be a good thing, who knows? ;-)

> ----------------------------------------------------------------------
> So, as you can see I have fooled you ;) I'll send the patch on top of
> Peter's changes, this is the (UNTESTED) code with the patch applied.
>
> Peter, Paul, could you review? Do you see any hole?

Please see below. I see what looks like a missing wakeup and something
that could fail to wait for pre-existing RCU readers. At the very end,
I have a somewhat more elaborate state diagram, along with assumptions
I made during review. Please do check these assumptions carefully.

And the state table, for that matter...

> Why. Firstly, note that the state machine was greatly simplified, and
> rsp->cb_state has gone, we have a single "state" variable, gp_state.
> Note also the ->sync() op has gone (and actually ->wait() too, see the
> "TODO" comment).
>
> GP_IDLE - owned by __rcu_sync_enter() which can only move this state to
>
> GP_ENTER - owned by rcu-callback which moves it to
>
> GP_PASSED - owned by the last rcu_sync_exit() which moves it to
>
> GP_EXIT - owned by rcu-callback which moves it back to GP_IDLE.
>
> Yes, this is a bit simplified, we also have GP_REPLAY, but hopefully
> clear.

Agreed, given that you are ignoring GP_REPLAY.

> And, there is another possible transition, GP_ENTER -> GP_IDLE, because
> not it is possible to call __rcu_sync_enter() and rcu_sync_exit() in any
> state (except obviously they should be balanced), and they do not block.

Hmmm... I am assuming that the only threads allowed to invoke
rcu_sync_exit() are those that have executed an unpaired rcu_sync_enter().
With that assumption, rcu_sync_enter() is illegal if GP_IDLE or GP_ENTER,
because the rcu_sync_enter() calls won't return until the state reaches
GP_PASSED.

If you feel strongly about allowing rcu_sync_exit() in GP_ENTER state,
could you please tell me your use case? Or am I confused?

> The only blocking call is __rcu_sync_wait() which actually waits for GP.
> Obviously should only be called if gp_count != 0, iow after __rcu_sync_enter.

Well, rcu_sync_dtor() as well. But to your point, it is blocking in
rcu_barrier(), not in wait_event().

> --------------------------------------------------------------------------
> Now, cgroup_init() can simply call __rcu_sync_enter(&cgroup_threadgroup_rwsem)
> and switch this sem into the slow mode. Compared to "sabotage" from Peter
> this implies the unnecessary call_rcu_sched(), but I hope we can tolerate
> this.

Or you could add a flag to rcu_sync_init() and friends, but I vaguely
recall some objection to that.

> And we can even add a runtime knob to switch between "fast" and "slow
> aka writer-biased" modes for cgroup_threadgroup_rwsem.

This might well be the check that I was objecting to earlier. ;-)

> --------------------------------------------------------------------------
> And I think __rcu_sync_enter() can have more users. Let's look at
> freeze_super(). It calls percpu_down_write() 3 times, and waits for 3 GP's
> sequentally.
>
> Now we can add 3 __rcu_sync_enter's at the start and 3 rcu_sync_exit's at
> the end (actually we can do better, just to simplify). And again, note
> that rcu_sync_exit() will work correctly even if we (say) return -EBUSY,
> so rcu_sync_wait and/or percpu_down_write() was not called in between,
> and in this case we won't block waiting for GP.
>
> What do you think?

I am not going to claim to understand freeze_super(), but it does seem
to have a fair amount of waiting.

But yes, you could put rcu_sync_enter() and rcu_sync_exit() before and
after a series of write-side enter/exit pairs in order to force things
to stay in writer mode, if that is what you are suggesting.

> Oleg.
> ---
>
> // rcu_sync.h: ----------------------------------------------------------------
>
> struct rcu_sync {
> int gp_state;
> int gp_count;
> wait_queue_head_t gp_wait;
>
> struct rcu_head cb_head;
> enum rcu_sync_type gp_type;
> };
>
> // sync.c ---------------------------------------------------------------------
> #include <linux/rcu_sync.h>
> #include <linux/sched.h>
>
> #ifdef CONFIG_PROVE_RCU
> #define __INIT_HELD(func) .held = func,
> #else
> #define __INIT_HELD(func)
> #endif
>
> static const struct {
> void (*call)(struct rcu_head *, void (*)(struct rcu_head *));
> void (*wait)(void); // TODO: remove this, see the comment in dtor
> #ifdef CONFIG_PROVE_RCU
> int (*held)(void);
> #endif
> } gp_ops[] = {
> [RCU_SYNC] = {
> .call = call_rcu,
> .wait = rcu_barrier,
> __INIT_HELD(rcu_read_lock_held)
> },
> [RCU_SCHED_SYNC] = {
> .call = call_rcu_sched,
> .wait = rcu_barrier_sched,
> __INIT_HELD(rcu_read_lock_sched_held)
> },
> [RCU_BH_SYNC] = {
> .call = call_rcu_bh,
> .wait = rcu_barrier_bh,
> __INIT_HELD(rcu_read_lock_bh_held)
> },
> };
>
> #define rss_lock gp_wait.lock

Interesting... Should there instead be a wait_queue_lock() and
wait_queue_unlock()? If someone changes how wait-queue locking work,
this one might be a bit tricky.

> enum { GP_IDLE = 0, GP_ENTER, GP_PASSED, GP_EXIT, GP_REPLAY };
>
> static void rcu_sync_func(struct rcu_head *rcu);
>
> static void rcu_sync_call(struct rcu_sync *rsp)
> {
> // TODO: THIS IS SUBOPTIMAL. We want to call it directly
> // if rcu_blocking_is_gp() == T, but it has might_sleep().

Careful! This optimization works only for RCU-sched and RCU-bh.
With normal RCU, you can be tripped up by tasks preempted within RCU
read-side critical sections should CONFIG_PREEMPT=y.

> gp_ops[rsp->gp_type].call(&rsp->cb_head, rcu_sync_func);
> }
>
> static void rcu_sync_func(struct rcu_head *rcu)
> {
> struct rcu_sync *rsp = container_of(rcu, struct rcu_sync, cb_head);
> unsigned long flags;
>
> BUG_ON(rsp->gp_state == GP_IDLE);

If you really want rcu_sync_exit() to be invoked before the corresponding
rcu_sync_enter() returns, and if you want that to set the state to
GP_IDLE, then the above BUG_ON() can fire. This is why I was assuming
that rcu_sync_enter() must return before the corresponding rcu_sync_exit()
can start.

> BUG_ON(rsp->gp_state == GP_PASSED);

READ_ONCE() would feel better in both of the above BUG_ON()s, given that
we are not yet holding the lock.

> spin_lock_irqsave(&rsp->rss_lock, flags);
> if (rsp->gp_count) {
> /*
> * We're at least a GP after the first __rcu_sync_enter().
> */

Don't we need a wakeup here?

> rsp->gp_state = GP_PASSED;
> } else if (rsp->gp_state == GP_REPLAY) {
> /*
> * A new rcu_sync_exit() has happened; requeue the callback
> * to catch a later GP.
> */
> rsp->gp_state = GP_EXIT;
> rcu_sync_call(rsp);
> } else {
> /*
> * We're at least a GP after the last rcu_sync_exit();
> * EVeybody will now have observed the write side critical
> * section. Let 'em rip!.
> *
> * OR. ->gp_state can be still GP_ENTER if __rcu_sync_wait()
> * wasn't called after __rcu_sync_enter(), abort.
> */
> rsp->gp_state = GP_IDLE;
> }
> spin_unlock_irqrestore(&rsp->rss_lock, flags);
> }
>
> bool __rcu_sync_enter(struct rcu_sync *rsp)
> {
> int gp_count, gp_state;
>
> spin_lock_irq(&rsp->rss_lock);
> gp_count = rsp->gp_count++;
> gp_state = rsp->gp_state;
> if (gp_state == GP_IDLE) {
> rsp->gp_state = GP_ENTER;
> rcu_sync_call(rsp);
> }
> spin_unlock_irq(&rsp->rss_lock);
>
> BUG_ON(gp_count != 0 && gp_state == GP_IDLE);
> BUG_ON(gp_count == 0 && gp_state == GP_PASSED);

Isn't gp_count==0 illegal here regardless of gp_state value?

> return gp_state < GP_PASSED;

And the above statement is another reason why I believe that it
should be illegal to invoke rcu_sync_exit() until after the matching
rcu_sync_enter() has returned. If we were preempted just before the
above "return" statement, we might be in GP_IDLE state upon return,
which could fatally disappoint the caller.

And ditto for the next BUG_ON().

> }
>
> void __rcu_sync_wait(struct rcu_sync *rsp)
> {
> BUG_ON(rsp->gp_state == GP_IDLE);
> BUG_ON(rsp->gp_count == 0);
>
> wait_event(rsp->gp_wait, rsp->gp_state >= GP_PASSED);

I would feel better if all the references to ->gp_state and ->gp_count
outside of the lock be READ_ONCE(). Compilers can be tricky beasts.

> }

Here is what I believe rcu_sync_enter() does:

rcu_sync_enter() checks to see if the rcu_sync structure is in writer
mode, and, if not, initiates a transition to writer mode and waits for
that transition to complete. Either way, a writer reference is acquired.

> void rcu_sync_enter(struct rcu_sync *rsp)
> {
> if (__rcu_sync_enter(rsp))
> __rcu_sync_wait(rsp);
> }

Here is what I believe rcu_sync_exit() does:

rcu_sync_exit() removes an updater reference, and if there are no more,
starts the transition to reader mode. Note that this must handle
transitions back to writer mode that occur before the transition to
reader mode has fully completed.

> void rcu_sync_exit(struct rcu_sync *rsp)
> {
> BUG_ON(rsp->gp_state == GP_IDLE);

BUG_ON(READ_ONCE(rsp->gp_state) == GP_ENTER);

If you do allow rcu_sync_exit() while GP_ENTER, then all the readers
can do rcu_sync_exit() in that state, leaving the state at GP_PASSED
instead of the desired GP_IDLE.

> BUG_ON(rsp->gp_count == 0);
>
> spin_lock_irq(&rsp->rss_lock);
> if (!--rsp->gp_count) {
> if (rsp->gp_state == GP_PASSED) {
> rsp->gp_state = GP_EXIT;
> rcu_sync_call(rsp);
> } else if (rsp->gp_state == GP_EXIT) {
> rsp->gp_state = GP_REPLAY;
> }
> }
> spin_unlock_irq(&rsp->rss_lock);
> }

And here is what I believe rcu_sync_dtor() does:

rcu_sync_dtor() cleans up at end, waiting for a full transition back
into reader mode. Anything invoking rcu_sync_dtor() must have ensured
that there will be no more calls to rcu_sync_enter() and rcu_sync_exit().

> void rcu_sync_dtor(struct rcu_sync *rsp)
> {
> int gp_state;
>
> BUG_ON(rsp->gp_count);
> BUG_ON(rsp->gp_state == GP_PASSED);
>
> spin_lock_irq(&rsp->rss_lock);
> if (rsp->gp_state == GP_REPLAY)
> rsp->gp_state = GP_EXIT;

OK, this ensures that the .wait() below will wait for the callback, but
it might result in some RCU read-side critical sections still being in
flight after rcu_sync_dtor() completes. In other words, any transition
started by an rcu_sync_exit() will normally wait until all pre-existing
RCU readers complete -- unless that rcu_sync_exit() set the state to
GP_REPLAY and was quickly followed by rcu_sync_dtor().

Is this really OK???

You might get a more robust API if you left the value of ->gp_state
alone and rechecked ->gp_state below (under the lock). The additional
overhead is incurred only at cleanup, so should not be a problem.

Or am I missing some subtle constraint on use cases here?

> gp_state = rsp->gp_state;
> spin_unlock_irq(&rsp->rss_lock);
>
> // TODO: add another wake_up_locked() into rcu_sync_func(),

We need to add the first wakeup before we can add another one. ;-)

But I don't see why we need a second wakeup, given that rcu_barrier()
wakes itself up.

> // use wait_event + spin_lock_wait, remove gp_ops->wait().
> if (gp_state != GP_IDLE) {
> gp_ops[rsp->gp_type].wait();
> BUG_ON(rsp->gp_state != GP_IDLE);
> }
> }
>

And please see below for the state table and assumptions.

So, what am I missing?

Thanx, Paul

------------------------------------------------------------------------

o "count" is the value of the rcu_sync structure's ->gp_count field.

o "state" is the value of the rcu_sync structure's ->gp_state field.

o "CB?" is "yes" if there is an RCU callback pending and "no" otherwise.

| count | state | CB? | next state
---+-------+-----------+-----+------------------------------------
| | | |
A | 0 | GP_IDLE | no | rcu_sync_enter() -> B (wait)
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> H
| | | | callback -> cannot happen
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
B | 1+ | GP_ENTER | yes | rcu_sync_enter() -> B (wait)
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> C (ends _enter() wait)
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
C | 1+ | GP_PASSED | no | rcu_sync_enter() -> C
| | | | last rcu_sync_exit() -> E
| | | | non-last rcu_sync_exit() -> C
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> cannot happen
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
D | 0 | GP_REPLAY | yes | rcu_sync_enter() -> F
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> I (wait ???)
| | | | callback -> E
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
E | 0 | GP_EXIT | yes | rcu_sync_enter() -> G
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> I (wait)
| | | | callback -> A
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
F | 1+ | GP_REPLAY | yes | rcu_sync_enter() -> F
| | | | last rcu_sync_exit() -> D
| | | | non-last rcu_sync_exit() -> F
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> C
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
G | 1+ | GP_EXIT | yes | rcu_sync_enter() -> G
| | | | last rcu_sync_exit() -> D
| | | | non-last rcu_sync_exit() -> F
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> C
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
H | 0 | GP_IDLE | no | rcu_sync_enter() -> illegal
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> cannot happen
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
I | 0 | GP_EXIT | yes | rcu_sync_enter() -> illegal
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> H (ends dtor() wait)

Assumptions:

o Initial state is A.

o Final state is H.

o There will never be enough unpaired rcu_sync_enter() calls to
overflow ->gp_count.

o All calls to rcu_sync_exit() must pair with a preceding call
to rcu_sync_enter() by that same thread.

o It is illegal to invoke rcu_sync_dtor() until after the caller
has ensured that there will be no future calls to either
rcu_sync_enter() or rcu_sync_exit().

o It is illegal to invoke rcu_sync_dtor() while there are any
unpaired calls to rcu_sync_enter().

Next message: RafaÅ MiÅecki: "[PATCH 1/2] mtd: add support for partition parsers"
Previous message: Rustad, Mark D: "Re: [Intel-wired-lan] [PATCH net] e1000e: fix PTP on e1000_pch_lpt variants"
In reply to: Oleg Nesterov: "Re: [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter()"
Next in thread: Oleg Nesterov: "Re: [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]