Re: [PATCH] rcu_sync: simplify the state machine, introduce __rcu_sync_enter()

From: Paul E. McKenney
Date: Wed Jul 20 2016 - 17:31:32 EST


On Wed, Jul 20, 2016 at 07:16:03PM +0200, Oleg Nesterov wrote:
> Paul, I had to switch to internal bugzillas after the first email, and now
> I feel I can't read... I'll try to answer one question right now, tomorrow
> I'll reread your email, probably I need to answer something else...

I know that feeling of distraction!

> On 07/19, Paul E. McKenney wrote:
> >
> > On Sat, Jul 16, 2016 at 07:10:07PM +0200, Oleg Nesterov wrote:
> >
> > > And, there is another possible transition, GP_ENTER -> GP_IDLE, because
> > > not it is possible to call __rcu_sync_enter() and rcu_sync_exit() in any
> > > state (except obviously they should be balanced), and they do not block.
> ...
> > If you feel strongly about allowing rcu_sync_exit() in GP_ENTER state,
> > could you please tell me your use case? Or am I confused?
>
> See below,
>
> > And I think __rcu_sync_enter() can have more users. Let's look at
> > freeze_super(). It calls percpu_down_write() 3 times, and waits for 3 GP's
> > sequentally.
> >
> > > Now we can add 3 __rcu_sync_enter's at the start and 3 rcu_sync_exit's at
> > > the end (actually we can do better, just to simplify). And again, note
> > > that rcu_sync_exit() will work correctly even if we (say) return -EBUSY,
> > > so rcu_sync_wait and/or percpu_down_write() was not called in between,
> > > and in this case we won't block waiting for GP.
> > >
> >
> > I am not going to claim to understand freeze_super(), but it does seem
> > to have a fair amount of waiting.
> >
> > But yes, you could put rcu_sync_enter()
> ^^^^^^^^^^^^^^
> __rcu_sync_enter, see below,
>
> > and rcu_sync_exit() before and
> > after a series of write-side enter/exit pairs in order to force things
> > to stay in writer mode, if that is what you are suggesting.
>
> No, no, this is not what I am trying to suggest.
>
> The problem is that freeze_super() takes 3 semaphores for writing in row,
> this means that it needs to wait for 3 GP's sequentally, and it does this
> with sb->s_umount held. This is just ugly.

Slow, too. ;-)

> OK, lets suppose it simply does
>
> freeze_super(sb)
> {
> down_write(&sb->s_umount);
> if (NEED_TO_FREEZE) {
> percpu_down_write(SEM1);
> percpu_down_write(SEM2);
> percpu_down_write(SEM3);
> }
> up_write(&sb->s_umount);
> }
>
> and every percpu_down_write() waits for GP.
>
> Now, suppose we add the additional enter/exit's:
>
> freeze_super(sb)
> {
> // this doesn't block
> __rcu_sync_enter(SEM3);
> __rcu_sync_enter(SEM2);
> __rcu_sync_enter(SEM1);
>
> down_write(&sb->s_umount);
> if (NEED_TO_FREEZE) {
> percpu_down_write(SEM1);

The above waits for the grace period initiated by __rcu_sync_enter(),
correct? Presumably "yes", because it will invoke rcu_sync_enter(), which
will see the state as GP_ENTER, and will thus wait. Or am I confused?

> percpu_down_write(SEM2);
> percpu_down_write(SEM3);
> }
> up_write(&sb->s_umount);
>
> rcu_sync_exit(SEM1);

But your point is that if !NEED_TO_FREEZE, we will get here without
waiting for a grace period.

But why aren't the __rcu_sync_enter() and rcu_sync_exit() calls inside
the "if" statement?

> rcu_sync_exit(SEM2);
> rcu_sync_exit(SEM3);

And the reason it is OK to invoke rcu_sync_exit() before doing
the percpu_up_write() calls is that those calls will do their own
rcu_sync_exit() calls, and the above calls really don't do anything
other than decrement the count.

>
> }
>
> Again, actually we can do better, just to simplify.
>
> Now. the fisrt percpu_down_write(SEM1) can block waiting for GP or not,
> this depends on how many time it spends in down_write().
>
> But the 2nd and the 3rd percpu_down_write() most likely won't block, so
> in the likely case freeze_super() will need a single GP pass.
>
> And note that NEED_TO_FREEZE can be false, in this case rcu_sync_exit()
> will be called in GP_ENTER state.

Understood, the above approach allows the three grace periods to
elapse concurrently, at least assuming RCU is busy at the time.
(If RCU is not busy, the first one gets its own grace period, and
the other two share the next grace period.)

But again why aren't the __rcu_sync_enter() and rcu_sync_exit() calls
inside that "if" statement?

That aside, would it make sense to name __rcu_sync_enter() something
like rcu_sync_begin_to_enter(), rcu_sync_pre_enter() or some such?
Something to make it clear that it just starts the job and that something
else is needed to finish it.

> To some degree, this is like get_state_synchronize_rcu/cond_synchronize_rcu.
> But obviously percpu_down_write() can not use these helpers, and in this
> particular case __rcu_sync_enter() is better because it forces the start
> of GP pass.

OK, I now understand the motivation, thank you for your patience!

> -----------------------------------------------------------------------
> As for cgroups, we want to switch cgroup_threadgroup_rwsem into the
> slow mode, at least for now.
>
> We could add the additional hooks/hacks into rcu/sync.c but why? We can
> do this without any changes outside of cgroup.c right now, just add
> rcu_sync_enter() into cgroup_init().
>
> But we do not want to add a pointless synchronize_sched() at boot time,
> __rcu_sync_enter() looks much better.

I do like that approach much better than an rcu_sync_sabotage().

I still have concern with the __rcu_sync_enter() name, and would prefer
something that clearly indicates that it starts the rcu_sync() entry
job, but does not synchronously complete it.

> ------------------------------------------------------------------------
> And even __cgroup_procs_write() could use __rcu_sync_enter(). But lets
> ignore this for now.

And here is an updated state table. I do not yet separately call out
__rcu_sync_enter(), though without it the rcu_sync_exit() transition
out of state B cannot happen.

Thanx, Paul

------------------------------------------------------------------------

o "count" is the value of the rcu_sync structure's ->gp_count field.

o "state" is the value of the rcu_sync structure's ->gp_state field.

o "CB?" is "yes" if there is an RCU callback pending and "no" otherwise.



| count | state | CB? | next state
---+-------+-----------+-----+------------------------------------
| | | |
A | 0 | GP_IDLE | no | rcu_sync_enter() -> B (wait)
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> H
| | | | callback -> cannot happen
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
B | 1+ | GP_ENTER | yes | rcu_sync_enter() -> B (wait)
| | | | last rcu_sync_exit() -> J
| | | | non-last rcu_sync_exit() -> B
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> C (ends _enter() wait)
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
C | 1+ | GP_PASSED | no | rcu_sync_enter() -> C
| | | | last rcu_sync_exit() -> E
| | | | non-last rcu_sync_exit() -> C
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> cannot happen
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
D | 0 | GP_REPLAY | yes | rcu_sync_enter() -> F
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> I (wait ???)
| | | | callback -> E
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
E | 0 | GP_EXIT | yes | rcu_sync_enter() -> G
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> I (wait)
| | | | callback -> A
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
F | 1+ | GP_REPLAY | yes | rcu_sync_enter() -> F
| | | | last rcu_sync_exit() -> D
| | | | non-last rcu_sync_exit() -> F
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> C
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
G | 1+ | GP_EXIT | yes | rcu_sync_enter() -> G
| | | | last rcu_sync_exit() -> D
| | | | non-last rcu_sync_exit() -> F
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> C
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
H | 0 | GP_IDLE | no | rcu_sync_enter() -> illegal
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> cannot happen
| | | |
---+-------+-----------+-----+------------------------------------
| | | |
I | 0 | GP_EXIT | yes | rcu_sync_enter() -> illegal
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> H (ends dtor() wait)
---+-------+-----------+-----+------------------------------------
| | | |
J | 0 | GP_ENTER | yes | rcu_sync_enter() -> B (wait)
| | | | rcu_sync_exit() -> illegal
| | | | rcu_sync_dtor() -> illegal
| | | | callback -> A
| | | |


Assumptions:

o Initial state is A.

o Final state is H.

o There will never be enough unpaired rcu_sync_enter() calls to
overflow ->gp_count.

o All calls to rcu_sync_exit() must pair with a preceding call
to rcu_sync_enter() by that same thread.

o It is illegal to invoke rcu_sync_dtor() until after the caller
has ensured that there will be no future calls to either
rcu_sync_enter() or rcu_sync_exit().

o It is illegal to invoke rcu_sync_dtor() while there are any
unpaired calls to rcu_sync_enter().