Re: Simplifying our RCU models

From: Paul E. McKenney
Date: Mon Mar 05 2018 - 11:14:26 EST


On Mon, Mar 05, 2018 at 08:33:20AM -0600, Eric W. Biederman wrote:
>
> Moving this discussion to a public list as discussing how to reduce the
> number of rcu variants does not make sense in private. We should have
> an archive of such discussions.
>
> Ingo Molnar <mingo@xxxxxxxxxx> writes:
>
> > * Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> >
> >> > So if people really want that low-cost RCU, and some people really
> >> > need the sleepable version, the only one that can _possibly_ be dumped
> >> > is the preempt one.
> >> >
> >> > But I may - again - be confused and/or missing something.
> >>
> >> I am going to do something very stupid and say that I was instead thinking in
> >> terms of getting rid of RCU-bh, thus reminding you of its existence. ;-)
> >>
> >> The reason for believing that it is possible to get rid of RCU-bh is the work
> >> that has gone into improving the forward progress of RCU grace periods under
> >> heavy load and in corner-case workloads.
> >>
> >
> > [...]
> >
> >> The other reason for RCU-sched is it has the side effect of waiting
> >> for all in-flight hardware interrupt handlers, NMI handlers, and
> >> preempt-disable regions of code to complete, and last I checked, this side
> >> effect is relied on. In contrast, RCU-preeempt is only guaranteed to wait
> >> on regions of code protected by rcu_read_lock() and rcu_read_unlock().
> >
> > Instead of only trying to fix the documentation (which is never a bad idea but it
> > is fighting the symptom in this case), I think the first step should be to
> > simplify the RCU read side APIs of RCU from 4 APIs:
> >
> > rcu_read_lock()
> > srcu_read_lock()
> > rcu_read_lock_sched()
> > rcu_read_lock_bh()
> >
> > ... which have ~8 further sub-model variations depending on CONFIG_PREEMPT,
> > CONFIG_PREEMPT_RCU - which is really a crazy design!

If it is possible to set CONFIG_PREEMPT_RCU differently than CONFIG_PREEMPT,
then that is a bug that I need to fix.

> > I think we could reduce this to just two APIs with no Kconfig dependencies:
> >
> > rcu_read_lock()
> > rcu_read_lock_preempt_disable()
> >
> > Which would be much, much simpler.

No argument on the simpler part, at least from an API perspective.

> > This is how we could do it I think:
> >
> > 1)
> >
> > Getting rid of the _bh() variant should be reasonably simple and involve a
> > treewide replacement of:
> >
> > rcu_read_lock_bh() -> local_bh_disable()
> > rcu_read_unlock_bh() -> local_bh_enable()
> >
> > Correct?

Assuming that I have done enough forward-progress work on grace periods, yes.

> > 2)
> >
> > Further reducing the variants is harder, due to this main asymmetry:
> >
> > !PREEMPT_RCU PREEMPT_RCU=y
> > rcu_read_lock_sched(): atomic atomic
> > rcu_read_lock(): atomic preemptible
> >
> > ('atomic' here is meant in the scheduler, non-preemptible sense.)
> >
> > But if we look at the bigger API picture:
> >
> > !PREEMPT_RCU PREEMPT_RCU=y
> > rcu_read_lock(): atomic preemptiblep
> > rcu_read_lock_sched(): atomic atomic
> > srcu_read_lock(): preemptible preemptible
> >
> > Then we could maintain full read side API flexibility by making PREEMPT_RCU=y the
> > only model, merging it with SRCU and using these main read side APIs:
> >
> > rcu_read_lock_preempt_disable((): atomic
> > rcu_read_lock() preemptible

One issue with merging SRCU into rcu_read_lock() is the general blocking
within SRCU readers. Once merged in, these guys block everyone. We should
focus initially on the non-SRCU variants.

On the other hand, Linus's suggestion of merging rcu_read_lock_sched()
into rcu_read_lock() just might be feasible. If that really does pan
out, we end up with the following:

!PREEMPT PREEMPT=y
rcu_read_lock(): atomic preemptible
srcu_read_lock(): preemptible preemptible

In this model, rcu_read_lock_sched() maps to preempt_disable() and (as
you say above) rcu_read_lock_bh() maps to local_bh_disable(). The way
this works is that in PREEMPT=y kernels, synchronize_rcu() waits not
only for RCU read-side critical sections, but also for regions of code
with preemption disabled. The main caveat seems to be that there be an
assumed point of preemptibility between each interrupt and each softirq
handler, which should be OK.

There will be some adjustments required for lockdep-RCU, but that should
be reasonably straightforward.

Seem reasonable?

> > It's a _really_ simple and straightforward RCU model, with very obvious semantics
> > all around:
> >
> > - Note how the 'atomic' (non-preempt) variant uses the well-known
> > preempt_disable() name as a postfix to signal its main property. (It's also a
> > bit of a mouthful, which should discourage over-use.)

My thought is to eliminate the atomic variant entirely. If you want
to disable preemption, interrupts, or whatever, you simply do so.
It might turn out that there are documentation benefits to having a
separate rcu_read_lock_preempt_disable() that maps to preempt_disable()
with lockdep semantics, and if so, that can be provided trivially.

> > - The read side APIs are really as straightforward as possible: there's no SRCU
> > distinction on the read side, no _bh() distinction and no _sched() distinction.
> > (On -rt all of these would turn into preemptible sections,
> > obviously.)

Agreed, and both models accomplish that.

> And it looses the one advantage of srcu_read_lock. That you don't have
> to wait for the entire world. If you actually allow sleeping that is an
> important distinction to have. Or are you proposing that we add the
> equivalent of init_srcu_struct to all of the rcu users?

I am instead proposing folding rcu_read_lock_bh() and rcu_read_lock_sched()
into rcu_read_lock(), and leaving srcu_read_lock() separate.

> That rcu_read_lock would need to take an argument about which rcu region
> we are talking about.

>From what I can see, it would be far better to leave SRCU separate. As you
say, it really does have very different semantics.

> > rcu_read_lock_preempt_disable() would essentially be all the current
> > rcu_read_lock_sched() users (where the _sched() postfix was a confusing misnomer
> > anyway).

I agree that rcu_read_lock_preempt_disable() is a better name.
We might not need it at all, though. There are only about 20 uses of
rcu_read_lock_sched() in v4.15. ;-)

> > Wrt. merging SRCU and RCU: this can be done by making PREEMPT_RCU=y the one and
> > only main RCU model and converting all SRCU users to main RCU. This is relatively
> > straightforward to perform, as there are only ~170 SRCU critical sections, versus
> > the 3000+ main RCU critical sections ...
>
> It really sounds like you are talking about adding a requirement that
> everyone update their rcu_read_lock() calls with information about which
> region you are talking about. That seems like quite a bit of work.

Agreed, merging RCU, RCU-bh, and RCU-sched seems much more straightforward
to me from the viewpoint of both usage and implementation.

> Doing something implicit when PREEMPT_RCU=y and converting
> "rcu_read_lock()" to "srcu_read_lock(&kernel_srcu_region)" only in that
> case I can see.
>
> Except in very specific circustances I don't think I ever want to run a
> kernel with PREEMPT_RCU the default. All of that real time stuff trades
> off predictability with performance. Having lost enough performance to
> spectre and meltdown I don't think it makes sense for us all to start
> runing predictable^H^H^H^H^H^H^H^H^H^H^H time kernels now.

Yes, in PREEMPT=n kernels RCU would act exactly as it does today.

> > AFAICS this should be a possible read side design that keeps correctness, without
> > considering grace period length patterns, i.e. without considering GC latency and
> > scalability aspects.
> >
> > Before we get into ways to solve the latency and scalability aspects of such a
> > simplified RCU model, do you agree with this analysis so far, or have I missed
> > something important wrt. correctness?
>
> RCU region specification. If we routinely allow preemption of rcu
> critical sections for any length of time I can't imagine we will want to
> wait for every possible preempted rcu critical section.
>
> Of course I could see the merge working the other way. Adding the
> debugging we need to find rcu critical secions that are held to long and
> shrinking them so we don't need PREEMPT_RCU at all.

Again, from what I can see, merging rcu_read_lock(), rcu_read_lock_sched(),
and rcu_read_lock_bh() together should get us to a much better place.

Make sense, or am I missing something?

Thanx, Paul