Re: Simplifying our RCU models

From: Eric W. Biederman
Date: Mon Mar 05 2018 - 09:34:15 EST



Moving this discussion to a public list as discussing how to reduce the
number of rcu variants does not make sense in private. We should have
an archive of such discussions.

Ingo Molnar <mingo@xxxxxxxxxx> writes:

> * Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
>
>> > So if people really want that low-cost RCU, and some people really
>> > need the sleepable version, the only one that can _possibly_ be dumped
>> > is the preempt one.
>> >
>> > But I may - again - be confused and/or missing something.
>>
>> I am going to do something very stupid and say that I was instead thinking in
>> terms of getting rid of RCU-bh, thus reminding you of its existence. ;-)
>>
>> The reason for believing that it is possible to get rid of RCU-bh is the work
>> that has gone into improving the forward progress of RCU grace periods under
>> heavy load and in corner-case workloads.
>>
>
> [...]
>
>> The other reason for RCU-sched is it has the side effect of waiting
>> for all in-flight hardware interrupt handlers, NMI handlers, and
>> preempt-disable regions of code to complete, and last I checked, this side
>> effect is relied on. In contrast, RCU-preeempt is only guaranteed to wait
>> on regions of code protected by rcu_read_lock() and rcu_read_unlock().
>
> Instead of only trying to fix the documentation (which is never a bad idea but it
> is fighting the symptom in this case), I think the first step should be to
> simplify the RCU read side APIs of RCU from 4 APIs:
>
> rcu_read_lock()
> srcu_read_lock()
> rcu_read_lock_sched()
> rcu_read_lock_bh()
>
> ... which have ~8 further sub-model variations depending on CONFIG_PREEMPT,
> CONFIG_PREEMPT_RCU - which is really a crazy design!
>
> I think we could reduce this to just two APIs with no Kconfig dependencies:
>
> rcu_read_lock()
> rcu_read_lock_preempt_disable()
>
> Which would be much, much simpler.
>
> This is how we could do it I think:
>
> 1)
>
> Getting rid of the _bh() variant should be reasonably simple and involve a
> treewide replacement of:
>
> rcu_read_lock_bh() -> local_bh_disable()
> rcu_read_unlock_bh() -> local_bh_enable()
>
> Correct?
>
> 2)
>
> Further reducing the variants is harder, due to this main asymmetry:
>
> !PREEMPT_RCU PREEMPT_RCU=y
> rcu_read_lock_sched(): atomic atomic
> rcu_read_lock(): atomic preemptible
>
> ('atomic' here is meant in the scheduler, non-preemptible sense.)
>
> But if we look at the bigger API picture:
>
> !PREEMPT_RCU PREEMPT_RCU=y
> rcu_read_lock(): atomic preemptiblep
> rcu_read_lock_sched(): atomic atomic
> srcu_read_lock(): preemptible preemptible
>
> Then we could maintain full read side API flexibility by making PREEMPT_RCU=y the
> only model, merging it with SRCU and using these main read side APIs:
>
> rcu_read_lock_preempt_disable((): atomic
> rcu_read_lock() preemptible
>
> It's a _really_ simple and straightforward RCU model, with very obvious semantics
> all around:
>
> - Note how the 'atomic' (non-preempt) variant uses the well-known
> preempt_disable() name as a postfix to signal its main property. (It's also a
> bit of a mouthful, which should discourage over-use.)
>
> - The read side APIs are really as straightforward as possible: there's no SRCU
> distinction on the read side, no _bh() distinction and no _sched() distinction.
> (On -rt all of these would turn into preemptible sections,
> obviously.)

And it looses the one advantage of srcu_read_lock. That you don't have
to wait for the entire world. If you actually allow sleeping that is an
important distinction to have. Or are you proposing that we add the
equivalent of init_srcu_struct to all of the rcu users?

That rcu_read_lock would need to take an argument about which rcu region
we are talking about.

> rcu_read_lock_preempt_disable() would essentially be all the current
> rcu_read_lock_sched() users (where the _sched() postfix was a confusing misnomer
> anyway).
>
> Wrt. merging SRCU and RCU: this can be done by making PREEMPT_RCU=y the one and
> only main RCU model and converting all SRCU users to main RCU. This is relatively
> straightforward to perform, as there are only ~170 SRCU critical sections, versus
> the 3000+ main RCU critical sections ...

It really sounds like you are talking about adding a requirement that
everyone update their rcu_read_lock() calls with information about which
region you are talking about. That seems like quite a bit of work.

Doing something implicit when PREEMPT_RCU=y and converting
"rcu_read_lock()" to "srcu_read_lock(&kernel_srcu_region)" only in that
case I can see.

Except in very specific circustances I don't think I ever want to run a
kernel with PREEMPT_RCU the default. All of that real time stuff trades
off predictability with performance. Having lost enough performance to
spectre and meltdown I don't think it makes sense for us all to start
runing predictable^H^H^H^H^H^H^H^H^H^H^H time kernels now.

> AFAICS this should be a possible read side design that keeps correctness, without
> considering grace period length patterns, i.e. without considering GC latency and
> scalability aspects.
>
> Before we get into ways to solve the latency and scalability aspects of such a
> simplified RCU model, do you agree with this analysis so far, or have I missed
> something important wrt. correctness?

RCU region specification. If we routinely allow preemption of rcu
critical sections for any length of time I can't imagine we will want to
wait for every possible preempted rcu critical section.

Of course I could see the merge working the other way. Adding the
debugging we need to find rcu critical secions that are held to long and
shrinking them so we don't need PREEMPT_RCU at all.

Eric