Re: [patch V6 12/37] x86/entry: Provide idtentry_entry/exit_cond_rcu()
From: Paul E. McKenney
Date: Wed May 20 2020 - 14:11:45 EST
On Wed, May 20, 2020 at 10:47:29AM -0700, Andy Lutomirski wrote:
> On Wed, May 20, 2020 at 10:38 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> >
> > On Wed, May 20, 2020 at 08:36:06AM -0700, Andy Lutomirski wrote:
> > > On Tue, May 19, 2020 at 7:23 PM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
> > > > On Tue, May 19, 2020 at 05:26:58PM -0700, Andy Lutomirski wrote:
> > > > > On Tue, May 19, 2020 at 2:20 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> > > > > > Andy Lutomirski <luto@xxxxxxxxxx> writes:
> > > > > > > On Tue, May 19, 2020 at 1:20 PM Thomas Gleixner <tglx@xxxxxxxxxxxxx> wrote:
> > > > > > >> Thomas Gleixner <tglx@xxxxxxxxxxxxx> writes:
> > > > > > >> It's about this:
> > > > > > >>
> > > > > > >> rcu_nmi_enter()
> > > > > > >> {
> > > > > > >> if (!rcu_is_watching()) {
> > > > > > >> make it watch;
> > > > > > >> } else if (!in_nmi()) {
> > > > > > >> do_magic_nohz_dyntick_muck();
> > > > > > >> }
> > > > > > >>
> > > > > > >> So if we do all irq/system vector entries conditional then the
> > > > > > >> do_magic() gets never executed. After that I got lost...
> > > > > > >
> > > > > > > I'm also baffled by that magic, but I'm also not suggesting doing this
> > > > > > > to *all* entries -- just the not-super-magic ones that use
> > > > > > > idtentry_enter().
> > > > > > >
> > > > > > > Paul, what is this code actually trying to do?
> > > > > >
> > > > > > Citing Paul from IRC:
> > > > > >
> > > > > > "The way things are right now, you can leave out the rcu_irq_enter()
> > > > > > if this is not a nohz_full CPU.
> > > > > >
> > > > > > Or if this is a nohz_full CPU, and the tick is already
> > > > > > enabled, in that case you could also leave out the rcu_irq_enter().
> > > > > >
> > > > > > Or even if this is a nohz_full CPU and it does not have the tick
> > > > > > enabled, if it has been in the kernel less than a few tens of
> > > > > > milliseconds, still OK to avoid invoking rcu_irq_enter()
> > > > > >
> > > > > > But my guess is that it would be a lot simpler to just always call
> > > > > > it.
> > > > > >
> > > > > > Hope that helps.
> > > > >
> > > > > Maybe?
> > > > >
> > > > > Unless I've missed something, the effect here is that #PF hitting in
> > > > > an RCU-watching context will skip rcu_irq_enter(), whereas all IRQs
> > > > > (because you converted them) as well as other faults and traps will
> > > > > call rcu_irq_enter().
> > > > >
> > > > > Once upon a time, we did this horrible thing where, on entry from user
> > > > > mode, we would turn on interrupts while still in CONTEXT_USER, which
> > > > > means we could get an IRQ in an extended quiescent state. This means
> > > > > that the IRQ code had to end the EQS so that IRQ handlers could use
> > > > > RCU. But I killed this a few years ago -- x86 Linux now has a rule
> > > > > that, if IF=1, we are *not* in an EQS with the sole exception of the
> > > > > idle code.
> > > > >
> > > > > In my dream world, we would never ever get IRQs while in an EQS -- we
> > > > > would do MWAIT with IF=0 and we would exit the EQS before taking the
> > > > > interrupt. But I guess we still need to support HLT, which means we
> > > > > have this mess.
> > > > >
> > > > > But I still think we can plausibly get rid of the conditional.
> > > >
> > > > You mean the conditional in rcu_nmi_enter()? In a NO_HZ_FULL=n system,
> > > > this becomes:
> > >
> > > So, I meant the conditional in tglx's patch that makes page faults special.
> >
> > OK.
> >
> > > > > If we
> > > > > get an IRQ or (egads!) a fault in idle context, we'll have
> > > > > !__rcu_is_watching(), but, AFAICT, we also have preemption off.
> > > >
> > > > Or we could be early in the kernel-entry code or late in the kernel-exit
> > > > code, but as far as I know, preemption is disabled on those code paths.
> > > > As are interrupts, right? And interrupts are disabled on the portions
> > > > of the CPU-hotplug code where RCU is not watching, if I recall correctly.
> > >
> > > Interrupts are off in the parts of the entry/exit that RCU considers
> > > to be user mode. We can get various faults, although these should be
> > > either NMI-like or events that genuinely or effectively happened in
> > > user mode.
> >
> > Fair enough!
> >
> > > > A nohz_full CPU does not enable the scheduling-clock interrupt upon
> > > > entry to the kernel. Normally, this is fine because that CPU will very
> > > > quickly exit back to nohz_full userspace execution, so that RCU will
> > > > see the quiescent state, either by sampling it directly or by deducing
> > > > the CPU's passage through that quiescent state by comparing with state
> > > > that was captured earlier. The grace-period kthread notices the lack
> > > > of a quiescent state and will eventually set ->rcu_urgent_qs to
> > > > trigger this code.
> > > >
> > > > But if the nohz_full CPU stays in the kernel for an extended time,
> > > > perhaps due to OOM handling or due to processing of some huge I/O that
> > > > hits in-memory buffers/cache, then RCU needs some way of detecting
> > > > quiescent states on that CPU. This requires the scheduling-clock
> > > > interrupt to be alive and well.
> > > >
> > > > Are there other ways to get this done? But of course! RCU could
> > > > for example use smp_call_function_single() or use workqueues to force
> > > > execution onto that CPU and enable the tick that way. This gets a
> > > > little involved in order to avoid deadlock, but if the added check
> > > > in rcu_nmi_enter() is causing trouble, something can be arranged.
> > > > Though that something would cause more latency excursions than
> > > > does the current code.
> > > >
> > > > Or did you have something else in mind?
> > >
> > > I'm trying to understand when we actually need to call the function.
> > > Is it just the scheduling interrupt that's supposed to call
> > > rcu_irq_enter()? But the scheduling interrupt is off, so I'm
> > > confused.
> >
> > The scheduling-clock interrupt is indeed off, but if execution remains
> > in the kernel for an extended time period, this becomes a problem.
> > RCU quiescent states don't happen, or if they do, they are not reported
> > to RCU. Grace periods never end, and the system eventually OOMs.
> >
> > And it is not all that hard to make a CPU stay in the kernel for minutes
> > at a time on a large system.
> >
> > So what happens is that if RCU notices that a given CPU has not responded
> > in a reasonable time period, it sets that CPU's ->rcu_urgent_qs. This
> > flag plays various roles in various configurations, but on nohz_full CPUs
> > it causes that CPU's next rcu_nmi_enter() invocation to turn that CPU's
> > tick on. It also sets that CPU's ->rcu_forced_tick flag, which prevents
> > redundant turning on of the tick and also causes the quiescent-state
> > detection code to turn off the tick for this CPU.
> >
> > As you say, the scheduling-clock tick cannot turn itself on, but
> > there might be other interrupts, exceptions, and so on that could.
> > And if nothing like that happens (as might well be the case on a
> > well-isolated CPU), RCU will eventually force one. But it waits a few
> > hundred milliseconds in order to take advantage of whatever naturally
> > occurring interrupt might appear in the meantime.
> >
> > Does that help?
>
> Yes, I think. Could this go in a comment in the new function?
Even if we don't go with the new function, evidence indicates that this
commentary should go somewhere. ;-)
Thanx, Paul