Re: [RFC 2/2] rcu: Remove ->dynticks_nmi_nesting from struct rcu_dynticks
From: Paul E. McKenney
Date: Wed Jun 20 2018 - 12:47:14 EST
On Thu, Jun 21, 2018 at 01:05:22AM +0900, Byungchul Park wrote:
> On Wed, Jun 20, 2018 at 11:58 PM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > On Wed, Jun 20, 2018 at 05:47:20PM +0900, Byungchul Park wrote:
> >> Hello folks,
> >>
> >> I'm careful in saying that ->dynticks_nmi_nesting can be removed but I
> >> think it's possible since the only thing we are interested in with
> >> regard to ->dynticks_nesting or ->dynticks_nmi_nesting is whether rcu is
> >> idle or not.
> >
> > Please keep in mind that NMIs cannot be masked, which means that the
> > rcu_nmi_enter() and rcu_nmi_exit() pair can be invoked at any point in
> > the process, between any consecutive pair of instructions. The saving
And yes, I should have looked at this patch more closely before replying.
But please see below.
> I believe I understand what NMI is and why you introduced
> ->dynticks_nmi_nesting. Or am I missing something?
Perhaps the fact that there are architectures that can enter interrupt
handlers and never leave them when the CPU is non-idle. One example of
this is the usermode upcalls in the comment that you removed.
Or have all the architectures been modified so that each and every call to
rcu_irq_enter() and to rcu_irq_exit() are now properly paired and nested?
Proper nesting and pairing was -not- present in the past, hence the
special updates (AKA "crowbar") to the counters when transitioning to
and from idle.
If proper nesting and pairing of rcu_irq_enter() and rcu_irq_exit()
is now fully in force across all architectures and configurations, the
commit log needs to say this, preferably pointing to the corresponding
commits that made this change.
> > grace is that these two functions restore state, but you cannot make them.
> > After all, NMI does stand for non-maskable interrupt.
>
> Excuse me, but I think I've considered that all. Could you show me
> what problem can happen with this?
There is a call to rcu_irq_enter() without a corresponding call to
rcu_irq_exit(), so that the ->dynticks_nesting counter never goes back to
zero so that the next time this CPU goes idle, RCU thinks that the CPU
is still non-idle. This can result in excessively long grace periods
and needless IPIs to idle CPUs.
> These two functions, rcu_nmi_enter() and rcu_nmi_exit(), would still
> save and restore the state with ->dynticks_nesting.
As far as I know, rcu_nmi_enter() and rcu_nmi_exit() -are- properly
paired and nested across all architectures and configurations, so yes,
they do act more naturally.
> Even though I made ->dynticks_nesting shared between NMI and
> other contexts entering or exiting eqs, I believe it's not a problem
> because anyway the variable would be updated and finally restored
> in a *nested* manner.
But only if calls to rcu_irq_enter() and rcu_irq_exit() are now always
properly paired and nested, which was definitely -not- the case last
I looked.
> > At first glance, the code below does -not- take this into account.
>
> Excuse me, but could explain it more? I don't understand your point :(
>
> > What am I missing that would make this change safe? (You would need to
> > convince both me and Andy Lutomirski, who I have added on CC.)
>
> Thanks for doing that.
>
> >> And I'm also afraid if the assumption is correct for every archs which I
> >> based on, that is, an assignment operation on a single aligned word is
> >> atomic in terms of instruction.
> >
> > The key point is that both ->dynticks_nesting and ->dynticks_nmi_nesting
> > are accessed only by the corresponding CPU (other than debug prints).
> > Load and store tearing should therefore not be a problem. Are there
>
> Right. But I thought it can be a problem between NMI and other contexts
> because I made ->dynticks_nesting shared between NMI and others.
>
> > other reasons to promote to READ_ONCE() and WRITE_ONCE()? If there are,
> > a separate patch doing that promotion would be good.
>
> But the promotion is meaningless without making ->dynticks_nesting
> shared as you said. I'm afraid it's too dependent on this patch to
> separate it.
>
> I'm sorry I don't understand your point. It would be very appreciated if
> you explain it more about what I'm missing or your point :(
OK, so I can further consider this pair of patches only if
all architectures now properly pair and nest rcu_irq_enter() and
rcu_irq_exit(). It would be very good if they did, but actually testing
back in the day showed that they did not. If that has changed, that
would be a very good thing, but if not, this patch injects bugs.
Thanx, Paul
> >> Thoughs?
> >>
> >> ----->8-----
> >> >From 84970b33eb06c3bb1bebbb1754db405c0fc50fbe Mon Sep 17 00:00:00 2001
> >> From: Byungchul Park <byungchul.park@xxxxxxx>
> >> Date: Wed, 20 Jun 2018 16:01:20 +0900
> >> Subject: [RFC 2/2] rcu: Remove ->dynticks_nmi_nesting from struct rcu_dynticks
> >>
> >> The only thing we are interested in with regard to ->dynticks_nesting or
> >> ->dynticks_nmi_nesting is whether rcu is idle or not, which can be
> >> handled only using ->dynticks_nesting though. ->dynticks_nmi_nesting is
> >> unnecessary but to make the code more complicated.
> >>
> >> This patch makes both rcu_eqs_{enter,exit}() and rcu_nmi_{enter,exit}()
> >> count up and down a single variable, ->dynticks_nesting to keep how many
> >> rcu non-idle sections have been nested.
> >>
> >> As a result, no matter who made the variable be non-zero, it's anyway
> >> non-idle, and it can be considered as just having been idle once the
> >> variable is equal to zero. So tricky code can be removed.
> >>
> >> In addition, it was assumed that an assignment operation on a single
> >> aligned word is atomic so that ->dynticks_nesting can be safely assigned
> >> within both nmi context and others concurrently.
> >>
> >> Signed-off-by: Byungchul Park <byungchul.park@xxxxxxx>
> >> ---
> >> kernel/rcu/tree.c | 76 ++++++++++++++++++++----------------------------
> >> kernel/rcu/tree.h | 4 +--
> >> kernel/rcu/tree_plugin.h | 4 +--
> >> 3 files changed, 35 insertions(+), 49 deletions(-)
> >>
> >> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> index 59ae94e..61f203a 100644
> >> --- a/kernel/rcu/tree.c
> >> +++ b/kernel/rcu/tree.c
> >> @@ -260,7 +260,6 @@ void rcu_bh_qs(void)
> >>
> >> static DEFINE_PER_CPU(struct rcu_dynticks, rcu_dynticks) = {
> >> .dynticks_nesting = 1,
> >> - .dynticks_nmi_nesting = DYNTICK_IRQ_NONIDLE,
> >> .dynticks = ATOMIC_INIT(RCU_DYNTICK_CTRL_CTR),
> >> };
> >>
> >> @@ -694,10 +693,6 @@ static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
> >> /*
> >> * Enter an RCU extended quiescent state, which can be either the
> >> * idle loop or adaptive-tickless usermode execution.
> >> - *
> >> - * We crowbar the ->dynticks_nmi_nesting field to zero to allow for
> >> - * the possibility of usermode upcalls having messed up our count
> >> - * of interrupt nesting level during the prior busy period.
> >> */
> >> static void rcu_eqs_enter(bool user)
> >> {
> >> @@ -706,11 +701,11 @@ static void rcu_eqs_enter(bool user)
> >> struct rcu_dynticks *rdtp;
> >>
> >> rdtp = this_cpu_ptr(&rcu_dynticks);
> >> - WRITE_ONCE(rdtp->dynticks_nmi_nesting, 0);
> >> WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
> >> rdtp->dynticks_nesting == 0);
> >> if (rdtp->dynticks_nesting != 1) {
> >> - rdtp->dynticks_nesting--;
> >> + WRITE_ONCE(rdtp->dynticks_nesting, /* No store tearing. */
> >> + rdtp->dynticks_nesting - 1);
> >> return;
> >> }
> >>
> >> @@ -767,7 +762,7 @@ void rcu_user_enter(void)
> >> * rcu_nmi_exit - inform RCU of exit from NMI context
> >> *
> >> * If we are returning from the outermost NMI handler that interrupted an
> >> - * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nmi_nesting
> >> + * RCU-idle period, update rdtp->dynticks and rdtp->dynticks_nesting
> >> * to let the RCU grace-period handling know that the CPU is back to
> >> * being RCU-idle.
> >> *
> >> @@ -779,21 +774,21 @@ void rcu_nmi_exit(void)
> >> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >>
> >> /*
> >> - * Check for ->dynticks_nmi_nesting underflow and bad ->dynticks.
> >> + * Check for ->dynticks_nesting underflow and bad ->dynticks.
> >> * (We are exiting an NMI handler, so RCU better be paying attention
> >> * to us!)
> >> */
> >> - WARN_ON_ONCE(rdtp->dynticks_nmi_nesting <= 0);
> >> + WARN_ON_ONCE(rdtp->dynticks_nesting <= 0);
> >> WARN_ON_ONCE(rcu_dynticks_curr_cpu_in_eqs());
> >>
> >> /*
> >> * If the nesting level is not 1, the CPU wasn't RCU-idle, so
> >> * leave it in non-RCU-idle state.
> >> */
> >> - if (rdtp->dynticks_nmi_nesting != 1) {
> >> - trace_rcu_dyntick(TPS("--="), rdtp->dynticks_nmi_nesting, rdtp->dynticks_nmi_nesting - 2, rdtp->dynticks);
> >> - WRITE_ONCE(rdtp->dynticks_nmi_nesting, /* No store tearing. */
> >> - rdtp->dynticks_nmi_nesting - 2);
> >> + if (rdtp->dynticks_nesting != 1) {
> >> + trace_rcu_dyntick(TPS("--="), rdtp->dynticks_nesting, rdtp->dynticks_nesting - 1, rdtp->dynticks);
> >> + WRITE_ONCE(rdtp->dynticks_nesting, /* No store tearing. */
> >> + rdtp->dynticks_nesting - 1);
> >> return;
> >> }
> >>
> >> @@ -803,8 +798,8 @@ void rcu_nmi_exit(void)
> >> }
> >>
> >> /* This NMI interrupted an RCU-idle CPU, restore RCU-idleness. */
> >> - trace_rcu_dyntick(TPS("Startirq"), rdtp->dynticks_nmi_nesting, 0, rdtp->dynticks);
> >> - WRITE_ONCE(rdtp->dynticks_nmi_nesting, 0); /* Avoid store tearing. */
> >> + trace_rcu_dyntick(TPS("Startirq"), rdtp->dynticks_nesting, 0, rdtp->dynticks);
> >> + WRITE_ONCE(rdtp->dynticks_nesting, 0); /* Avoid store tearing. */
> >> rcu_dynticks_eqs_enter();
> >> }
> >>
> >> @@ -851,10 +846,6 @@ void rcu_irq_exit_irqson(void)
> >> /*
> >> * Exit an RCU extended quiescent state, which can be either the
> >> * idle loop or adaptive-tickless usermode execution.
> >> - *
> >> - * We crowbar the ->dynticks_nmi_nesting field to DYNTICK_IRQ_NONIDLE to
> >> - * allow for the possibility of usermode upcalls messing up our count of
> >> - * interrupt nesting level during the busy period that is just now starting.
> >> */
> >> static void rcu_eqs_exit(bool user)
> >> {
> >> @@ -866,7 +857,8 @@ static void rcu_eqs_exit(bool user)
> >> oldval = rdtp->dynticks_nesting;
> >> WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && oldval < 0);
> >> if (oldval) {
> >> - rdtp->dynticks_nesting++;
> >> + WRITE_ONCE(rdtp->dynticks_nesting, /* No store tearing. */
> >> + rdtp->dynticks_nesting + 1);
> >> return;
> >> }
> >> rcu_dynticks_task_exit();
> >> @@ -875,7 +867,6 @@ static void rcu_eqs_exit(bool user)
> >> trace_rcu_dyntick(TPS("End"), rdtp->dynticks_nesting, 1, rdtp->dynticks);
> >> WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && !user && !is_idle_task(current));
> >> WRITE_ONCE(rdtp->dynticks_nesting, 1);
> >> - WRITE_ONCE(rdtp->dynticks_nmi_nesting, DYNTICK_IRQ_NONIDLE);
> >> }
> >>
> >> /**
> >> @@ -915,11 +906,11 @@ void rcu_user_exit(void)
> >> /**
> >> * rcu_nmi_enter - inform RCU of entry to NMI context
> >> *
> >> - * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks and
> >> - * rdtp->dynticks_nmi_nesting to let the RCU grace-period handling know
> >> - * that the CPU is active. This implementation permits nested NMIs, as
> >> - * long as the nesting level does not overflow an int. (You will probably
> >> - * run out of stack space first.)
> >> + * If the CPU was idle from RCU's viewpoint, update rdtp->dynticks. And
> >> + * then update rdtp->dynticks_nesting to let the RCU grace-period handling
> >> + * know that the CPU is active. This implementation permits nested NMIs,
> >> + * as long as the nesting level does not overflow an int. (You will
> >> + * probably run out of stack space first.)
> >> *
> >> * If you add or remove a call to rcu_nmi_enter(), be sure to test
> >> * with CONFIG_RCU_EQS_DEBUG=y.
> >> @@ -927,33 +918,29 @@ void rcu_user_exit(void)
> >> void rcu_nmi_enter(void)
> >> {
> >> struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
> >> - long incby = 2;
> >>
> >> /* Complain about underflow. */
> >> - WARN_ON_ONCE(rdtp->dynticks_nmi_nesting < 0);
> >> + WARN_ON_ONCE(rdtp->dynticks_nesting < 0);
> >>
> >> - /*
> >> - * If idle from RCU viewpoint, atomically increment ->dynticks
> >> - * to mark non-idle and increment ->dynticks_nmi_nesting by one.
> >> - * Otherwise, increment ->dynticks_nmi_nesting by two. This means
> >> - * if ->dynticks_nmi_nesting is equal to one, we are guaranteed
> >> - * to be in the outermost NMI handler that interrupted an RCU-idle
> >> - * period (observation due to Andy Lutomirski).
> >> - */
> >> if (rcu_dynticks_curr_cpu_in_eqs()) {
> >> rcu_dynticks_eqs_exit();
> >> - incby = 1;
> >>
> >> if (!in_nmi()) {
> >> rcu_dynticks_task_exit();
> >> rcu_cleanup_after_idle();
> >> }
> >> }
> >> - trace_rcu_dyntick(incby == 1 ? TPS("Endirq") : TPS("++="),
> >> - rdtp->dynticks_nmi_nesting,
> >> - rdtp->dynticks_nmi_nesting + incby, rdtp->dynticks);
> >> - WRITE_ONCE(rdtp->dynticks_nmi_nesting, /* Prevent store tearing. */
> >> - rdtp->dynticks_nmi_nesting + incby);
> >> +
> >> + trace_rcu_dyntick(rdtp->dynticks_nesting ? TPS("++=") : TPS("Endirq"),
> >> + rdtp->dynticks_nesting,
> >> + rdtp->dynticks_nesting + 1, rdtp->dynticks);
> >> + /*
> >> + * If ->dynticks_nesting is equal to one on rcu_nmi_exit(), we are
> >> + * guaranteed to be in the outermost NMI handler that interrupted
> >> + * an RCU-idle period (observation due to Andy Lutomirski).
> >> + */
> >> + WRITE_ONCE(rdtp->dynticks_nesting, /* Prevent store tearing. */
> >> + rdtp->dynticks_nesting + 1);
> >> barrier();
> >> }
> >>
> >> @@ -1089,8 +1076,7 @@ bool rcu_lockdep_current_cpu_online(void)
> >> */
> >> static int rcu_is_cpu_rrupt_from_idle(void)
> >> {
> >> - return __this_cpu_read(rcu_dynticks.dynticks_nesting) <= 0 &&
> >> - __this_cpu_read(rcu_dynticks.dynticks_nmi_nesting) <= 1;
> >> + return __this_cpu_read(rcu_dynticks.dynticks_nesting) <= 1;
> >> }
> >>
> >> /*
> >> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> >> index 4e74df7..071afe4 100644
> >> --- a/kernel/rcu/tree.h
> >> +++ b/kernel/rcu/tree.h
> >> @@ -38,8 +38,8 @@
> >> * Dynticks per-CPU state.
> >> */
> >> struct rcu_dynticks {
> >> - long dynticks_nesting; /* Track process nesting level. */
> >> - long dynticks_nmi_nesting; /* Track irq/NMI nesting level. */
> >> + long dynticks_nesting __attribute__((aligned(sizeof(long))));
> >> + /* Track process nesting level. */
> >> atomic_t dynticks; /* Even value for idle, else odd. */
> >> bool rcu_need_heavy_qs; /* GP old, need heavy quiescent state. */
> >> unsigned long rcu_qs_ctr; /* Light universal quiescent state ctr. */
> >> diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
> >> index c1b17f5..0c57e50 100644
> >> --- a/kernel/rcu/tree_plugin.h
> >> +++ b/kernel/rcu/tree_plugin.h
> >> @@ -1801,7 +1801,7 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
> >> }
> >> print_cpu_stall_fast_no_hz(fast_no_hz, cpu);
> >> delta = rcu_seq_ctr(rdp->mynode->gp_seq - rdp->rcu_iw_gp_seq);
> >> - pr_err("\t%d-%c%c%c%c: (%lu %s) idle=%03x/%ld/%#lx softirq=%u/%u fqs=%ld %s\n",
> >> + pr_err("\t%d-%c%c%c%c: (%lu %s) idle=%03x/%ld softirq=%u/%u fqs=%ld %s\n",
> >> cpu,
> >> "O."[!!cpu_online(cpu)],
> >> "o."[!!(rdp->grpmask & rdp->mynode->qsmaskinit)],
> >> @@ -1811,7 +1811,7 @@ static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
> >> "!."[!delta],
> >> ticks_value, ticks_title,
> >> rcu_dynticks_snap(rdtp) & 0xfff,
> >> - rdtp->dynticks_nesting, rdtp->dynticks_nmi_nesting,
> >> + rdtp->dynticks_nesting,
> >> rdp->softirq_snap, kstat_softirqs_cpu(RCU_SOFTIRQ, cpu),
> >> READ_ONCE(rsp->n_force_qs) - rsp->n_force_qs_gpstart,
> >> fast_no_hz);
> >> --
> >> 1.9.1
> >>
> >
>
>
>
> --
> Thanks,
> Byungchul
>