Re: [RFC PATCH 1/1] kernel/rcu/tree.c: simplify force_quiescent_state()

From: Paul E. McKenney
Date: Tue Jun 17 2014 - 13:10:17 EST


On Tue, Jun 17, 2014 at 12:01:28PM -0400, Romanov Arya wrote:
> On Tue, Jun 17, 2014 at 10:54 AM, Paul E. McKenney
> <paulmck@xxxxxxxxxxxxxxxxxx> wrote:
> > On Mon, Jun 16, 2014 at 10:55:29PM -0400, Pranith Kumar wrote:
> >> This might sound really naive, but please bear with me.
> >>
> >> force_quiescent_state() used to do a lot of things in the past in addition to
> >> forcing a quiescent state. (In my reading of the mailing list I found state
> >> transitions for one).
> >>
> >> Now according to the code, what is being done is multiple callers try to go up
> >> the hierarchy of nodes to see who reaches the root node. The caller reaching the
> >> root node wins and it acquires root node lock and it gets to set rsp->gp_flags!
> >>
> >> At each level of the hierarchy we try to acquire fqslock. This is the only place
> >> which actually uses fqslock.
> >>
> >> I guess this was being done to avoid the contention on fqslock, but all we are
> >> doing here is setting one flag. This way of acquiring locks might reduce
> >> contention if every update is trying to do some independent work, but here all
> >> we are doing is setting the same flag with same value.
> >
> > Actually, to reduce contention on rnp_root->lock.
> >
> > The trick is that the "losers" at each level of ->fqslock acquisition go
> > away. The "winner" ends up doing the real work of setting RCU_GP_FLAG_FQS.
> >
> >> We can also remove fqslock completely if we do not need this. Also using
> >> cmpxchg() to set the value of the flag looks like a good idea to avoid taking
> >> the root node lock. Thoughts?
> >
> > The ->fqslock funnel was needed to avoid lockups on large systems (many
> > hundreds or even thousands of CPUs). Moving grace-period responsibilities
> > from softirq to the grace-period kthreads might have reduced contention
> > sufficienty to make the ->fqslock funnel unnecessary. However, given
> > that I don't usually have access to such a large system, I will leave it,
> > at least for the time being.
>
> Sounds like a good case study for using the newly introduced MCS based
> locks(qspinlock.h).
> Waiman, Peter?

No. Absolutely not.

Any exclusive lock, MCS or otherwise, will waste a huge amount of time
in this case.

> Btw, is doing the following a bad idea? It reduces contention on
> rnp_root->lock using fqslock
> which seems to be the lock which needs to be taken while forcing a
> quiescent state:
>
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index f1ba773..f5a0e7e 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2401,34 +2401,24 @@ static void force_quiescent_state(struct rcu_state *rsp)
> unsigned long flags;
> bool ret;
> struct rcu_node *rnp;
> - struct rcu_node *rnp_old = NULL;
> -
> - /* Funnel through hierarchy to reduce memory contention. */
> - rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
> - for (; rnp != NULL; rnp = rnp->parent) {
> - ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
> - !raw_spin_trylock(&rnp->fqslock);
> - if (rnp_old != NULL)
> - raw_spin_unlock(&rnp_old->fqslock);
> - if (ret) {
> - ACCESS_ONCE(rsp->n_force_qs_lh)++;
> - return;
> - }
> - rnp_old = rnp;
> + struct rcu_node *rnp_root = rcu_get_root(rsp);
> +
> + if (!raw_spin_trylock(rnp_root->fqslock)) {

This will be an epic fail on huge systems because it can result in massive
memory contention. And yes, I have recently reduced the probability of
large numbers of concurrent calls to force_quiescent_state(), but this
situation really can still happen, for example, if a bunch of CPUs are
doing lots of concurrent call_rcu() invocations.

Pranith, Romanov: You do -not-, repeat -not-, get to shoot from the hip
with this code. You absolutely need to understand what it is doing and
why before you try hacking on it. Otherwise, all that will happen is
that you will come up with additional creative ways to break RCU. Your
commit log and comments will need to clearly indicate that you understand
what happens if (say) 4096 CPUs all call force_quiescent_state() at the
same time.

Thanx, Paul

> + ACCESS_ONCE(rsp->n_force_qs_lh)++;
> + return; /* Someone is already trying to force */
> }
> - /* rnp_old == rcu_get_root(rsp), rnp == NULL. */
>
> - /* Reached the root of the rcu_node tree, acquire lock. */
> - raw_spin_lock_irqsave(&rnp_old->lock, flags);
> - smp_mb__after_unlock_lock();
> - raw_spin_unlock(&rnp_old->fqslock);
> if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
> ACCESS_ONCE(rsp->n_force_qs_lh)++;
> - raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
> + raw_spin_unlock(rnp_root->fqslock);
> return; /* Someone beat us to it. */
> }
> +
> + /* Reached the root of the rcu_node tree, acquire lock. */
> + raw_spin_lock_irqsave(&rnp_root->lock, flags);
> + smp_mb__after_unlock_lock();
> ACCESS_ONCE(rsp->gp_flags) |= RCU_GP_FLAG_FQS;
> - raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
> + raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
> wake_up(&rsp->gp_wq); /* Memory barrier implied by wake_up() path. */
> }
>
> Regards,
> Romanov
>
> >
> > But you might be interested in thinking through what else would need to
> > change in order to make cmpxchg() work. ;-)
> >
> > Thanx, Paul
> >
> >> Signed-off-by: Pranith Kumar <bobby.prani@xxxxxxxxx>
> >> ---
> >> kernel/rcu/tree.c | 35 +++++++++++++----------------------
> >> 1 file changed, 13 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> >> index f1ba773..9a46f32 100644
> >> --- a/kernel/rcu/tree.c
> >> +++ b/kernel/rcu/tree.c
> >> @@ -2399,36 +2399,27 @@ static void force_qs_rnp(struct rcu_state *rsp,
> >> static void force_quiescent_state(struct rcu_state *rsp)
> >> {
> >> unsigned long flags;
> >> - bool ret;
> >> - struct rcu_node *rnp;
> >> - struct rcu_node *rnp_old = NULL;
> >> -
> >> - /* Funnel through hierarchy to reduce memory contention. */
> >> - rnp = per_cpu_ptr(rsp->rda, raw_smp_processor_id())->mynode;
> >> - for (; rnp != NULL; rnp = rnp->parent) {
> >> - ret = (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) ||
> >> - !raw_spin_trylock(&rnp->fqslock);
> >> - if (rnp_old != NULL)
> >> - raw_spin_unlock(&rnp_old->fqslock);
> >> - if (ret) {
> >> - ACCESS_ONCE(rsp->n_force_qs_lh)++;
> >> - return;
> >> - }
> >> - rnp_old = rnp;
> >> + struct rcu_node *rnp_root = rcu_get_root(rsp);
> >> +
> >> + /* early test to see if someone already forced a quiescent state
> >> + */
> >> + if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
> >> + ACCESS_ONCE(rsp->n_force_qs_lh)++;
> >> + return; /* Someone beat us to it. */
> >> }
> >> - /* rnp_old == rcu_get_root(rsp), rnp == NULL. */
> >>
> >> /* Reached the root of the rcu_node tree, acquire lock. */
> >> - raw_spin_lock_irqsave(&rnp_old->lock, flags);
> >> + raw_spin_lock_irqsave(&rnp_root->lock, flags);
> >> smp_mb__after_unlock_lock();
> >> - raw_spin_unlock(&rnp_old->fqslock);
> >> if (ACCESS_ONCE(rsp->gp_flags) & RCU_GP_FLAG_FQS) {
> >> ACCESS_ONCE(rsp->n_force_qs_lh)++;
> >> - raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
> >> - return; /* Someone beat us to it. */
> >> + raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
> >> + return; /* Someone actually beat us to it. */
> >> }
> >> +
> >> + /* can we use cmpxchg instead of the above lock? */
> >> ACCESS_ONCE(rsp->gp_flags) |= RCU_GP_FLAG_FQS;
> >> - raw_spin_unlock_irqrestore(&rnp_old->lock, flags);
> >> + raw_spin_unlock_irqrestore(&rnp_root->lock, flags);
> >> wake_up(&rsp->gp_wq); /* Memory barrier implied by wake_up() path. */
> >> }
> >>
> >> --
> >> 1.9.1
> >>
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
>

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/