Re: [PATCH RFC nohz_full 6/7] nohz_full: Add full-system-idle state machine

From: Paul E. McKenney
Date: Wed Jul 17 2013 - 20:41:54 EST


On Thu, Jul 18, 2013 at 01:31:21AM +0200, Frederic Weisbecker wrote:
> On Mon, Jul 08, 2013 at 06:30:05PM -0700, Paul E. McKenney wrote:
> > }
> >
> > /*
> > + * Unconditionally force exit from full system-idle state. This is
> > + * invoked when a normal CPU exits idle, but must be called separately
> > + * for the timekeeping CPU (tick_do_timer_cpu). The reason for this
> > + * is that the timekeeping CPU is permitted to take scheduling-clock
> > + * interrupts while the system is in system-idle state, and of course
> > + * rcu_sysidle_exit() has no way of distinguishing a scheduling-clock
> > + * interrupt from any other type of interrupt.
> > + */
> > +void rcu_sysidle_force_exit(void)
> > +{
> > +	int oldstate = ACCESS_ONCE(full_sysidle_state);
> > +	int newoldstate;
> > +
> > +	/*
> > +	 * Each pass through the following loop attempts to exit full
> > +	 * system-idle state.  If contention proves to be a problem,
> > +	 * a trylock-based contention tree could be used here.
> > +	 */
> > +	while (oldstate > RCU_SYSIDLE_SHORT) {
>
> I'm missing a key here.
>
> Let's imagine that the timekeeper has finally set full_sysidle_state = RCU_SYSIDLE_FULL_NOTED
> with cmpxchg, what guarantees that this CPU is not seeing a stale RCU_SYSIDLE_SHORT value
> for example?

Good question! Let's see if I have a reasonable answer. ;-)

I am going to start with the large-CPU case, so that the state is advanced
only by the grace-period kthread.

1. Case 1: the timekeeper CPU invoked rcu_sysidle_force_exit().
In this case, this is the same CPU that set full_sysidle_state
to RCU_SYSIDLE_FULL_NOTED, so it is guaranteed not to see a
stale value.

2. Case 2: Some CPU came out of idle, and invoked rcu_sysidle_exit().
In this case, if this CPU reads RCU_SYSIDLE_SHORT from
full_sysidle_state, this read must have come before the
cmpxchg() (on some other CPU) that set it to RCU_SYSIDLE_LONG.
Because this CPU's read from full_sysidle_state was preceded by
an atomic_inc() that updated this CPU's ->dynticks_idle, that
update must also precede the cmpxchg() that set full_sysidle_state
to RCU_SYSIDLE_LONG. Because the state advancing is done from
within a single thread, the subsequent scan is guaranteed to see
the first CPU's update of ->dynticks_idle, and will therefore
refrain from advancing full_sysidle_state to RCU_SYSIDLE_FULL.

This will in turn prevent the timekeeping thread from advancing
the state to RCU_SYSIDLE_FULL_NOTED, so this scenario cannot
happen.
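The ordering in case 2 can be sketched as a user-space model.  To be
clear, this is a simplified illustration of the argument, not the kernel
code: the names mirror the patch's, but the bodies, the NR_CPUS value,
and the even/odd counter convention are stand-ins, and C11 seq_cst
atomics stand in for atomic_inc() plus its barriers:

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { RCU_SYSIDLE_NOT, RCU_SYSIDLE_SHORT, RCU_SYSIDLE_LONG, RCU_SYSIDLE_FULL };

#define NR_CPUS 4

static atomic_int full_sysidle_state = RCU_SYSIDLE_SHORT;
/* Even value: CPU is idle; odd value: CPU is non-idle (as ->dynticks_idle). */
static atomic_int dynticks_idle[NR_CPUS];

/* A CPU exits idle: the counter update is ordered before the state read. */
static int rcu_sysidle_exit_model(int cpu)
{
	atomic_fetch_add(&dynticks_idle[cpu], 1);	/* now odd: non-idle */
	/* seq_cst: the increment is globally visible before this load */
	return atomic_load(&full_sysidle_state);
}

/*
 * Model of the GP kthread's scan: advance the state only if every CPU's
 * counter is still even (idle).  Because this runs in a single thread,
 * a scan after the cmpxchg to RCU_SYSIDLE_LONG must observe the
 * increment above, and so refrains from advancing toward FULL.
 */
static bool try_advance(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (atomic_load(&dynticks_idle[cpu]) & 1)
			return false;			/* someone is awake */
	int s = atomic_load(&full_sysidle_state);
	if (s < RCU_SYSIDLE_FULL)
		atomic_store(&full_sysidle_state, s + 1);
	return true;
}
```

In this model, once any CPU has run rcu_sysidle_exit_model(), a
subsequent try_advance() bails out, so the state can never reach
RCU_SYSIDLE_FULL behind that CPU's back.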

Unfortunately, the reasoning in #2 above does not hold in the small-CPU
case because there is the possibility of both the timekeeping CPU and
the RCU grace-period kthread concurrently advancing the state machine.
This would be bad; good catch!!!

The patch below (untested) is an attempt to fix this. If it actually
works, I will merge it in with 6/7.

Anything else I missed? ;-)

Thanx, Paul

------------------------------------------------------------------------

Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index aa3f525..fe83085 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1364,7 +1364,7 @@ int rcu_gp_fqs(struct rcu_state *rsp, int fqs_state_in)
 		}
 		force_qs_rnp(rsp, dyntick_save_progress_counter,
 			     &isidle, &maxj);
-		rcu_sysidle_report(rsp, isidle, maxj);
+		rcu_sysidle_report_gp(rsp, isidle, maxj);
 		fqs_state = RCU_FORCE_QS;
 	} else {
 		/* Handle dyntick-idle and offline CPUs. */
diff --git a/kernel/rcutree.h b/kernel/rcutree.h
index 1602c21..657b415 100644
--- a/kernel/rcutree.h
+++ b/kernel/rcutree.h
@@ -559,8 +559,8 @@ static void rcu_sysidle_check_cpu(struct rcu_data *rdp, bool *isidle,
 				  unsigned long *maxj);
 static bool is_sysidle_rcu_state(struct rcu_state *rsp);
 static void rcu_bind_gp_kthread(void);
-static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
-			       unsigned long maxj);
+static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
+				  unsigned long maxj);
 static void rcu_sysidle_init_percpu_data(struct rcu_dynticks *rdtp);
 
 #endif /* #ifndef RCU_TREE_NONCORE */
diff --git a/kernel/rcutree_plugin.h b/kernel/rcutree_plugin.h
index a4d44c3..f65d9c2 100644
--- a/kernel/rcutree_plugin.h
+++ b/kernel/rcutree_plugin.h
@@ -2655,14 +2655,22 @@ static void rcu_sysidle_cancel(void)
  * scan of the CPUs' dyntick-idle state.
  */
 static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
-			       unsigned long maxj)
+			       unsigned long maxj, bool gpkt)
 {
 	if (rsp != rcu_sysidle_state)
 		return; /* Wrong flavor, ignore. */
-	if (isidle)
-		rcu_sysidle(maxj);    /* More idle! */
-	else
+	if (isidle) {
+		if (gpkt && nr_cpu_ids > RCU_SYSIDLE_SMALL)
+			rcu_sysidle(maxj);    /* More idle! */
+	} else {
 		rcu_sysidle_cancel(); /* Idle is over. */
+	}
+}
+
+static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
+				  unsigned long maxj)
+{
+	rcu_sysidle_report(rsp, isidle, maxj, true);
 }
 
 /* Callback and function for forcing an RCU grace period. */
@@ -2713,7 +2721,8 @@ bool rcu_sys_is_idle(void)
 			if (!isidle)
 				break;
 		}
-		rcu_sysidle_report(rcu_sysidle_state, isidle, maxj);
+		rcu_sysidle_report(rcu_sysidle_state,
+				   isidle, maxj, false);
 		oldrss = rss;
 		rss = ACCESS_ONCE(full_sysidle_state);
 	}
@@ -2776,8 +2785,8 @@ static void rcu_bind_gp_kthread(void)
 {
 }
 
-static void rcu_sysidle_report(struct rcu_state *rsp, int isidle,
-			       unsigned long maxj)
+static void rcu_sysidle_report_gp(struct rcu_state *rsp, int isidle,
+				  unsigned long maxj)
 {
 }

